
Random Phenomena
Fundamentals and Engineering Applications of Probability & Statistics

Babatunde A. Ogunnaike

I frame no hypothesis; for whatever is not deduced from the phenomenon is to be called a hypothesis; and hypotheses, whether metaphysical or physical, whether of occult qualities or mechanical, have no place in experimental philosophy.
Sir Isaac Newton (1642–1727)

In Memoriam

In acknowledgement of the debt of birth I can never repay, I humbly dedicate this book to the memory of my father, my mother, and my statistics mentor at Wisconsin.

Adesijibomi Ogundẹrọ Ogunnaike
1922–2002

Some men fly as eagles free
But few with grace to the same degree
as when you rise upward to fly
to avoid sparrows in a crowded sky

Ayọọla Oluronkẹ Ogunnaike
1931–2005

Some who only search for silver and gold
Soon find what they cannot hold;
You searched after God's own heart,
and left behind, too soon, your pilgrim's chart

William Gordon Hunter
1937–1986

See what ripples great teachers make
With but one inspiring finger
Touching, once, the young mind's lake

Preface

In an age characterized by the democratization of quantification, where data about every conceivable phenomenon is available somewhere and easily accessible from just about anywhere, it is becoming just as important that the educated person also be conversant with how to handle data, and be able to understand what the data say as well as what they don't say. Of course, this has always been true of scientists and engineers: individuals whose profession requires them to be involved in the acquisition, analysis, interpretation and exploitation of data in one form or another; but it is even more so now. Engineers now work in non-traditional areas ranging from molecular biology to finance; physicists work with material scientists and economists; and the problems to be solved continue to widen in scope, becoming more interdisciplinary as the traditional disciplinary boundaries disappear altogether or are being restructured.

In writing this book, I have been particularly cognizant of these basic facts of 21st-century science and engineering. And yet, while most scientists and engineers are well-trained in problem formulation and problem solving when all the entities involved are considered deterministic in character, many remain uncomfortable with problems involving random variations, if such problems cannot be idealized and reduced to the more familiar deterministic types. Even after going through the usual one-semester course in Engineering Statistics, the discomfort persists. Of all the reasons for this circumstance, the most compelling is this: most of these students tend to perceive their training in statistics more as a set of instructions on what to do and how to do it than as training in the fundamental principles of random phenomena. Such students are then uncomfortable when they encounter problems that are not quite similar to those covered in class; they lack the fundamentals needed to attack new and unfamiliar problems. The purpose of this book is to address this issue directly by presenting basic fundamental principles, methods, and tools for formulating and solving engineering problems that involve randomly varying phenomena. The premise is that by emphasizing fundamentals and basic principles, and then illustrating these with examples, the reader will be better equipped to deal with a range of problems wider than that explicitly covered in the book. This important point is expanded further in Chapter 0.


Scope and Organization

Developing a textbook that will achieve the objective stated above poses the usual challenge of balancing breadth and depth: an optimization problem with no unique solution. But there is also the additional constraint that the curriculum in most programs can usually only accommodate a one-semester course in engineering statistics, if space can be found for it at all. As all teachers of this material know, finding a universally acceptable compromise solution is impossible. What this text offers is enough material for a two-semester introductory sequence in probability and statistics for scientists and engineers, and with it, the flexibility of several options for using the material. We envisage the following three categories, for which more detailed recommendations for coverage will be provided shortly:

Category I: The two-semester undergraduate sequence;
Category II: The traditional one-semester undergraduate course;
Category III: The one-semester beginning graduate course.

The material has been tested and refined over more than a decade, in the classroom (at the University of Delaware; at the African University of Science and Technology (AUST) in Abuja, Nigeria; and at the African Institute for Mathematical Sciences (AIMS) in Muizenberg, South Africa), and in short courses presented to industrial participants at DuPont, W. L. Gore, SIEMENS, the Food and Drug Administration (FDA), and many others through the University of Delaware's Engineering Outreach program.

The book is organized into five parts, after a brief prelude in Chapter 0 where the book's organizing principles are expounded. Part I (Chapters 1 and 2) provides foundational material for understanding the fundamental nature of random variability. Part II (Chapters 3–7) focuses on probability. Chapter 3 introduces the fundamentals of probability theory, and Chapters 4 and 5 extend these to the concept of the random variable and its distribution, for the single and the multidimensional random variable. Chapter 6 is devoted to random variable transformations, and Chapter 7 contains the first of a trilogy of case studies, this one devoted to two problems with substantial historical significance.

Part III (Chapters 8–11) is devoted entirely to developing and analyzing probability models for specific random variables. The distinguishing characteristic of the presentation in Chapters 8 and 9, respectively for discrete and continuous random variables, is that each model is developed from underlying phenomenological mechanisms. Chapter 10 introduces the idea of information and entropy as an alternative means of determining appropriate probability models when only partial knowledge is available about the random variable in question. Chapter 11 presents the second case study, on in-vitro fertilization (IVF), as an application of probability models. The chapter illustrates the development, validation, and use of probability modeling on a contemporary problem with significant practical implications.

The core of statistics is presented in Part IV (Chapters 12–20). Chapter 12 lays the foundation with an introduction to the concepts and ideas behind statistics, before the coverage begins in earnest in Chapter 13 with sampling theory, continuing with statistical inference, estimation and hypothesis testing, in Chapters 14 and 15, and regression analysis in Chapter 16. Chapter 17 introduces the important but oft-neglected issue of probability model validation, while Chapter 18, on nonparametric methods, extends the ideas of Chapters 14 and 15 to those cases where the usual probability model assumptions (mostly the normality assumption) are invalid. Chapter 19 presents an overview treatment of design of experiments. The third and final set of case studies is presented in Chapter 20 to illustrate the application of various aspects of statistics to real-life problems.

Part V (Chapters 21–23) showcases the application of probability and statistics with a hand-selected set of special topics: reliability and life testing in Chapter 21, quality assurance and control in Chapter 22, and multivariate analysis in Chapter 23. Each has its roots in probability and statistics, but all have evolved into bona fide subject matters in their own right.

Key Features
Before presenting suggestions for how to cover the material for various audiences, I think it is important to point out some of the key features of the textbook.

1. Approach. This book takes a more fundamental, first-principles approach to the issue of dealing with random variability and uncertainty in engineering problems. As a result, for example, the treatment of probability distributions for random variables (Chapters 8–10) is based on a derivation of each model from phenomenological mechanisms, allowing the reader to see the subterraneous roots from which these probability models sprang. The reader is then able to see, for instance, how the Poisson model arises either as a limiting case of the binomial random variable, or from the phenomenon of observing, in finite-sized intervals of time or space, rare events with low probabilities of occurrence; or how the Gaussian model arises from an accumulation of small random perturbations.
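To give the flavor of one such derivation (a sketch only; the book's own development in Chapter 8 is more complete), the Poisson model can be obtained from the binomial pmf by letting the number of trials n grow large while the success probability shrinks as p = λ/n, so that the mean number of occurrences λ = np stays fixed:

    \lim_{n \to \infty} \binom{n}{x} \left(\frac{\lambda}{n}\right)^{x} \left(1 - \frac{\lambda}{n}\right)^{n-x} = \frac{e^{-\lambda} \lambda^{x}}{x!}, \qquad x = 0, 1, 2, \ldots

The corresponding statement for the Gaussian model is the central limit theorem, under which a sum of many small, independent random perturbations is approximately normally distributed.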
2. Examples and Case Studies. This fundamental approach, noted above, is integrated with practical applications, in the form of a generous number of examples, but also with the inclusion of three chapter-length application case studies, one each for probability, probability distributions, and statistics. In addition to the usual traditional staples, many of the in-chapter examples have been drawn from non-traditional applications in molecular biology (e.g., DNA replication origin distributions; gene expression data, etc.), from finance and business, and from population demographics.

3. Computers, Computer Software, On-line Resources. As expanded upon further in the Appendix, the availability of computers has transformed the teaching and learning of probability and statistics. Statistical software packages are now so widely available that many of what used to be staples of traditional probability and statistics textbooks (tricks for carrying out various computations, approximation techniques, and especially printed statistical tables) are now essentially obsolete. All the examples in this book were carried out with MINITAB, and I fully expect each student and instructor to have access to one such statistical package. In this book, therefore, we depart from tradition and do not include any statistical tables. Instead, we have included in the Appendix a compilation of useful information about some popular software packages, on-line electronic versions of statistical tables, and a few other on-line resources such as on-line electronic statistics handbooks and websites with data sets.
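As a concrete illustration of what replaces the printed tables (a minimal sketch using the freely available SciPy library, not the MINITAB workflow used in the book), the tail areas and critical values a statistical table used to supply can be computed directly:

    import scipy.stats as st

    # Upper-tail area of the standard normal beyond z = 1.96
    # (the value a printed normal table would supply)
    p_tail = 1 - st.norm.cdf(1.96)        # approximately 0.025

    # Two-sided 95% critical value of the t-distribution, 9 degrees of freedom
    t_crit = st.t.ppf(0.975, df=9)        # approximately 2.262

    # Upper 5% critical value of the chi-squared distribution, 9 degrees of freedom
    chi2_crit = st.chi2.ppf(0.95, df=9)   # approximately 16.92

    print(p_tail, t_crit, chi2_crit)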
4. Questions, Exercises, Application Problems, Projects. No one feels truly confident about a subject matter without having tackled (and solved!) some problems; and a useful textbook ought to provide a good selection that offers a broad range of challenges. Here is what is available in this book:

Review Questions: Found at the end of each chapter (with the exception of the chapters on case studies), these are short, specific questions designed to test the reader's basic comprehension. If you can answer all the review questions at the end of each chapter, you know and understand the material; if not, revisit the relevant portion and rectify the revealed deficiency.

Exercises: are designed to provide the opportunity to master the mechanics behind a single concept. Some may therefore be purely mechanical, in the sense of requiring basic computations; some may require filling in the steps deliberately left as an exercise to the reader; some may have the flavor of an application; but the focus is usually a single aspect of a topic covered in the text, or a straightforward extension thereof.

Application Problems: are more substantial practical problems whose solutions usually require integrating various concepts (some obvious, some not) and deploying the appropriate set of tools. Many of these are drawn from the literature and involve real applications and actual data sets. In such cases, the references are provided, and the reader may wish to consult some of them for additional background and perspective, if necessary.

Project assignments: allow deeper exploration of a few selected issues covered in a chapter, mostly as a way of extending the coverage and also to provide opportunities for creativity. By definition, these involve a significant amount of work and also require report-writing. This book offers a total of nine such projects. They are a good way for students to learn how to plan, design, and execute projects, and to develop writing and reporting skills. (Each graduate student that has taken the CHEG 604 and CHEG 867 courses at the University of Delaware has had to do a term project of this type.)
5. Data Sets. All the data sets used in each chapter, whether in the chapter
itself, in an example, or in the exercises or application problems, are made
available on-line and on CD.

Suggested Coverage
Of the three categories mentioned earlier, a methodical coverage of the entire textbook is only possible for Category I, the two-semester undergraduate sequence. For this group, the following is one possible approach to dividing the material up into instruction modules for each semester. (A superscript, as in Chapter 8¹, indicates that only part of the chapter is covered in the semester in question; the complementary portion, Chapter 8², is covered in the other semester.)

First Semester
Module 1 (Foundations): Chapters 0–2.
Module 2 (Probability): Chapters 3, 4, 5 and 7.
Module 3 (Probability Models): Chapter 8¹ (omit detailed derivations and Section 8.7.2), Chapter 9¹ (omit detailed derivations), and Chapter 11¹ (cover Sections 11.4 and 11.5 selectively; omit Section 11.6).
Module 4 (Introduction to Statistics/Sampling): Chapters 12 and 13.
Module 5 (Statistical Inference): Chapter 14¹ (omit Section 14.6), Chapter 15¹ (omit Sections 15.8 and 15.9), Chapter 16¹ (omit Sections 16.4.3, 16.4.4, and 16.5.2), and Chapter 17.
Module 6 (Design of Experiments): Chapter 19¹ (cover Sections 19.3–19.4 lightly; omit Section 19.10) and Chapter 20.

Second Semester
Module 7 (Probability and Models): Chapter 6 (with ad hoc reference to Chapters 4 and 5); Chapters 8² and 9² (include details omitted in the first semester); and Chapter 10.
Module 8 (Statistical Inference): Chapter 14² (Bayesian estimation, Section 14.6), Chapter 15² (Sections 15.8 and 15.9), Chapter 16² (Sections 16.4.3, 16.4.4, and 16.5.2), and Chapter 18.
Module 9 (Applications): Select one of Chapters 21, 22, or 23. (For chemical engineers, and anyone planning to work in the manufacturing industry, I recommend Chapter 22.)

With this as a basic template, other variations can be designed as appropriate. For example, those who can only afford one semester (Category II) may adopt the first-semester suggestion above, to which I recommend adding Chapter 22 at the end.

The beginning graduate one-semester course (Category III) may also be based on the first-semester suggestion above, but with the following additional recommendations: (i) cover all the recommended chapters fully; (ii) add Chapter 23 on multivariate analysis; and (iii) in lieu of a final examination, assign at least one, possibly two, of the nine projects. This will make for a hectic semester, but graduate students should be able to handle the workload.

A second, perhaps more straightforward, recommendation for a two-semester sequence is to devote the first semester to Probability (Chapters 0–11), and the second to Statistics (Chapters 12–20) along with one of the three application chapters.

Acknowledgments
Pulling off a project of this magnitude requires the support and generous assistance of many colleagues, students, and family. Their genuine words of encouragement, and the occasional (innocent and not-so-innocent) inquiry about the status of the book, all contributed to making sure that this potentially endless project was actually finished. At the risk of leaving someone out, I feel some deserve particular mention. I begin with, in alphabetical order, Marc Birtwistle, Ketan Detroja, Claudio Gelmi (Chile), Mary McDonald, Vinay Prasad (Alberta, Canada), Paul Taylor (AIMS, Muizenberg, South Africa), and Carissa Young. These are colleagues, former and current students, and postdocs, who patiently waded through many versions of various chapters, offered invaluable comments, and caught many of the manuscript errors, typographical and otherwise. It is a safe bet that the manuscript still contains a random number of these errors (few and Poisson distributed, I hope!), but whatever errors remain are my responsibility. I encourage readers to let me know of the ones they find.

I wish to thank my University of Delaware colleagues, Antony Beris and especially Dion Vlachos, with whom I often shared the responsibility of teaching CHEG 867 to beginning graduate students. Their insight into what the statistics component of the course should contain was invaluable (as were the occasional Greek lessons!). Of my other colleagues, I want to thank Dennis Williams of Basel, for his interest and comments, and then single out former fellow "DuPonters" Mike Piovoso, whose fingerprint is recognizable on the illustrative example of Chapter 23; Rafi Sela, now a Six-Sigma Master Black Belt; Mike Deaton of James Madison University; and Ron Pearson, whose near-encyclopedic knowledge never ceases to amaze me. Many of the ideas, problems and approaches evident in this book arose from those discussions and collaborations from many years ago. Of my other academic colleagues, I wish to thank Carl Laird of Texas A&M for reading some of the chapters, Joe Qin of USC for various suggestions, and Jim Rawlings of Wisconsin, with whom I have carried on a long-running discussion about probability and estimation because of his own interests and expertise in this area. David Bacon and John MacGregor, pioneers in the application of statistics and probability in chemical engineering, deserve my thanks for their early encouragement about the project and for providing the occasional commentary. I also wish to take this opportunity to acknowledge the influence and encouragement of my chemical engineering mentor, Harmon Ray. I learned more from Harmon than he probably knew he was teaching me. Much of what is in this book carries an echo of his voice and reflects the Wisconsin tradition.

I must not forget my gracious hosts at the École Polytechnique Fédérale de Lausanne (EPFL), Professor Dominique Bonvin (Merci pour tout, mon ami) and Professor Vassily Hatzimanikatis (Efharisto poli, paliofile). Without their generous hospitality during the months from February through July 2009, it is very likely that this project would have dragged on for far longer. I am also grateful to Michael Amrhein of the Laboratoire d'Automatique at EPFL, and his graduate student, Paman Gujral, who both took time to review several chapters and provided additional useful references for Chapter 23. My thanks go to Allison Shatkin and Marsha Pronin of CRC Press/Taylor & Francis for their professionalism in guiding the project through the various phases of the editorial process, all the way to production.
And now to family. Many thanks are due to my sons, Damini and Deji, who have had cause to use statistics at various stages of their (still on-going) education: each read and commented on a selected set of chapters. My youngest son, Makinde, still too young to be a proofreader, was nevertheless solicitous of my progress, especially towards the end. More importantly, however, just by showing up when he did, and how, he confirmed to me, without meaning to, that he is a natural-born Bayesian. Finally, the debt of thanks I owe to my wife, Anna, is difficult to express in a few words of prose. She proofread many of the chapter exercises and problems with an incredible eye, and a sensitive ear for the language. But more than that, she knows well what it means to be a "book widow"; without her forbearance, encouragement, and patience, this project would never have been completed.

Babatunde A. Ogunnaike
Newark, Delaware
Lausanne, Switzerland
April 2009

List of Figures

1.1 Histogram for YA data
1.2 Histogram for YB data
1.3 Histogram of inclusions data
1.4 Histogram for YA data with superimposed theoretical distribution
1.5 Histogram for YB data with superimposed theoretical distribution
1.6 Theoretical probability distribution function for a Poisson random variable with parameter λ = 1.02. Compare with the inclusions data histogram in Fig 1.3
2.1 Schematic diagram of a plug flow reactor (PFR)
2.2 Schematic diagram of a continuous stirred tank reactor (CSTR)
2.3 Instantaneous residence time distribution function for the CSTR (with τ = 5)
3.1 Venn diagram for Example 3.7
3.2 Venn diagram of students in a thermodynamics class
3.3 The role of conditioning Set B in conditional probability
3.4 Representing set A as a union of 2 disjoint sets
3.5 Partitioned sets for generalizing total probability result
4.1 The original sample space, Ω, and the corresponding space V induced by the random variable X
4.2 Probability distribution function, f(x), and cumulative distribution function, F(x), for the 3-coin toss experiment of Example 4.1
4.3 Distribution of a negatively skewed random variable
4.4 Distribution of a positively skewed random variable
4.5 Distributions with reference kurtosis (solid line) and mild kurtosis (dashed line)
4.6 Distributions with reference kurtosis (solid line) and high kurtosis (dashed line)
4.7 The pdf of a continuous random variable X with a mode at x = 1
4.8 The cdf of a continuous random variable X showing the lower and upper quartiles and the median
5.1 Graph of the joint pdf for the 2-dimensional random variable of Example 5.5
5.2 Positively correlated variables: ρ = 0.923
5.3 Negatively correlated variables: ρ = −0.689
5.4 Essentially uncorrelated variables: ρ = 0.085
6.1 Region of interest, VY, for computing the cdf of the random variable Y defined as a sum of 2 independent random variables X1 and X2
6.2 Schematic diagram of the tennis ball launcher of Problem 6.11
9.1 Exponential pdfs for various values of the parameter β
9.2 Gamma pdfs for various values of parameters α and β: Note how with increasing values of α the shape becomes less skewed, and how the breadth of the distribution increases with increasing values of β
9.3 Gamma distribution fit to data on inter-origin distances in the budding yeast S. cerevisiae genome
9.4 Weibull pdfs for various values of the shape and scale parameters: Note how with increasing values of the shape parameter the distribution becomes less skewed, and how the breadth of the distribution increases with increasing values of the scale parameter
9.5 The Herschel-Maxwell 2-dimensional plane
9.6 Gaussian pdfs for various values of parameters μ and σ: Note the symmetric shapes, how the center of the distribution is determined by μ, and how the shape becomes broader with increasing values of σ
9.7 Symmetric tail area probabilities for the standard normal random variable with z = 1.96 and FZ(−1.96) = 0.025 = 1 − FZ(1.96)
9.8 Lognormal pdfs for scale parameter α = 0 and various values of the shape parameter β. Note how the shape changes, becoming less skewed as β becomes smaller
9.9 Lognormal pdfs for shape parameter β = 1 and various values of the scale parameter α. Note how the shape remains unchanged while the entire distribution is scaled appropriately depending on the value of α
9.10 Particle size distribution for the granulation process product: a lognormal distribution with α = 6.8, β = 0.5. The shaded area corresponds to product meeting quality specifications, 350 < X < 1650 microns
9.11 Unimodal Beta pdfs when α > 1; β > 1: Note the symmetric shape when α = β, and the skewness determined by the value of α relative to β
9.12 U-shaped Beta pdfs when α < 1; β < 1
9.13 Other shapes of the Beta pdfs: It is J-shaped when (α − 1)(β − 1) < 0 and a straight line when α = 2; β = 1
9.14 Theoretical distribution for characterizing fractional microarray intensities for the example gene: The shaded area corresponds to the probability that the gene in question is upregulated
9.15 Two uniform distributions over different ranges (0,1) and (2,10). Since the total area under the pdf must be 1, the narrower pdf is proportionately taller than the wider one
9.16 Two F distribution plots for different values of ν1, the first degree of freedom, but the same value for ν2. Note how the mode shifts to the right as ν1 increases
9.17 Three t-distribution plots for degrees of freedom values ν = 5, 10, 100. Note the symmetrical shape and the heavier tail for smaller values of ν
9.18 A comparison of the t-distribution with ν = 5 with the standard normal N(0, 1) distribution. Note the similarity as well as the t-distribution's comparatively heavier tail
9.19 A comparison of the t-distribution with ν = 50 with the standard normal N(0, 1) distribution. The two distributions are practically indistinguishable
9.20 A comparison of the standard Cauchy distribution with the standard normal N(0, 1) distribution. Note the general similarities as well as the Cauchy distribution's substantially heavier tail
9.21 Common probability distributions and connections among them
10.1 The entropy function of a Bernoulli random variable
11.1 Elsner data versus binomial model prediction
11.2 Elsner data ("Younger" set) versus binomial model prediction
11.3 Elsner data ("Older" set) versus binomial model prediction
11.4 Elsner data ("Younger" set) versus stratified binomial model prediction
11.5 Elsner data ("Older" set) versus stratified binomial model prediction
11.6 Complete Elsner data versus stratified binomial model prediction
11.7 Optimum number of embryos as a function of p
11.8 Surface plot of the probability of a singleton as a function of p and the number of embryos transferred, n
11.9 The (maximized) probability of a singleton as a function of p when the optimum integer number of embryos are transferred
11.10 Surface plot of the probability of no live birth as a function of p and the number of embryos transferred, n
11.11 Surface plot of the probability of multiple births as a function of p and the number of embryos transferred, n
11.12 IVF treatment outcome probabilities for "good prognosis" patients with p = 0.5, as a function of n, the number of embryos transferred
11.13 IVF treatment outcome probabilities for "medium prognosis" patients with p = 0.3, as a function of n, the number of embryos transferred
11.14 IVF treatment outcome probabilities for "poor prognosis" patients with p = 0.18, as a function of n, the number of embryos transferred
11.15 Relative sensitivity of the binomial-model-derived n to errors in estimates of p, as a function of p
12.1 Relating the tools of Probability, Statistics and Design of Experiments to the concepts of Population and Sample
12.2 Bar chart of welding injuries from Table 12.1
12.3 Bar chart of welding injuries arranged in decreasing order of number of injuries
12.4 Pareto chart of welding injuries
12.5 Pie chart of welding injuries
12.6 Bar chart of frozen ready meals sold in France in 2002
12.7 Pie chart of frozen ready meals sold in France in 2002
12.8 Histogram for YA data of Chapter 1
12.9 Frequency polygon of YA data of Chapter 1
12.10 Frequency polygon of YB data of Chapter 1
12.11 Boxplot of the chemical process yield data YA, YB of Chapter 1
12.12 Boxplot of random N(0,1) data: original set, and with added "outlier"
12.13 Box plot of raisins dispensed by five different machines
12.14 Scatter plot of cranial circumference versus finger length: The plot shows no real relationship between these variables
12.15 Scatter plot of city gas mileage versus highway gas mileage for various two-seater automobiles: The plot shows a strong positive linear relationship, but no causality is implied
12.16 Scatter plot of highway gas mileage versus engine capacity for various two-seater automobiles: The plot shows a negative linear relationship. Note the two unusually high mileage values associated with engine capacities 7.0 and 8.4 liters, identified as belonging to the Chevrolet Corvette and the Dodge Viper, respectively
12.17 Scatter plot of highway gas mileage versus number of cylinders for various two-seater automobiles: The plot shows a negative linear relationship
12.18 Scatter plot of US population every ten years since the 1790 census versus census year: The plot shows a strong non-linear trend, with very little scatter, indicative of the systematic, approximately exponential growth
12.19 Scatter plot of Y1 and X1 from Anscombe data set 1
12.20 Scatter plot of Y2 and X2 from Anscombe data set 2
12.21 Scatter plot of Y3 and X3 from Anscombe data set 3
12.22 Scatter plot of Y4 and X4 from Anscombe data set 4
13.1 Sampling distribution for mean lifetime of DLP lamps in Example 13.3 used to compute P(5100 < X̄ < 5200) = P(0.66 < Z < 1.34)
13.2 Sampling distribution for average lifetime of DLP lamps in Example 13.3 used to compute P(X̄ < 5000) = P(Z < −2.67)
13.3 Sampling distribution of the mean diameter of ball bearings in Example 13.4 used to compute P(|X̄ − 10| ≥ 0.14) = P(|T| ≥ 0.62)
13.4 Sampling distribution for the variance of ball bearing diameters in Example 13.5 used to compute P(S ≥ 1.01) = P(C ≥ 23.93)
13.5 Sampling distribution for the two variances of ball bearing diameters in Example 13.6 used to compute P(F ≥ 1.41) + P(F ≤ 0.709)
14.1 Sampling distribution for the two estimators U1 and U2: U1 is the more efficient estimator because of its smaller variance
14.2 Two-sided tail area probabilities of α/2 for the standard normal sampling distribution
14.3 Two-sided tail area probabilities of α/2 = 0.025 for a Chi-squared distribution with 9 degrees of freedom
14.4 Sampling distribution with two-sided tail area probabilities of 0.025 for X̄/β, based on a sample of size n = 10 from an exponential population
14.5 Sampling distribution with two-sided tail area probabilities of 0.025 for X̄/β, based on a larger sample of size n = 100 from an exponential population
15.1 A distribution for the null hypothesis, H0, in terms of the test statistic, QT, where the shaded rejection region, QT > q, indicates a significance level, α
15.2 Overlapping distributions for the null hypothesis, H0 (with mean μ0), and alternative hypothesis, Ha (with mean μa), showing Type I and Type II error risks α, β, along with qC, the boundary of the critical region of the test statistic, QT
15.3 The standard normal variate z = −zα with tail area probability α. The shaded portion is the rejection region for a lower-tailed test, Ha: μ < μ0
15.4 The standard normal variate z = zα with tail area probability α. The shaded portion is the rejection region for an upper-tailed test, Ha: μ > μ0
15.5 Symmetric standard normal variates z = −zα/2 and z = zα/2 with identical tail area probabilities α/2. The shaded portions show the rejection regions for a two-sided test, Ha: μ ≠ μ0
15.6 Box plot for "Method A" scores including the null hypothesis mean, H0: μ = 75, shown along with the sample average, x̄, and the 95% confidence interval based on the t-distribution with 9 degrees of freedom. Note how the upper bound of the 95% confidence interval lies to the left of, and does not touch, the postulated H0 value
15.7 Box plot for "Method B" scores including the null hypothesis mean, H0: μ = 75, shown along with the sample average, x̄, and the 95% confidence interval based on the t-distribution with 9 degrees of freedom. Note how the 95% confidence interval includes the postulated H0 value
15.8 Box plot of differences between the "Before" and "After" weights, including a 95% confidence interval for the mean difference, and the hypothesized H0 point, δ0 = 0
15.9 Box plot of the "Before" and "After" weights including individual data means. Notice the wide range of each data set
15.10 A plot of the "Before" and "After" weights for each patient. Note how one data sequence is almost perfectly correlated with the other; in addition, note the relatively large variability intrinsic in each data set compared to the difference between each point
15.11 Null and alternative hypothesis distributions for an upper-tailed test based on n = 25 observations, with population standard deviation σ = 4, where the true alternative mean, μa, exceeds the hypothesized one by δ = 2.0. The figure shows a z-shift of (δ√n)/σ = 2.5; and with reference to H0, the critical value z0.05 = 1.65. The area under the H0 curve to the right of the point z = 1.65 is α = 0.05, the significance level; the area under the dashed Ha curve to the left of the point z = 1.65 is β
15.12 β and power values for the hypothesis test of Fig 15.11 with Ha ~ N(2.5, 1). Top: β; Bottom: Power = (1 − β)
15.13 Rejection regions for one-sided tests of a single variance of a normal population, at a significance level of α = 0.05, based on n = 10 samples. The distribution is χ²(9); Top: for Ha: σ² > σ0², indicating rejection of H0 if c² > χ²α(9) = 16.9; Bottom: for Ha: σ² < σ0², indicating rejection of H0 if c² < χ²1−α(9) = 3.33
15.14 Rejection regions for the two-sided test concerning the variance of the process A yield data, H0: σA² = 1.5², based on n = 50 samples, at a significance level of α = 0.05. The distribution is χ²(49), with the rejection region shaded; because the test statistic, c² = 44.63, falls outside of the rejection region, we do not reject H0
15.15 Rejection regions for the two-sided test of the equality of the variances of the process A and process B yield data, i.e., H0: σA² = σB², at a significance level of α = 0.05, based on n = 50 samples each. The distribution is F(49, 49), with the rejection region shaded; since the test statistic, f = 0.27, falls within the rejection region to the left, we reject H0 in favor of Ha
16.1 Boiling point of hydrocarbons in Table 16.1 as a function of the number of carbon atoms in the compound
16.2 The true regression line and the zero-mean random error εi
16.3 The Gaussian assumption regarding variability around the true regression line, giving rise to εi ~ N(0, σ²): The 6 points represent the data at x1, x2, ..., x6; the solid straight line is the true regression line, which passes through the mean of the sequence of the indicated Gaussian distributions
16.4 The fitted straight line to the Density versus Ethanol Weight % data: The additional terms included in the graph, S, R-Sq and R-Sq(adj), are discussed later
16.5 The fitted regression line to the Density versus Ethanol Weight % data (solid line) along with the 95% confidence interval (dashed line). The confidence interval is narrowest at x = x̄ and widens for values further away from x̄
16.6 The fitted straight line to the Cranial circumference versus Finger length data. Note how the data points are widely scattered around the fitted regression line. (The additional terms included in the graph, S, R-Sq and R-Sq(adj), are discussed later)
16.7 The fitted straight line to the Highway MPG versus Engine Capacity data of Table 12.5 (leaving out the two "inconsistent" data points), along with the 95% confidence interval (long dashed line) and the 95% prediction interval (short dashed line). (Again, the additional terms included in the graph, S, R-Sq and R-Sq(adj), are discussed later)
16.8 Modeling the temperature dependence of thermal conductivity: Top: Fitted straight line to the Thermal conductivity (k) versus Temperature (T °C) data in Table 16.6; Bottom: standardized residuals versus fitted value, ŷi
16.9 Modeling the dependence of the boiling points (BP) of hydrocarbon compounds in Table 16.1 on the number of carbon atoms in the compound: Top: Fitted straight line of BP versus n, the number of carbon atoms; Bottom: standardized residuals versus fitted value ŷi. Notice the distinctive quadratic structure left over in the residuals, exposing the linear model's over-estimation at the extremes and the under-estimation in the middle
16.10 Catalytic process yield data of Table 16.7
16.11 Catalytic process yield data of Table 16.7. Top: Fitted plane of Yield as a function of Temperature and Pressure; Bottom: standardized residuals versus fitted value ŷi. Nothing appears unusual about these residuals
16.12 Modeling the dependence of the boiling points (BP) of hydrocarbon compounds in Table 16.1 on the number of carbon atoms in the compound: Top: Fitted quadratic curve of BP versus n, the number of carbon atoms; Bottom: standardized residuals versus fitted value ŷi. Despite the good fit, the visible systematic structure still left over in the residuals suggests adding one more term to the model
16.13 Modeling the dependence of the boiling points (BP) of hydrocarbon compounds in Table 16.1 on the number of carbon atoms in the compound: Top: Fitted cubic curve of BP versus n, the number of carbon atoms; Bottom: standardized residuals versus fitted value ŷi. There appears to be little or no systematic structure left in the residuals, suggesting that the cubic model provides an adequate description of the data
16.14 Gram polynomials evaluated at 5 discrete points k = 1, 2, 3, 4, 5; p0 is the constant; p1, the straight line; p2, the quadratic; and p3, the cubic
17.1 Probability plots for safety data postulated to be exponentially distributed, each showing (a) rank-ordered data; (b) theoretical fitted cumulative probability distribution line along with associated 95% confidence intervals; (c) a list of summary statistics, including the p-value associated with a formal goodness-of-fit test. The indication from the p-values is that there is no evidence to reject H0; therefore the model appears to be adequate
17.2 Probability plot for safety data S2 wrongly postulated to be normally distributed. The departure from the linear fit does not appear too severe, but the low/borderline p-value (0.045) objectively compels us to reject H0 at the 0.05 significance level and conclude that the Gaussian model is inadequate for this data
17.3 Probability plots for yield data sets YA and YB postulated to be normally distributed. The 95% confidence intervals around the fitted line, along with the indicated p-values, strongly suggest that the distributional assumptions appear to be valid
17.4 Normal probability plot for the residuals of the regression analysis of the dependence of thermal conductivity, k, on Temperature in Example 16.5. The postulated model, a two-parameter regression model with Gaussian-distributed zero-mean errors, appears valid
17.5 Chi-squared test results for inclusions data and a postulated Poisson model. Top panel: Bar chart of "Expected" and "Observed" frequencies, which shows how well the model prediction matches observed data; Bottom panel: Bar chart of contributions to the Chi-squared statistic, showing that the group of 3 or more inclusions is responsible for the largest model-observation discrepancy, by a wide margin
18.1 Histograms of interspike intervals data with Gamma model fit for the pyramidal tract cell of a monkey. Top panel: when awake (PT-W); Bottom panel: when asleep (PT-S). Note the similarities in the estimated values for the shape parameter α for both sets of data, and the difference between the estimates for β, the scale parameters
18.2 Probability plot of interspike intervals data with postulated Gamma model and Anderson-Darling test for the pyramidal tract cell of a monkey. Top panel: when awake (PT-W); Bottom panel: when asleep (PT-S). The p-values for the A-D tests indicate no evidence to reject the null hypothesis
19.1 Graphic illustration of the orthogonal vector decomposition of Eq (19.11)
19.2 Boxplot of raisins data showing what the ANOVA analysis has confirmed: that there is a significant difference in how the machines dispense raisins
19.3 Normal probability plots of the residuals from the one-way classification ANOVA model in Example 19.1. Top panel: Plot obtained directly from the ANOVA analysis, which does not provide any test statistic or significance level; Bottom panel: Subsequent goodness-of-fit test carried out on saved residuals; note the high p-value associated with the A-D test
19.4 Graphic illustration of the orthogonal error decomposition of Eq (19.21) with the additional block component, EB
19.5 Normal probability plots of the residuals from the two-way classification ANOVA model for investigating tire wear, obtained directly from the ANOVA analysis
19.6 2² factorial design for factors A and B showing the four experimental points; − represents "low" values, + represents "high" values for each factor
19.7 Graphic illustration of "folding", where two half-fractions of a 2³ factorial design are combined to recover the full factorial design; each "fold" costs an additional degree of freedom for analysis
19.8 Normal probability plot for the effects, using Lenth's method to identify A, D and AD as significant
19.9 Normal probability plot for the residuals of the Etch rate model in Eq (19.46), obtained upon projection of the experimental data to retain only the significant terms A, "Gap" (x1), D, "Power" (x2), and the interaction AD, "Gap*Power" (x1x2)
19.10 The 3-factor face-centered cube (FCC) response surface design and its constituent parts: 2³ factorial "base", open circles; face center points, lighter shaded circles; center point, darker solid circle
19.11 The 3-factor Box-Behnken response surface design and its constituent parts: X1, X2: 2² factorial points moved to the center of X3 to give the darker shaded circles at the edge-centers of the X3 axes; X2, X3: 2² factorial points moved to the center of X1 to give the lighter shaded circles at the edge-centers of the X1 axes; X1, X3: 2² factorial points moved to the center of X2 to give the solid circles at the edge-centers of the X2 axes; the center point, open circle
20.1 Chi-squared test results for Prussian army death by horse kicks data and a postulated Poisson model. Top panel: Bar chart of "Expected" and "Observed" frequencies; Bottom panel: Bar chart of contributions to the Chi-squared statistic
20.2 Initial prior distribution, a Gamma(2, 0.5), used to obtain a Bayesian estimate for the Poisson mean number of deaths per unit-year parameter
20.3 Recursive Bayesian estimates using yearly data sequentially, compared with the standard maximum likelihood estimate, 0.61 (dashed line)
20.4 Final posterior distribution (dashed line) along with initial prior distribution (solid line)
20.5 Quadratic regression model fit to US Population data, along with both the 95% confidence interval and the 95% prediction interval
20.6 Standardized residuals from the regression model fit to US Population data. Top panel: Residuals versus observation order; Bottom panel: Normal probability plot. Note the left-over pattern indicative of serial correlation, and the unusual observations identified for the 1940 and 1950 census years in the top panel; note also the general deviation of the residuals from the theoretical normal probability distribution line in the bottom panel
20.7 Percent average relative population growth rate in the US for each census year from 1800-2000, divided into three equal 70-year periods. Period 1: 1800-1860; Period 2: 1870-1930; Period 3: 1940-2000
20.8 Normal probability plot for the residuals from the ANOVA model for Percent average relative population growth rate versus Period, with Period 1: 1800-1860; Period 2: 1870-1930; Period 3: 1940-2000
20.9 Standardized residual plots for the "Yield" response surface model: versus fitted value, and normal probability plot
20.10 Standardized residual plots for the "Adhesion" response surface model: versus fitted value, and normal probability plot
20.11 Response surface and contour plots for "Yield" as a function of Additive and Temperature (with Time held at 60.00)
20.12 Response surface and contour plots for "Adhesion" as a function of Additive and Temperature (with Time held at 60.00)
20.13 Overlaid contours for "Yield" and "Adhesion" showing feasible region for desired optimum. The "planted flag" indicates the optimum values of the responses along with the corresponding setting of the factors Additive and Temperature (with Time held at 60.00) that achieve this optimum
20.14 Schematic diagram of folded helicopter prototype
20.15 Paper helicopter prototype
21.1 Simple systems: Series and parallel configuration
21.2 A series-parallel arrangement of a 6-component system
21.3 Sampling-analyzer system: basic configuration
21.4 Sampling-analyzer system: configuration with redundant solenoid valve
21.5 Fluid flow system with a cross link
21.6 Typical failure rate (hazard function) curve showing the classic three distinct characteristic periods in the lifetime distributions of a population of items
21.7 Blood storage system
21.8 Nuclear power plant heat exchanger system
21.9 Fluid flow system with a cross link (from Fig 21.5)
21.10 Fire alarm system with back up
21.11 Condenser system for VOCs
21.12 Simplified representation of the control structure in the baroreceptor reflex
22.1 OC curve for a lot size of 1000, sample size of 32 and acceptance number of 3: AQL is the acceptance quality level; RQL is the rejection quality level
22.2 OC curve for a lot size of 1000, generated for a sampling plan for an AQL = 0.004 and an RQL = 0.02, leading to a required sample size of 333 and acceptance number of 3. Compare with the OC curve in Fig 22.1
22.3 A generic SPC chart for the generic process variable Y indicating a sixth data point that is out of limits
22.4 The X-bar chart for the average length measurements for 6-inch nails determined from samples of three measurements obtained every 5 mins
22.5 The S-chart for the 6-inch nails process data of Example 22.2
22.6 The combination Xbar-R chart for the 6-inch nails process data of Example 22.2
22.7 The combination I-MR chart for the Mooney viscosity data
22.8 P-chart for the data on defective mechanical pencils: note the 9th observation that is outside the UCL
22.9 C-chart for the "inclusions" data presented in Chapter 1, Table 1.2, and discussed in subsequent chapters: note the 33rd observation that is outside the UCL; otherwise, the process appears to be operating in statistical control
22.10 Time series plot of the original Mooney viscosity data of Fig 22.7 and Table 22.2, and of the shifted version showing a step increase of 0.7 after sample 15
22.11 I-chart for the shifted Mooney viscosity data. Even with σ = 0.5, it is not sensitive enough to detect the step change of 0.7 introduced after sample 15
22.12 Two one-sided CUSUM charts for the shifted Mooney viscosity data. The upper chart uses dots; the lower chart uses diamonds; the non-conforming points are represented with the squares. With the same σ = 0.5, the step change of 0.7 introduced after sample 15 is identified after sample 18. Compare with the I-chart in Fig 22.11
22.13 Two one-sided CUSUM charts for the original Mooney viscosity data using the same characteristics as those in Fig 22.12. The upper chart uses dots; the lower chart uses diamonds; there are no non-conforming points
22.14 EWMA chart for the shifted Mooney viscosity data, with w = 0.2. Note the "staircase" shape of the control limits for the earlier data points. With the same σ = 0.5, the step change of 0.7 introduced after sample 15 is detected after sample 18. The non-conforming points are represented with the squares. Compare with the I-chart in Fig 22.11 and the CUSUM charts in Fig 22.12
22.15 The EWMA chart for the original Mooney viscosity data using the same characteristics as in Fig 22.14. There are no non-conforming points
23.1 Examples of the bivariate Gaussian distribution where the two random variables are uncorrelated (ρ = 0) and strongly positively correlated (ρ = 0.9)
23.2 Plot of the 16 variables in the illustrative example data set
23.3 Scree plot showing that the first two components are the most important
23.4 Plot of the scores and loading for the first principal component. The distinct trend indicated in the scores should be interpreted along with the loadings, by comparison to the full original data set in Fig 23.2
23.5 Plot of the scores and loading for the second principal component. The distinct trend indicated in the scores should be interpreted along with the loadings, by comparison to the full original data set in Fig 23.2
23.6 Scores and loading plots for the first two components. Top panel: Scores plot indicates a quadratic relationship between the two scores t1 and t2; Bottom panel: Loading vector plot indicates that in the new set of coordinates, the original variables contain mostly "pure" components PC1 and PC2, indicated by a distinctive North/South and West/East alignment of the data vectors, with like variables clustered together according to the nature of the component contributions. Compare to the full original data set in Fig 23.2
23.7 Principal component model for a 3-dimensional data set described by two principal components on a plane, showing a point with a large Q value and another with a large T² value
23.8 Control limits for Q and T² for process data represented with two principal components

List of Tables

1.1 Yield Data for Process A versus Process B
1.2 Number of inclusions on sixty 1-sq meter glass sheets
1.3 Group classification and frequencies for YA data
1.4 Group classification and frequencies for YB data
1.5 Group classification and frequencies for the inclusions data
2.1 Computed probabilities of occurrence of various number of inclusions for λ = 2
3.1 Subsets and Events
3.2 Class list and attributes
3.3 Lithium toxicity study results
4.1 f(x) and F(x) for the three coin-toss experiments of Example 4.1
4.2 The pdf f(x) for the ball-drawing game
4.3 Summary analysis for the ball-drawing game
5.1 Joint pdf for computer store sales
5.2 Joint and marginal pdfs for computer store sales
5.3 Conditional pdf f(x1|x2) for computer store sales
5.4 Conditional pdf f(x2|x1) for computer store sales
5.5 Joint and marginal pdfs for two-coin toss problem of Example 5.1
7.1 Summary of Mendel's single trait experiment results
7.2 Theoretical distribution of shape-color traits in second generation hybrids under the independence assumption
7.3 Theoretical versus experimental results for second generation hybrid plants
7.4 Attacks and hits on US Naval Warships in 1943
8.1 Theoretical versus empirical frequencies for inclusions data
8.2 Summary of probability models for discrete random variables
9.1 Summary of probability models for continuous random variables
10.1 Summary of maximum entropy probability models
11.1 Theoretical distribution of probabilities of possible outcomes of an IVF treatment
11.2 Elsner, et al., data of outcomes of a 42-month IVF treatment study
11.3 Binomial model prediction of Elsner, et al. data in Table 11.2
11.4 Elsner data stratified by age indicating variability in the probability of success estimates
11.5 Stratified binomial model prediction of Elsner, et al. data
12.1 Number and Type of injuries incurred by welders in the USA from 1980-1989
12.2 Frozen Ready meals in France, in 2002
12.3 Group classification and frequencies for YA data
12.4 Number of raisins dispensed into trial-sized "Raising Bran" cereal boxes
12.5 Gasoline mileage ratings for a collection of two-seater automobiles
12.6 Descriptive statistics for yield data sets YA and YB
12.7 The Anscombe data set 1
12.8 The Anscombe data sets 2, 3, and 4
14.1 Summary of estimation results
14.2 Some population parameters and conjugate prior distributions appropriate for their Bayesian estimation
15.1 Hypothesis test decisions and risks
15.2 Summary of H0 rejection conditions for the one-sample z-test
15.3 Summary of H0 rejection conditions for the one-sample t-test
15.4 Summary of H0 rejection conditions for the two-sample z-test
15.5 Summary of H0 rejection conditions for the two-sample t-test
15.6 "Before" and "After" weights for patients on a supervised weight-loss program
15.7 Summary of H0 rejection conditions for the paired t-test
15.8 Sample size n required to achieve a power of 0.9
15.9 Summary of H0 rejection conditions for the χ²-test
15.10 Summary of H0 rejection conditions for the F-test
15.11 Summary of H0 rejection conditions for the single-proportion z-test
15.12 Summary of Selected Hypothesis Tests and their Characteristics
16.1 Boiling points of a series of hydrocarbons
16.2 Density (in gm/cc) and weight percent of ethanol in ethanol-water mixture
16.3 Density and weight percent of ethanol in ethanol-water mixture: model fit and residual errors
16.4 Cranial circumference and finger lengths
16.5 ANOVA Table for Testing Significance of Regression
16.6 Thermal conductivity measurements at various temperatures for a metal
16.7 Laboratory experimental data on Yield
17.1 Table of values for safety data probability plot
18.1 A professor's teaching evaluation scores organized by student type
18.2 Interspike intervals data
18.3 Summary of Selected Nonparametric Tests and their Characteristics
19.1 Data table for typical single-factor experiment
19.2 One-Way Classification ANOVA Table
19.3 Data table for typical single-factor, two-way classification, experiment
19.4 Two-Way Classification ANOVA Table
19.5 Data table for typical two-factor experiment
19.6 Two-factor ANOVA Table
20.1 Frequency distribution of Prussian army deaths by horse kicks
20.2 Actual vs Predicted Frequency distribution of Prussian army deaths
20.3 Year-by-Year, Unit-by-Unit breakdown of Prussian army deaths data
20.4 Recursive (yearly) Bayesian estimates of the mean number of deaths per unit-year
20.5 Frequency distribution of bomb hits in greater London during WW II and Poisson model prediction
20.6 US Population (to the nearest million) from 1790-2000
20.7 Percent average relative population growth rate for each census year
20.8 Response surface design and experimental results for coating process
21.1 Summary of H0 rejection conditions for the test of hypothesis based on an exponential model of component failure-time
22.1 Measured length of samples of 6-inch nails in a manufacturing process
22.2 Hourly Mooney viscosity data
22.3 Number and proportion of defective mechanical pencils

Contents

0 Prelude
   0.1 Approach Philosophy
   0.2 Four basic principles
   0.3 Summary and Conclusions

I Foundations

1 Two Motivating Examples
   1.1 Yield Improvement in a Chemical Process
       1.1.1 The Problem
       1.1.2 The Essence of the Problem
       1.1.3 Preliminary Intuitive Notions
   1.2 Quality Assurance in a Glass Sheet Manufacturing Process
   1.3 Outline of a Systematic Approach
       1.3.1 Group Classification and Frequency Distributions
       1.3.2 Theoretical Distributions
   1.4 Summary and Conclusions

2 Random Phenomena, Variability and Uncertainty
   2.1 Two Extreme Idealizations of Natural Phenomena
       2.1.1 Introduction
       2.1.2 A Chemical Engineering Illustration
   2.2 Random Mass Phenomena
       2.2.1 Defining Characteristics
       2.2.2 Variability and Uncertainty
       2.2.3 Practical Problems of Interest
   2.3 Introducing Probability
       2.3.1 Basic Concepts
       2.3.2 Interpreting Probability
   2.4 The Probabilistic Framework
   2.5 Summary and Conclusions

II Probability

3 Fundamentals of Probability Theory
   3.1 Building Blocks
   3.2 Operations
       3.2.1 Events, Sets and Set Operations
       3.2.2 Set Functions
       3.2.3 Probability Set Function
       3.2.4 Final considerations
   3.3 Probability
       3.3.1 The Calculus of Probability
       3.3.2 Implications
   3.4 Conditional Probability
       3.4.1 Illustrating the Concept
       3.4.2 Formalizing the Concept
       3.4.3 Total Probability
       3.4.4 Bayes' Rule
   3.5 Independence
   3.6 Summary and Conclusions

4 Random Variables and Distributions
   4.1 Introduction and Definition
       4.1.1 Mathematical Concept of the Random Variable
       4.1.2 Practical Considerations
       4.1.3 Types of Random Variables
   4.2 Distributions
       4.2.1 Discrete Random Variables
       4.2.2 Continuous Random Variables
       4.2.3 The Probability Distribution Function
   4.3 Mathematical Expectation
       4.3.1 Motivating the Definition
       4.3.2 Definition and Properties
   4.4 Characterizing Distributions
       4.4.1 Moments of a Distribution
       4.4.2 Moment Generating Function
       4.4.3 Characteristic Function
       4.4.4 Additional Distributional Characteristics
       4.4.5 Entropy
       4.4.6 Probability Bounds
   4.5 Special Derived Probability Functions
       4.5.1 Survival Function
       4.5.2 Hazard Function
       4.5.3 Cumulative Hazard Function
   4.6 Summary and Conclusions

5 Multidimensional Random Variables
   5.1 Introduction and Definitions
       5.1.1 Perspectives
       5.1.2 2-Dimensional (Bivariate) Random Variables
       5.1.3 Higher-Dimensional (Multivariate) Random Variables
   5.2 Distributions of Several Random Variables
       5.2.1 Joint Distributions
       5.2.2 Marginal Distributions
       5.2.3 Conditional Distributions
       5.2.4 General Extensions
   5.3 Distributional Characteristics of Jointly Distributed Random Variables
       5.3.1 Expectations
       5.3.2 Covariance and Correlation
       5.3.3 Independence
   5.4 Summary and Conclusions

6 Random Variable Transformations
   6.1 Introduction and Problem Definition
   6.2 Single Variable Transformations
       6.2.1 Discrete Case
       6.2.2 Continuous Case
       6.2.3 General Continuous Case
       6.2.4 Random Variable Sums
   6.3 Bivariate Transformations
   6.4 General Multivariate Transformations
       6.4.1 Square Transformations
       6.4.2 Non-Square Transformations
       6.4.3 Non-Monotone Transformations
   6.5 Summary and Conclusions

7 Application Case Studies I: Probability
   7.1 Introduction
   7.2 Mendel and Heredity
       7.2.1 Background and Problem Definition
       7.2.2 Single Trait Experiments and Results
       7.2.3 Single trait analysis
       7.2.4 Multiple Traits and Independence
       7.2.5 Subsequent Experiments and Conclusions
   7.3 World War II Warship Tactical Response Under Attack
       7.3.1 Background and Problem Definition
       7.3.2 Approach and Results
       7.3.3 Final Comments
   7.4 Summary and Conclusions

III Distributions

8 Ideal Models of Discrete Random Variables
   8.1 Introduction
   8.2 The Discrete Uniform Random Variable
       8.2.1 Basic Characteristics and Model
       8.2.2 Applications
   8.3 The Bernoulli Random Variable
       8.3.1 Basic Characteristics
       8.3.2 Model Development
       8.3.3 Important Mathematical Characteristics
   8.4 The Hypergeometric Random Variable
       8.4.1 Basic Characteristics
       8.4.2 Model Development
       8.4.3 Important Mathematical Characteristics
       8.4.4 Applications
   8.5 The Binomial Random Variable
       8.5.1 Basic Characteristics
       8.5.2 Model Development
       8.5.3 Important Mathematical Characteristics
       8.5.4 Applications
   8.6 Extensions and Special Cases of the Binomial Random Variable
       8.6.1 Trinomial Random Variable
       8.6.2 Multinomial Random Variable
       8.6.3 Negative Binomial Random Variable
       8.6.4 Geometric Random Variable
   8.7 The Poisson Random Variable
       8.7.1 The Limiting Form of a Binomial Random Variable
       8.7.2 First Principles Derivation
       8.7.3 Important Mathematical Characteristics
       8.7.4 Applications
   8.8 Summary and Conclusions

9 Ideal Models of Continuous Random Variables
   9.1 Gamma Family Random Variables
       9.1.1 The Exponential Random Variable
       9.1.2 The Gamma Random Variable
       9.1.3 The Chi-Square Random Variable
       9.1.4 The Weibull Random Variable
       9.1.5 The Generalized Gamma Model
       9.1.6 The Poisson-Gamma Mixture Distribution
   9.2 Gaussian Family Random Variables
       9.2.1 The Gaussian (Normal) Random Variable
       9.2.2 The Standard Normal Random Variable
       9.2.3 The Lognormal Random Variable
       9.2.4 The Rayleigh Random Variable
       9.2.5 The Generalized Gaussian Model
   9.3 Ratio Family Random Variables
       9.3.1 The Beta Random Variable
       9.3.2 Extensions and Special Cases of the Beta Random Variable
       9.3.3 The (Continuous) Uniform Random Variable
       9.3.4 Fisher's F Random Variable
       9.3.5 Student's t Random Variable
       9.3.6 The Cauchy Random Variable
   9.4 Summary and Conclusions

10 Information, Entropy and Probability Models
   10.1 Uncertainty and Information
       10.1.1 Basic Concepts
       10.1.2 Quantifying Information
   10.2 Entropy
       10.2.1 Discrete Random Variables
       10.2.2 Continuous Random Variables
   10.3 Maximum Entropy Principles for Probability Modeling
   10.4 Some Maximum Entropy Models
       10.4.1 Discrete Random Variable; Known Range
       10.4.2 Discrete Random Variable; Known Mean
       10.4.3 Continuous Random Variable; Known Range
       10.4.4 Continuous Random Variable; Known Mean
       10.4.5 Continuous Random Variable; Known Mean and Variance
       10.4.6 Continuous Random Variable; Known Range, Mean and Variance
   10.5 Maximum Entropy Models from General Expectations
       10.5.1 Single Expectations
       10.5.2 Multiple Expectations
   10.6 Summary and Conclusions

11 Application Case Studies II: In-Vitro Fertilization
   11.1 Introduction
   11.2 In-Vitro Fertilization and Multiple Births
       11.2.1 Background and Problem Definition
       11.2.2 Clinical Studies and Recommended Guidelines
   11.3 Probability Modeling and Analysis
       11.3.1 Model Postulate
       11.3.2 Prediction
       11.3.3 Estimation
   11.4 Binomial Model Validation
       11.4.1 Overview and Study Characteristics
       11.4.2 Binomial Model versus Clinical Data
   11.5 Problem Solution: Model-based IVF Optimization and Analysis
       11.5.1 Optimization
       11.5.2 Model-based Analysis
       11.5.3 Patient Categorization and Theoretical Analysis of Treatment Outcomes
   11.6 Sensitivity Analysis
       11.6.1 General Discussion
       11.6.2 Theoretical Sensitivity Analysis
   11.7 Summary and Conclusions
       11.7.1 Final Wrap-up
       11.7.2 Conclusions and Perspectives on Previous Studies and Guidelines

IV Statistics

12 Introduction to Statistics
   12.1 From Probability to Statistics
       12.1.1 Random Phenomena and Finite Data Sets
       12.1.2 Finite Data Sets and Statistical Analysis
       12.1.3 Probability, Statistics and Design of Experiments
       12.1.4 Statistical Analysis
   12.2 Variable and Data Types
   12.3 Graphical Methods of Descriptive Statistics
       12.3.1 Bar Charts and Pie Charts
       12.3.2 Frequency Distributions
       12.3.3 Box Plots
       12.3.4 Scatter Plots
   12.4 Numerical Descriptions
       12.4.1 Theoretical Measures of Central Tendency
       12.4.2 Measures of Central Tendency: Sample Equivalents
       12.4.3 Measures of Variability
       12.4.4 Supplementing Numerics with Graphics
   12.5 Summary and Conclusions

13 Sampling
   13.1 Introductory Concepts
       13.1.1 The Random Sample
       13.1.2 The "Statistic" and its Distribution
   13.2 The Distribution of Functions of Random Variables
       13.2.1 General Overview
       13.2.2 Some Important Sampling Distribution Results
   13.3 Sampling Distribution of The Mean
       13.3.1 Underlying Probability Distribution Known
       13.3.2 Underlying Probability Distribution Unknown
       13.3.3 Limiting Distribution of the Mean
       13.3.4 σ Unknown
   13.4 Sampling Distribution of the Variance
   13.5 Summary and Conclusions

14 Estimation
   14.1 Introductory Concepts
       14.1.1 An Illustration
       14.1.2 Problem Definition and Key Concepts
   14.2 Criteria for Selecting Estimators
       14.2.1 Unbiasedness
       14.2.2 Efficiency
       14.2.3 Consistency
   14.3 Point Estimation Methods
       14.3.1 Method of Moments
       14.3.2 Maximum Likelihood
   14.4 Precision of Point Estimates
   14.5 Interval Estimates
       14.5.1 General Principles
       14.5.2 Mean of a Normal Population; σ Known
       14.5.3 Mean of a Normal Population; σ Unknown
       14.5.4 Variance of a Normal Population
       14.5.5 Difference of Two Normal Populations' Means
       14.5.6 Interval Estimates for Parameters from other Populations
   14.6 Bayesian Estimation
       14.6.1 Background
       14.6.2 Basic Concept
       14.6.3 Bayesian Estimation Results
       14.6.4 A Simple Illustration
       14.6.5 Discussion
   14.7 Summary and Conclusions

15 Hypothesis Testing
   15.1 Introduction
   15.2 Basic Concepts
       15.2.1 Terminology and Definitions
       15.2.2 General Procedure
   15.3 Concerning Single Mean of a Normal Population
       15.3.1 σ Known; the "z-test"
       15.3.2 σ Unknown; the "t-test"
       15.3.3 Confidence Intervals and Hypothesis Tests
   15.4 Concerning Two Normal Population Means
       15.4.1 Population Standard Deviations Known
       15.4.2 Population Standard Deviations Unknown
       15.4.3 Paired Differences
   15.5 Determining β, Power, and Sample Size
       15.5.1 β and Power
       15.5.2 Sample Size
       15.5.3 β and Power for Lower-Tailed and Two-Sided Tests
       15.5.4 General Power and Sample Size Considerations
   15.6 Concerning Variances of Normal Populations
       15.6.1 Single Variance
       15.6.2 Two Variances
   15.7 Concerning Proportions
       15.7.1 Single Population Proportion
       15.7.2 Two Population Proportions
   15.8 Concerning Non-Gaussian Populations
       15.8.1 Large Sample Test for Means
       15.8.2 Small Sample Tests
   15.9 Likelihood Ratio Tests
       15.9.1 General Principles
       15.9.2 Special Cases
       15.9.3 Asymptotic Distribution for Λ
   15.10 Discussion
   15.11 Summary and Conclusions

16 Regression Analysis
   16.1 Introductory Concepts
       16.1.1 Dependent and Independent Variables
       16.1.2 The Principle of Least Squares
   16.2 Simple Linear Regression
       16.2.1 One-Parameter Model
       16.2.2 Two-Parameter Model
       16.2.3 Properties of OLS Estimators
       16.2.4 Confidence Intervals
       16.2.5 Hypothesis Testing
       16.2.6 Prediction and Prediction Intervals
       16.2.7 Coefficient of Determination and the F-Test
       16.2.8 Relation to the Correlation Coefficient
       16.2.9 Mean-Centered Model
       16.2.10 Residual Analysis
   16.3 "Intrinsically" Linear Regression
       16.3.1 Linearity in Regression Models
       16.3.2 Variable Transformations
   16.4 Multiple Linear Regression
       16.4.1 General Least Squares
       16.4.2 Matrix Methods
       16.4.3 Some Important Special Cases
       16.4.4 Recursive Least Squares
   16.5 Polynomial Regression
       16.5.1 General Considerations
       16.5.2 Orthogonal Polynomial Regression
   16.6 Summary and Conclusions

17 Probability Model Validation
   17.1 Introduction
   17.2 Probability Plots
       17.2.1 Basic Principles
       17.2.2 Transformations and Specialized Graph Papers
       17.2.3 Modern Probability Plots
       17.2.4 Applications
   17.3 Chi-Squared Goodness-of-fit Test
       17.3.1 Basic Principles
       17.3.2 Properties and Application
   17.4 Summary and Conclusions

18 Nonparametric Methods
   18.1 Introduction
   18.2 Single Population
       18.2.1 One-Sample Sign Test
       18.2.2 One-Sample Wilcoxon Signed Rank Test
   18.3 Two Populations
       18.3.1 Two-Sample Paired Test
       18.3.2 Mann-Whitney-Wilcoxon Test
   18.4 Probability Model Validation
       18.4.1 The Kolmogorov-Smirnov Test
       18.4.2 The Anderson-Darling Test
   18.5 A Comprehensive Illustrative Example
       18.5.1 Probability Model Postulate and Validation
       18.5.2 Mann-Whitney-Wilcoxon Test
   18.6 Summary and Conclusions

19 Design of Experiments
   19.1 Introductory Concepts
       19.1.1 Experimental Studies and Design
       19.1.2 Phases of Efficient Experimental Studies
       19.1.3 Problem Definition and Terminology
   19.2 Analysis of Variance
   19.3 Single Factor Experiments
       19.3.1 One-Way Classification
       19.3.2 Kruskal-Wallis Nonparametric Test
       19.3.3 Two-Way Classification
       19.3.4 Other Extensions
   19.4 Two-Factor Experiments
   19.5 General Multi-factor Experiments
   19.6 2^k Factorial Experiments and Design
       19.6.1 Overview
       19.6.2 Design and Analysis
       19.6.3 Procedure
       19.6.4 Closing Remarks
   19.7 Screening Designs: Fractional Factorial
       19.7.1 Rationale
       19.7.2 Illustrating the Mechanics
       19.7.3 General characteristics
       19.7.4 Design and Analysis
       19.7.5 A Practical Illustrative Example
   19.8 Screening Designs: Plackett-Burman
       19.8.1 Primary Characteristics
       19.8.2 Design and Analysis
   19.9 Response Surface Designs
       19.9.1 Characteristics
       19.9.2 Response Surface Designs
       19.9.3 Design and Analysis
   19.10 Introduction to Optimal Designs
       19.10.1 Background
       19.10.2 "Alphabetic" Optimal Designs
   19.11 Summary and Conclusions

20 Application Case Studies III: Statistics
   20.1 Introduction
   20.2 Prussian Army Death-by-Horse kicks
       20.2.1 Background and Data
       20.2.2 Parameter Estimation and Model Validation
       20.2.3 Recursive Bayesian Estimation
   20.3 WW II Aerial Bombardment of London
   20.4 US Population Dynamics: 1790-2000
       20.4.1 Background and Data
       20.4.2 "Truncated" Data Modeling and Evaluation
       20.4.3 Full Data Set Modeling and Evaluation
       20.4.4 Hypothesis Testing Concerning Average Population Growth Rate
   20.5 Process Optimization
       20.5.1 Problem Definition and Background
       20.5.2 Experimental Strategy and Results
       20.5.3 Analysis
   20.6 Summary and Conclusions

V Applications

21 Reliability and Life Testing
   21.1 Introduction
   21.2 System Reliability
       21.2.1 Simple Systems
       21.2.2 Complex Systems
   21.3 System Lifetime and Failure-Time Distributions
       21.3.1 Characterizing Time-to-Failure
       21.3.2 Probability Models for Distribution of Failure Times
   21.4 The Exponential Reliability Model
       21.4.1 Component Characteristics
       21.4.2 Series Configuration
       21.4.3 Parallel Configuration
       21.4.4 m-of-n Parallel Systems
   21.5 The Weibull Reliability Model
   21.6 Life Testing
       21.6.1 The Exponential Model
       21.6.2 The Weibull Model
   21.7 Summary and Conclusions

22 Quality Assurance and Control
   22.1 Introduction
   22.2 Acceptance Sampling
       22.2.1 Basic Principles
       22.2.2 Determining a Sampling Plan
   22.3 Process and Quality Control
       22.3.1 Underlying Philosophy
       22.3.2 Statistical Process Control
       22.3.3 Basic Control Charts
       22.3.4 Enhancements
   22.4 Chemical Process Control
       22.4.1 Preliminary Considerations
       22.4.2 Statistical Process Control (SPC) Perspective
       22.4.3 Engineering/Automatic Process Control (APC) Perspective
       22.4.4 SPC or APC
   22.5 Process and Parameter Design
       22.5.1 Basic Principles
       22.5.2 A Theoretical Rationale
   22.6 Summary and Conclusions

23 Introduction to Multivariate Analysis
   23.1 Multivariate Probability Models
       23.1.1 Introduction
       23.1.2 The Multivariate Normal Distribution
       23.1.3 The Wishart Distribution
       23.1.4 Hotelling's T-Squared Distribution
       23.1.5 The Wilks Lambda Distribution
       23.1.6 The Dirichlet Distribution
   23.2 Multivariate Data Analysis
   23.3 Principal Components Analysis
       23.3.1 Basic Principles of PCA
       23.3.2 Main Characteristics of PCA
       23.3.3 Illustrative example
       23.3.4 Other Applications of PCA
   23.4 Summary and Conclusions

Appendix

Index

Chapter 0
Prelude

0.1 Approach Philosophy
0.2 Four basic principles
0.3 Summary and Conclusions

Rem tene; verba sequentur.
(Grasp the subject; the words will follow.)
Cato the Elder (234-149 BC)

From weather forecasts and life insurance premiums for non-smokers to clinical tests of experimental drugs and defect rates in manufacturing facilities, and in numerous other ways, randomly varying phenomena exert a subtle but pervasive influence on everyday life. In most cases, one can be blissfully ignorant of the true implications of the presence of such phenomena without consequence. In science and engineering, however, the influence of randomly varying phenomena can be such that even apparently simple problems can become dramatically complicated by the presence of random variability, demanding special methods and analysis tools for obtaining valid and useful solutions.

The primary aim of this book is to provide the reader with the basic fundamental principles, methods, and tools for formulating and solving engineering problems involving randomly varying phenomena.

Since this aim can be achieved in several different ways, this chapter is devoted to presenting this book's approach philosophy.

0.1 Approach Philosophy

Engineers are typically well-trained in the art of problem formulation and problem solving when all the entities involved are considered deterministic in character. However, many problems of practical importance involve
randomly varying phenomena of one sort or another; and the vast majority of such problems cannot always be idealized and reduced to the more familiar deterministic types without destroying the very essence of the problem. For example, in determining which of two catalysts A or B provides the greater yield in a chemical manufacturing process, it is well-known that the respective yields YA and YB, as observed experimentally, are randomly varying quantities. Chapter 1 presents a full-scale discussion of this problem. For now, we simply note that with catalyst A, fifty different experiments performed under essentially identical conditions will result in fifty different values (realizations) for YA. Similarly for catalyst B, one obtains fifty distinct values for YB from fifty different experiments replicated under identical conditions. The first 10 experimental data points for this example are shown in the table below.
YA % YB %
74.04 75.75
75.29 68.41
75.62 74.19
75.91 68.10
77.21 68.10
75.07 69.23
74.23 70.14
74.92 69.22
76.57 74.17
77.77 70.23
Observe that because of the variability inherent in the data, some of the YA values are greater than some of the YB values; but the converse is also true: some YB values are greater than some YA values. So how does one determine, reliably and confidently, which catalyst (if any) really provides the greater yield? Clearly, special methods and analysis tools are required for handling this apparently simple problem: the deterministic idealization of comparing a single observed value of YA (say the first entry, 74.04) with a corresponding single observed value of YB (in this case 75.75) is incapable of producing a valid answer. The primary essence of this problem is the variability inherent in the data, which masks the fact that one catalyst does in fact provide the greater yield.
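For a concrete feel for the difficulty, consider the following minimal Python sketch (ours, purely for illustration), which applies both a naive pairwise comparison and a comparison of averages to the ten data points tabulated above. Neither computation, by itself, says anything about how much confidence the comparison deserves:

    # The first 10 yield observations, transcribed from the table above.
    yA = [74.04, 75.29, 75.62, 75.91, 77.21, 75.07, 74.23, 74.92, 76.57, 77.77]
    yB = [75.75, 68.41, 74.19, 68.10, 68.10, 69.23, 70.14, 69.22, 74.17, 70.23]

    # Naive comparison of paired observations: how many pairs favor catalyst B?
    wins_for_B = sum(1 for a, b in zip(yA, yB) if b > a)
    print(wins_for_B)   # prints 1: only the first pair favors B

    # Averages of these first ten points (the full 50-point averages differ):
    print(sum(yA) / len(yA), sum(yB) / len(yB))   # about 75.66 versus 70.75

The pairwise comparison and the averages can easily pull in different directions on small samples; settling the question reliably requires the systematic treatment developed in the chapters that follow.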
This book takes a more fundamental, "first-principles" approach to the issue of dealing with random variability and uncertainty in engineering problems. This is in contrast to the typical "engineering statistics" approach on the one hand, or the problem-specific approach on the other. With the former approach, most of the emphasis is on how to use certain popular statistical techniques to solve some of the most commonly encountered engineering problems, with little or no discussion of why the techniques are effective. With the latter approach, a particular topic (say "Design of Experiments") is selected and dealt with in depth, and the appropriate statistical tools are presented and discussed within the context of the specific problem at the core of the selected topic. By definition, such an approach excludes all other topics that may be of practical interest, opting to make up in depth what it gives up in breadth.

The approach taken in this book is based on the premise that emphasizing fundamentals and basic principles, and then illustrating these with examples, equips the reader with the means of dealing with a range of problems wider than that explicitly covered in the book.

This approach philosophy is based on the four basic principles discussed next.

0.2 Four basic principles

1. If characterized properly, random phenomena are subject to rigorous mathematical analysis in much the same manner as deterministic phenomena.

Random phenomena are so-called because they show no apparent regularity, appearing to occur haphazardly, totally at random; the observed variations do not seem to obey any discernible rational laws and therefore appear to be entirely unpredictable. However, the unpredictable irregularities of the individual observations (or, more generally, the "detail") of random phenomena in fact co-exist with surprisingly predictable ensemble, or "aggregate," behavior. This fact makes rigorous analysis possible; it also provides the basis for employing the concept and calculus of probability to develop a systematic framework for characterizing random phenomena in terms of probability distribution functions.

The first order of business is therefore to seek to understand random phenomena and to develop techniques for characterizing them appropriately. Part I, titled FOUNDATIONS: Understanding Random Variability, and Part II, titled PROBABILITY: Characterizing Random Variability, are devoted to these respective tasks. Ultimately, probability, and the probability distribution function, are introduced as the theoretical constructs for efficiently describing our knowledge of the real-world phenomena in question.
2. By focusing on the underlying phenomenological mechanisms, it is possible to develop appropriate theoretical characterizations of random phenomena in terms of ideal models of the observed variability.

Within the probabilistic framework, the ensemble, or aggregate, behavior of the random phenomenon in question is characterized by its probability distribution function. In much the same way that theoretical mathematical models are derived from first-principles for deterministic phenomena, it is also possible to derive these theoretical probability distribution functions as ideal models that describe our knowledge of the underlying random phenomena. Part III, titled DISTRIBUTIONS: Modeling Random Variability, is devoted to the important tasks of developing and analyzing ideal probability models for many random phenomena of practical interest. The end result is a collection of probability distribution functions each derived directly from, and hence explicitly linked to, the underlying random phenomenological mechanisms.
3. The ensemble (or aggregate) characterization provided by ideal probability models can be used successfully to develop the theoretical basis for solving real problems where one is always limited to dealing with an incomplete collection of individual observations, never the entire aggregate.

A key defining characteristic of random phenomena is that specific outcomes or observations cannot be predicted with absolute certainty. With probabilistic analysis, this otherwise impossible task of "predicting the unpredictable" individual observation or outcome is simply replaced by the analytical task of determining the mathematical probability of its occurrence. In many practical problems involving random phenomena, however, there is no avoiding this "impossible" task: one is required to deal with, and make decisions about, individual observations, and must therefore confront the inevitable uncertainty that will always be associated with such decisions. Statistical Theory, using the aggregate descriptions of probability theory, provides a rational basis not only for making these predictions and decisions about individual observations with confidence, but also for quantifying the degree of uncertainty associated with such decisions.

Part IV, titled STATISTICS: Quantifying Random Variability, is devoted to elucidating statistical principles and concepts required for dealing effectively with data as collections of individual observations from random phenomena.
4. The usefulness and broad applicability of the fundamental principles, analysis methods, and tools provided by probability and statistics are best illustrated with several actual example topics of engineering applications involving random phenomena.

The manifestations of random phenomena in problems of practical interest are countless, and the range of such problems is itself quite broad: from simple data analysis and experimental designs, to polynomial curve-fitting, and empirical modeling of complex dynamic systems; from quality assurance and control, to state and parameter estimation, and process monitoring and diagnosis, etc. The topical headings under which such problems may be organized (Design of Experiments; Regression Analysis; Time Series Analysis; etc.) are numerous, and many books have been devoted to each one of
them. Clearly then, the sheer vastness of the subject matter of engineering applications of probability and statistics renders completely unreasonable any hope of comprehensive coverage in a single introductory text.

Nevertheless, how probability and statistics are employed in practice to deal successfully with various problems created by random variability and uncertainty can be discussed in such a way as to equip the student with the tools needed to approach, with confidence, other problems that are not addressed explicitly in this book.

Part V, titled APPLICATIONS: Dealing with Random Variability in Practice, consists of three chapters each devoted to a specific application topic of importance in engineering practice. Entire books have been written, and entire courses taught, on each of the topics to which we will devote only one chapter; the coverage is therefore designed to be more illustrative than comprehensive, providing the basis for absorbing and employing more efficiently the more extensive material presented in these other books or courses.

0.3 Summary and Conclusions

This chapter has been primarily concerned with setting forth this book's approach to presenting the fundamentals and engineering applications of probability and statistics. The four basic principles on which the more fundamental, first-principles approach is based were presented, providing the rationale for the scope and organization of the material to be presented in the rest of the book.

The approach is designed to produce the following result:

A course of study based on this book should provide the reader with a reasonable fundamental understanding of random phenomena, a working knowledge of how to model and analyze such phenomena, and facility with using probability and statistics to cope with random variability and uncertainty in some key engineering problems.

The book should also prepare the student to absorb and employ the material presented in more problem-specific courses such as Design of Experiments, Time Series Analysis, Regression Analysis, Statistical Process Control, etc., a bit more efficiently.

Part I: Foundations
Understanding Random Variability

I shall light a candle of understanding in thine heart which shall not be put out.
Apocrypha: I Esdras 14:25

Chapter 1: Two Motivating Examples
Chapter 2: Random Phenomena, Variability and Uncertainty

Chapter 1
Two Motivating Examples

1.1 Yield Improvement in a Chemical Process
    1.1.1 The Problem
          Mathematical Formulation
    1.1.2 The Essence of the Problem
    1.1.3 Preliminary Intuitive Notions
1.2 Quality Assurance in a Glass Sheet Manufacturing Process
1.3 Outline of a Systematic Approach
    1.3.1 Group Classification and Frequency Distributions
    1.3.2 Theoretical Distributions
          A Preview
1.4 Summary and Conclusions
    REVIEW QUESTIONS
    EXERCISES
    APPLICATION PROBLEMS

And coming events cast their shadows before.
(Lochiel's warning.)
Thomas Campbell (1777-1844)

When random variability is genuinely intrinsic to a problem, uncertainty becomes inevitable, but the problem can still be solved systematically and with confidence. This, the underlying theme of Applied Probability and Statistics, is what this chapter seeks to illustrate with two representative examples. The example problems and the accompanying discussions are intended to serve two main purposes: (i) illustrate the sort of complications caused by the presence of random components in practical problems; and (ii) demonstrate (qualitatively for now) how to solve such problems by formulating them properly and employing appropriate methods and tools. The primary value of this chapter is as a vehicle for placing in context this book's approach to analyzing randomly varying phenomena in engineering and science. It allows us to preview and motivate the key concepts to be developed fully in the remaining chapters.
1.1 Yield Improvement in a Chemical Process

To an engineer or scientist, determining which of two numbers is larger, and by how much, is trivial, in principle requiring no more than the elementary arithmetic operation of subtraction. Identify the two numbers as individual observations from two randomly varying quantities, however, and the character of the problem changes significantly: determining, with any certainty, which of the random quantities is larger and precisely by how much now requires more than mere subtraction. This is the case with our first example.

1.1.1 The Problem

A chemical process using catalyst A ("process A") is being considered as an alternative to the incumbent process using a different catalyst B ("process B"). The decision in favor of one process over the other is to be based on a comparison of the yield YA obtained from the challenger, and YB from the incumbent, in conjunction with the following economic considerations:

• Achieving target profit objectives for the finished product line requires a process yield consistently at or above 74.5%.

• For process A to be a viable alternative, at the barest minimum, its yield must be higher than that for process B.

• Every 1% yield increase over what is currently achievable with the incumbent process translates to significant after-tax operating income; however, catalyst A used by the alternative process costs more than catalyst B used by the incumbent process. Including the additional cost of the process modifications required to implement the new technology, a shift to catalyst A and the new process will be economically viable only if the resulting yield increase exceeds 2%.

The result of a series of 50 experiments carefully performed on each process to determine YA and YB is shown in Table 1.1. Given only the supplied data, what should the Vice President/General Manager of this business do: authorize a switch to the new process A or stay with the incumbent process?
Mathematical Formulation

Observe that solving this problem requires finding appropriate answers to the following mathematical questions:

1. Is YA ≥ 74.5, and YB ≥ 74.5, consistently?

2. Is YA > YB?

3. If yes, is YA − YB > 2?

TABLE 1.1: Yield Data for Process A versus Process B
(the first two columns hold the 50 YA values; the last two hold the 50 YB values)

      YA %                YB %
74.04   75.29       75.75   68.41
75.63   75.92       74.19   68.10
77.21   75.07       68.10   69.23
74.23   74.92       70.14   69.23
76.58   77.77       74.17   70.24
75.05   74.90       70.09   71.91
75.69   75.31       72.63   78.41
75.19   77.93       71.16   73.37
75.37   74.78       70.27   73.64
74.47   72.99       75.82   74.42
73.99   73.32       72.14   78.49
74.90   74.88       74.88   76.33
75.78   79.07       70.89   71.07
75.09   73.87       72.39   72.04
73.88   74.23       74.94   70.02
76.98   74.85       75.64   74.62
75.80   75.22       75.70   67.33
77.53   73.99       72.49   71.71
72.30   76.56       69.98   72.90
77.25   78.31       70.15   70.14
75.06   76.06       74.09   68.78
74.82   75.28       72.91   72.49
76.67   74.39       75.40   76.47
76.79   77.57       69.38   75.47
75.85   77.31       71.37   74.12

Clearly, making the proper decision hinges on our ability to answer these questions with confidence.

1.1.2 The Essence of the Problem

Observe that the real essence of the problem is random variability: if each experiment had resulted in the same single, constant number for YA and another for YB, the problem would be deterministic in character, and each of the 3 associated questions would be trivial to answer. Instead, the random phenomena inherent in the experimental determination of the true process yields have been manifested in the observed variability, so that we are uncertain about the true values of YA and YB, making it not quite as trivial to solve the problem.

The sources of variability in this case can be shown to include the measurement procedure, the measurement device itself, raw materials, and process conditions. The observed variability is therefore intrinsic to the problem and cannot be idealized away. There is no other way to solve this problem rationally without dealing directly with the random variability.

Next, note that YA and YB data (observations) take on values on a continuous scale, i.e. yield values are real and can be located anywhere on the real line, as opposed to quantities that can take on integer values only (as is the case with the second example discussed later). The variables YA and YB are therefore said to be "continuous," and this example illustrates decision-making under uncertainty when the random phenomena in question involve continuous variables.

The main issues with this problem are as follows:

1. Characterization: How should the quantities YA and YB be characterized so that the questions raised above can be answered properly?

2. Quantification: Are there such things as "true values" of the quantities YA and YB? If so, how should these true values be best quantified?

3. Application: How should the characterization and quantification of YA and YB be used to answer the 3 questions raised above?

1.1.3 Preliminary Intuitive Notions

Before outlining procedures for solving this problem, it is helpful to entertain some notions that the intuition of a good scientist or engineer will suggest. For instance, the concept of the arithmetic average of a collection of n data points, x1, x2, x3, . . . , xn, defined by:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (1.1)


is well known to all scientists and engineers, and the intuitive notion of employing this single computed value to represent the data set is almost instinctive. It seems reasonable therefore to consider representing YA with the computed average obtained from the data, i.e. ȳA = 75.52, and similarly, representing YB with ȳB = 72.47. We may now observe right away that ȳA > ȳB, which now seems to suggest not only that YA > YB, but, since ȳA - ȳB = 3.05, that the difference in fact exceeds the threshold of 2%.
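(For readers who wish to verify these two averages computationally, the following minimal Python sketch is one way to do so. The lists yA and yB are transcribed directly from Table 1.1; the variable names themselves are ours, not part of the original problem statement.)

    # Arithmetic averages (Eq 1.1) of the Table 1.1 yield data.
    yA = [74.04, 75.63, 77.21, 74.23, 76.58, 75.05, 75.69, 75.19, 75.37, 74.47,
          73.99, 74.90, 75.78, 75.09, 73.88, 76.98, 75.80, 77.53, 72.30, 77.25,
          75.06, 74.82, 76.67, 76.79, 75.85, 75.29, 75.92, 75.07, 74.92, 77.77,
          74.90, 75.31, 77.93, 74.78, 72.99, 73.32, 74.88, 79.07, 73.87, 74.23,
          74.85, 75.22, 73.99, 76.56, 78.31, 76.06, 75.28, 74.39, 77.57, 77.31]
    yB = [75.75, 74.19, 68.10, 70.14, 74.17, 70.09, 72.63, 71.16, 70.27, 75.82,
          72.14, 74.88, 70.89, 72.39, 74.94, 75.64, 75.70, 72.49, 69.98, 70.15,
          74.09, 72.91, 75.40, 69.38, 71.37, 68.41, 68.10, 69.23, 69.23, 70.24,
          71.91, 78.41, 73.37, 73.64, 74.42, 78.49, 76.33, 71.07, 72.04, 70.02,
          74.62, 67.33, 71.71, 72.90, 70.14, 68.78, 72.49, 76.47, 75.47, 74.12]
    print(sum(yA) / len(yA), sum(yB) / len(yB))   # approximately 75.52 and 72.47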
As intuitively appealing as these arguments might be, they raise some important additional questions:

1. The variability of individual values of the data yAi around the average value ȳA = 75.52 is noticeable; that of yBi around the average value ȳB = 72.47, even more so. How confident then are we in the arguments presented above, and in the implied recommendation to prefer process A to B, based as they are on the computed averages? (For example, there are some 8 values of yBi greater than ȳA; what should we make of this fact?)
2. Will it (or should it) matter that

72.30 ≤ yAi ≤ 79.07
67.33 ≤ yBi ≤ 78.41 \qquad (1.2)

so that the observed data are seen to vary over a range of yield values that is 11.08 units wide for process B, as opposed to 6.77 for A? The averages give no indication of these extents of variability.
3. More fundamentally, is it always a good idea to work with averages? How reasonable is it to characterize the entire data set with the average?

4. If new sets of data are gathered, the new averages computed from them will almost surely differ from the corresponding values computed from the current set of data shown here. Observe therefore that the computed averages ȳA and ȳB are themselves clearly subject to random variability. How can we then be sure that using averages offers any advantages, since, like the original data, these averages are also not free from random variability?

5. How were the data themselves collected? What does it mean concretely that the 50 experiments were carefully performed? Is it possible that the experimental protocols used may have impaired our ability to answer the questions posed above adequately? Conversely, are there protocols that are particularly calibrated to improve our ability to answer these questions adequately?
Obviously, therefore, there is a lot more to dealing with this example problem than merely using the intuitively appealing notion of averages.

Let us now consider a second, different but somewhat complementary, example.

TABLE 1.2: Number of inclusions on sixty 1-sq meter glass sheets

0 1 1 1 0 0 1 0 2 2
2 0 2 2 3 2 0 0 2 0
1 2 0 1 0 1 0 0 1 1
1 1 5 2 0 0 1 4 1 1
2 1 0 0 1 1 0 0 1 1
1 0 0 2 4 0 1 1 0 1

1.2 Quality Assurance in a Glass Sheet Manufacturing Process

A key measure of product quality in a glass sheet manufacturing process is the optical attribute known as inclusions: particulate flaws (of size exceeding 0.5 μm) included in an otherwise perfectly clear glass sheet. While it is all but inevitable to find inclusions in some products, the best manufacturers produce remarkably few glass sheets with these imperfections; and even then, the actual number of inclusions on these imperfect sheets is itself usually very low, perhaps 1 or 2.
The specific example in question involves a manufacturer of 1 sq. meter sheets of glass used for various types of building windows. Prior to shipping a batch of manufactured product to customers, a sample of glass sheets from the batch is sent to the company's Quality Control (QC) laboratory, where an optical scanning device is used to determine X, the number of inclusions in each square-meter sheet. The results for 60 samples from a particular batch are shown in Table 1.2.
This particular set of results caused the supervising QC engineer some concern for the following reasons:

1. Historically, the manufacturing process hardly ever produces sheets with more than 3 inclusions per square meter; this batch of 60 has three such sheets: two with 4 inclusions, and one with 5.

2. Each 1 sq-m sheet with 3 or fewer inclusions is acceptable and can be sold to customers unconditionally; sheets with 4 inclusions are marginally acceptable so long as a batch of 1000 sheets does not contain more than 20 such sheets; a sheet with 5 or more inclusions is unacceptable and cannot be shipped to customers. All such sheets found by a customer are sent back to the manufacturer (at the manufacturer's expense) for a full refund. The specific sheet of this type contained in this sample of 60 must therefore be found and eliminated.
3. More importantly, the manufacturing process was designed such that, when operated properly, there will be no more than 3 unacceptable sheets (with 5 or more inclusions) in each batch of 1000 sheets. The process will be uneconomical otherwise.
The question of interest is this: Does the QC engineer have a reason to be concerned? Or, stated mathematically: if X* is the design value of the number of inclusions per sq. m. associated with the sheets produced by this process, is there evidence in this sample data that X > X*, so that steps will have to be taken to identify the source of this process performance degradation and then to rectify the problem in order to improve the process performance?
As with the first problem, the primary issue here is also the randomness associated with the variable of interest, X, the number of inclusions per square meter of each glass sheet. The value observed for X in the QC lab is a randomly varying quantity, not fixed and deterministic. In this case, however, there is little or no contribution to the observed variability from the measurement device: these particulate inclusions are relatively few in number and are easily counted without error by the optical device. The variability in raw material characteristics, and in particular the control system's effectiveness in maintaining the process conditions at desired values (in the face of inevitable and unpredictable disturbances to the process), all contribute to whether or not there are imperfections, how many there are per square meter, and where they are located on the sheet. Some sheets come out flawless while others end up with a varying number of inclusions that cannot be predicted precisely a priori. Thus, once again, the observed variability must be dealt with directly because it is intrinsic to the problem and cannot be idealized away.

Next, note that the data in Table 1.2, being counts of distinct entities, take on integer values. The variable X is therefore said to be discrete, so that this example illustrates decision-making when the random phenomena in question involve discrete variables.

1.3 Outline of a Systematic Approach

Even though the two illustrative problems presented above are different in so many ways (one involves continuous variables, the other a discrete variable; one is concerned with comparing two entities to each other, the other pits a single set of data against a design target), the systematic approach to solving such problems provided by probability and statistics applies to both in a unified way. The fundamental issues at stake may be stated as follows:

In light of their defining characteristic of intrinsic variability, how should randomly varying quantities be characterized and quantified precisely in order to facilitate the solution of practical problems involving them?

TABLE 1.3: Group classification and frequencies for YA data (from the proposed process)

YA group      Frequency   Relative Frequency
71.51-72.50        1            0.02
72.51-73.50        2            0.04
73.51-74.50        9            0.18
74.51-75.50       17            0.34
75.51-76.50        7            0.14
76.51-77.50        8            0.16
77.51-78.50        5            0.10
78.51-79.50        1            0.02
TOTAL             50            1.00

What now follows is a somewhat informal examination of the ideas and concepts behind these time-tested techniques. The purpose is to motivate and
provide context for the more formal discussions in upcoming chapters.

1.3.1 Group Classification and Frequency Distributions

Let us revisit the example data sets and consider the following alternative approach to data representation. Instead of focusing on individual observations as presented in the tables of raw data, what if we sub-divided the observations into small groups (called bins) and re-organized the raw data in terms of how frequently members of each group occur? One possible result is shown in Tables 1.3 and 1.4, respectively, for process A and process B. (A different bin size will lead to a slightly different group classification, but the principles remain the same.)

This reclassification indicates, for instance, that for YA, there is only one observation between 71.51 and 72.50 (the actual number is 72.30), but there are 17 observations between 74.51 and 75.50; for YB, on the other hand, 3 observations fall in the [67.51-68.50] group, whereas there are 8 observations between 69.51 and 70.50. The relative frequency column indicates what proportion of the original 50 data points is found in each group. A plot of this reorganization of the data, known as the histogram, is shown in Figure 1.1 for YA and Figure 1.2 for YB.
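(The group classification of Table 1.3 is easy to reproduce with a short Python sketch like the one below, which reuses the yA list from the earlier sketch; the bin edges are those of Table 1.3.)

    # Frequencies and relative frequencies for the YA data, per Table 1.3.
    for k in range(8):
        lo, hi = 71.51 + k, 72.50 + k    # bins 71.51-72.50, ..., 78.51-79.50
        freq = sum(1 for y in yA if lo <= y <= hi)
        print(f"{lo:.2f}-{hi:.2f}  {freq:2d}  {freq/len(yA):.2f}")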
The histogram, a term first used by Pearson in 1895, is a graphical representation of data from a group-classification and frequency-of-occurrence perspective. Each bar represents a distinct group (or class) within the data set, with the bar height proportional to the group frequency. Because this graphical representation provides a picture of how the data are distributed in terms of the frequency of occurrence of each group (how much each group contributes to the data set), it is often referred to as a frequency distribution of the data.

TABLE 1.4: Group classification and frequencies for YB data (from the incumbent process)

YB group      Frequency   Relative Frequency
66.51-67.50        1            0.02
67.51-68.50        3            0.06
68.51-69.50        4            0.08
69.51-70.50        8            0.16
70.51-71.50        4            0.08
71.51-72.50        7            0.14
72.51-73.50        4            0.08
73.51-74.50        6            0.12
74.51-75.50        5            0.10
75.51-76.50        6            0.12
76.51-77.50        0            0.00
77.51-78.50        2            0.04
78.51-79.50        0            0.00
TOTAL             50            1.00

FIGURE 1.1: Histogram for YA data

FIGURE 1.2: Histogram for YB data

A key advantage of such a representation is how clearly it portrays the nature of the variability associated with each variable. For example, we easily see from Fig 1.1 that the center of action for the YA data is somewhere around the group whose bar is centered around 75 (i.e. in the interval [74.51, 75.50]). Furthermore, most of the values of YA cluster in the 4 central groups centered around 74, 75, 76 and 77. In fact, 41 out of the 50 observations, or 82%, fall into these 4 groups; groups further away from the center of action (to the left as well as to the right) contribute less to the YA data. Similarly, Fig 1.2 shows that the center of action for the YB data is located somewhere around the group in the [71.51, 72.50] interval, but it is not as sharply defined as it was with YA. Also, the values of YB are more spread out and do not cluster as tightly around this central group.
The histogram also provides quantitative insight. For example, we see that 38 of the 50 YA observations (or 76%) are greater than 74.51; only 13 of the 50 YB observations (or 26%) fall into this category. Also, exactly 0% of the YA observations are less than or equal to 71.50, compared with 20 out of 50 (a staggering 40%) of the YB observations. Thus, if these data sets can be considered representative of the overall performance of each process, then it is reasonable to conclude, for example, that there is a better chance of obtaining yields greater than 74.50 with process A than with process B (a 76% chance compared to a 26% chance). Similarly, while it is highly unlikely that process A will ever return yields less than 71.50, there is a not-insignificant chance (40%) that the yield obtained from process B will be less than 71.50. What is thus beginning to emerge are the faint outlines of a rigorous framework for characterizing and quantifying random variability, with the histogram providing this first glimpse.
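(These empirical proportions come straight from counting. A two-line Python sketch, again assuming the yA and yB lists from the earlier sketch, is:)

    # Fractions of each data set exceeding 74.50 (the 76% vs 26% comparison).
    print(sum(1 for y in yA if y > 74.50) / len(yA))   # 0.76
    print(sum(1 for y in yB if y > 74.50) / len(yB))   # 0.26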

TABLE 1.5: Group classification and frequencies for the inclusions data

X       Frequency   Relative Frequency
0           22           0.367
1           23           0.383
2           11           0.183
3            1           0.017
4            2           0.033
5            1           0.017
6            0           0.000
TOTAL       60           1.000

It is important to note that the advantage provided by the histogram comes at the expense of losing the individuality of each observation. Having gone from 50 raw observations each to 8 groups for YA, and a slightly larger 12 groups for YB, there is clearly a loss of resolution: the individual identities of the original observations are no longer visible from the histogram. (For example, the identities of each of the 17 YA observations that make up the group in the interval [74.51, 75.50] have been melded into that of a single, monolithic bar in the histogram.) But this is not necessarily a bad thing. As we demonstrate in upcoming chapters, a fundamental tenet of the probabilistic approach to dealing with randomly varying phenomena is an abandonment of the individual observation as the basis for theoretical characterization, in favor of an ensemble description. For now, it suffices to be able to see from this example that the clarity with which the histogram portrays data variability has been achieved by trading off the individual observation's identity for the ensemble identity of groups. But keep in mind that what the histogram offers is simply an alternative (albeit more informative) way of representing the identical information contained in the data tables.
Let us now return to the second problem. In this case, the group classification and frequency distribution for the raw inclusions data are shown in Table 1.5. Let it not be lost on the reader that, while the groups for the yield data sets were created from intervals of finite length, no such quantization is necessary for the inclusions data since, in this case, the variable of interest, X, is naturally discrete. This fundamental difference between continuous variables (such as YA and YB) and discrete variables (such as X) will continue to surface at various stages in subsequent discussions.
The histogram for the inclusions data is shown in Fig 1.3, where several characteristics are now clear: for example, 75% of the glass sheets (45 out of 60) are either perfect or have only a single (almost inconsequential) inclusion; only 5% of the glass sheets (3 out of 60) have more than 3 inclusions, while the remaining 95% have 3 or fewer; and 93.3% (56 out of 60) have 2 or fewer inclusions. The important point is that such quantitative characteristics of the data variability (made possible by the histogram) are potentially useful for answering practical questions about what one can reasonably expect from this process.

FIGURE 1.3: Histogram of inclusions data
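(The tallies of Table 1.5, and hence the characteristics just quoted, can be reproduced with a minimal Python sketch; the 60 values below are transcribed row by row from Table 1.2.)

    # Frequency and relative frequency of each inclusion count (Table 1.5).
    from collections import Counter
    x = [0,1,1,1,0,0,1,0,2,2,  2,0,2,2,3,2,0,0,2,0,
         1,2,0,1,0,1,0,0,1,1,  1,1,5,2,0,0,1,4,1,1,
         2,1,0,0,1,1,0,0,1,1,  1,0,0,2,4,0,1,1,0,1]
    counts = Counter(x)
    for k in range(7):
        print(k, counts[k], round(counts[k] / len(x), 3))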

1.3.2 Theoretical Distributions

How can the benefits of the histogram be consolidated into a useful tool for quantitative analysis of randomly varying phenomena? The answer: by appealing to a fundamental axiom of random phenomena: that, conceptually, as more observations are made, the shape of the data histogram stabilizes, and tends to the form of the theoretical distribution that characterizes the random phenomenon in question, in the limit as the total number of observations approaches infinity. It is important to note that this concept does not necessarily require that an infinite number of observations actually be obtained in practice, even if this were possible. The essence of the concept is that an underlying theoretical distribution exists for which the frequency distribution represented by the histogram is but a finite sample approximation; that the underlying theoretical distribution is an ideal model of the particular phenomenon responsible for generating the finite number of observations contained in the current data set; and hence that this theoretical distribution provides a reasonable mathematical characterization of the random phenomenon.

As we show later, these theoretical distributions may be derived from first principles given sufficient knowledge regarding the underlying random phenomena. And, as the brief informal examination of the illustrative histograms above indicates, these theoretical distributions can be used for various things. For example, even though we have not yet provided any concrete definition of the term probability, nor given any concrete justification of its usage in this context, still, from the discussion in the previous section, the reader can intuitively attest to the reasonableness of the following statements: the probability that YA > 74.5 is 0.76; or the probability that YB > 74.5 is 0.26; or the probability that X ≤ 1 is 0.75. Parts II and III are devoted to establishing these ideas more concretely and more precisely.
A Preview

It turns out that the theoretical distribution for each yield data set is:

f(y|\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}; \quad -\infty < y < \infty \qquad (1.3)

which, when superimposed on each histogram, is shown in Fig 1.4 for YA and Fig 1.5 for YB, when the indicated characteristic parameters are specified as μ = 75.52, σ = 1.43 for YA, and μ = 72.47, σ = 2.76 for YB.
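(A minimal Python sketch for evaluating Eq (1.3), here at the bin centers of Table 1.3 with the YA parameters just quoted, might read as follows; the function name is ours.)

    import math

    def f_normal(y, mu, sigma):
        # Eq (1.3): the theoretical distribution for the yield data.
        return math.exp(-(y - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

    for y in range(72, 80):    # bin centers 72, 73, ..., 79
        print(y, round(f_normal(y, 75.52, 1.43), 3))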
Similarly, the theoretical distribution for the inclusions data is:

f(x|\lambda) = \frac{e^{-\lambda}\lambda^x}{x!}; \quad x = 0, 1, 2, \ldots \qquad (1.4)

where the characteristic parameter λ = 1.02 is the average number of inclusions in each glass sheet. In similar fashion to Eq (1.3), it also provides a theoretical characterization and quantification of the random phenomenon responsible for the variability observed in the inclusions data. From it we are able, for example, to compute the theoretical probabilities of observing 0, 1, 2, . . ., inclusions in any one glass sheet manufactured by this process. A plot of this theoretical probability distribution function is shown in Fig 1.6 (compare with the histogram in Fig 1.3).
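(Eq (1.4) is just as easy to evaluate. The sketch below prints the theoretical probabilities for x = 0 through 6, which can then be set alongside the relative frequencies in Table 1.5; again the function name is ours.)

    import math

    def f_poisson(x, lam):
        # Eq (1.4): probability of exactly x inclusions, with parameter lam.
        return math.exp(-lam) * lam**x / math.factorial(x)

    for x in range(7):
        print(x, round(f_poisson(x, 1.02), 3))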
The full detail of precisely what all this means is discussed in subsequent chapters; for now, this brief preview serves the purpose of simply indicating how the expressions in Eqs (1.3) and (1.4) provide a theoretical means of characterizing (and quantifying) the random phenomena involved, respectively, in the yield data and in the inclusions data. Expressions such as these are called probability distribution functions (pdfs), and they provide the basis for rational analysis of random variability via the concept of probability. Precisely what this concept of probability is, how it gives rise to pdfs, and how pdfs are used to solve practical problems and provide answers to the sorts of questions posed by these illustrative examples, constitute the primary focus of the remaining chapters in the book.

At this point, it is best to defer the rest of the discussion until we revisit these two problems at appropriate places in upcoming chapters, where we show that:
1. YA indeed may be considered as greater than YB; and, in particular, that YA - YB > 2, up to a specific, quantifiable degree of confidence;




FIGURE 1.4: Histogram for YA data with superimposed theoretical distribution (Normal; mean 75.52, std. dev. 1.432, N = 50)

FIGURE 1.5: Histogram for YB data with superimposed theoretical distribution (Normal; mean 72.47, std. dev. 2.764, N = 50)

FIGURE 1.6: Theoretical probability distribution function for a Poisson random variable with parameter λ = 1.02. Compare with the inclusions data histogram in Fig 1.3

2. There is in fact no evidence in the inclusions data to suggest that the process has deviated from its design target; i.e., that there is no reason to believe that X ≠ X*, again up to a specific, quantifiable degree of confidence.

1.4 Summary and Conclusions

We have introduced two practical problems in this chapter to illustrate the complications caused by the presence of randomly varying phenomena in engineering problems. One problem involved determining which of two continuous variables is larger; the other involved determining if a discrete variable has deviated from its design target. Without the presence of random variability, each problem would ordinarily have been trivial to solve. However, with intrinsic variability that could not be idealized away, it became clear that special techniques capable of coping explicitly with randomly varying phenomena would be required to solve these problems satisfactorily. We did not solve the problems, of course (that is reserved for later); we simply provided an outline of a systematic approach to solving them, which required introducing some concepts that are to be explored fully later. As a result, the very brief introduction of the frequency distribution, the graphical histogram, and the theoretical distribution function was intended to serve merely as a preview of upcoming detailed discussions concerning how randomly varying phenomena are analyzed systematically.
Here are some of the main points of the chapter again:

• The presence of random variability often complicates otherwise straightforward problems, so that specialized solution techniques are required;

• Frequency distributions and histograms provide a particularly informative perspective on the random variations intrinsic to experimental data;

• The probability distribution function, the theoretical limit to which the frequency distribution (and histogram) tends, provides the basis for systematic analysis of randomly varying phenomena.

REVIEW QUESTIONS
1. What decision is to be made in the yield improvement problem of Section 1.1?
2. What are the economic factors to be taken into consideration in deciding what
to do with the yield improvement problem?
3. What is the essence of the yield improvement problem as discussed in Section
1.1?
4. What are some of the sources of variability associated with the process yields?
5. Why are the yield variables, YA and YB , continuous variables?
6. What single value is suggested as intuitive for representing a collection of n data points, x1, x2, . . . , xn?

7. What are some of the issues raised by entertaining the idea of representing the yield data sets with the arithmetic averages ȳA and ȳB?
8. Why is the number of inclusions found on each glass sheet a discrete variable?
9. What are some sources of variability associated with the glass manufacturing process which may ultimately be responsible for the variability observed in the number
of inclusions?
10. What is a frequency distribution and how is it obtained from raw data?
11. Why will bin size aect the appearance of a frequency distribution?
12. What is a histogram and how is it obtained from data?
13. What is the primary advantage of a histogram over a table of raw data?

14. What is the relationship between a histogram and a theoretical distribution?

15. What are the expressions in Eqs (1.3) and (1.4) called? These equations provide the basis for what?

EXERCISES

Section 1.1

1.1 The variance of a collection of n data points, y1, y2, . . . , yn, is defined as:

s^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1} \qquad (1.5)

where ȳ is the arithmetic average of the data set. From the yield data in Table 1.1, obtain the variances s_A^2 and s_B^2 for the YA and YB data sets, respectively. Which is greater, s_A^2 or s_B^2?
1.2 Even though the data sets in Table 1.1 were not generated in pairs, obtain the 50 differences,

d_i = y_{Ai} - y_{Bi}; \quad i = 1, 2, \ldots, 50, \qquad (1.6)

for corresponding values of YA and YB as presented in this table. Obtain a histogram of the di and compute the arithmetic average,

\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i. \qquad (1.7)

What do these results suggest about the possibility that YA may be greater than YB?
1.3 A set of theoretical results to be established later (see Chapter 4 Exercises) states that, for d_i and \bar{d} defined in Eqs (1.6) and (1.7), and the variance s^2 defined in Exercise 1.1,

\bar{d} = \bar{y}_A - \bar{y}_B \qquad (1.8)

s_d^2 = s_A^2 + s_B^2 \qquad (1.9)

Confirm these results specifically for the data in Table 1.1.


Section 1.2

1.4 From the data in Table 1.2, obtain s_x^2, the variance of the inclusions data.

1.5 The random variable, X, representing the number of inclusions, is purported to be a Poisson random variable (see Chapter 8). If true, then the average, x̄, and variance, s_x^2, are theoretically equal. Compare the values computed for these two quantities from the data set in Table 1.2. What do these results suggest about the possibility that X may in fact be a Poisson random variable?
Section 1.3

1.6 Using a bin size of 0.75, obtain relative frequencies for the YA and YB data and the corresponding histograms. Repeat this exercise for a bin size of 2.0. Compare these two sets of histograms with the corresponding histograms in Figs 1.1 and 1.2.
1.7 From the frequency distribution in Table 1.3 and the values computed for the average, ȳA, and variance, s_A^2, of the yield data set YA, determine the percentage of the data contained in the interval ȳA ± 1.96 s_A, where s_A is the positive square root of the variance, s_A^2.

1.8 Repeat Exercise 1.7 for the YB data in Table 1.4. Determine the percentage of the data contained in the interval ȳB ± 1.96 s_B.
1.9 From Table 1.5, determine the value of x such that only 5% of the data exceeds this value.

1.10 Using μ = 75.52 and σ = 1.43, compute theoretical values of the function in Eq (1.3) at the center points of the frequency groups for the YA data in Table 1.3; i.e., for y = 72, 73, . . . , 79. Compare these theoretical values with the corresponding relative frequency values.

1.11 Repeat Exercise 1.10 for the YB data and Table 1.4.

1.12 Using λ = 1.02, compute theoretical values of the function f(x|λ) in Eq (1.4) at x = 0, 1, 2, . . . , 6 and compare with the corresponding relative frequency values in Table 1.5.

APPLICATION PROBLEMS

1.13 The data set in the table below is the time (in months) from receipt to publication (sometimes known as time-to-publication) of 85 papers published in the January 2004 issue of a leading chemical engineering research journal.
19.2  15.1   9.6   4.2   5.4
 9.0   5.3  12.9   4.2  15.2
17.2  12.0  17.3   7.8   8.0
 8.2   3.0   6.0   9.5  11.7
 4.5  18.5  24.3   3.9  17.2
13.5   5.8  21.3   8.7   4.0
20.7   6.8  19.3   5.9   3.8
 7.9  14.5   2.5   5.3   7.4
19.5   3.3   9.1   1.8   5.3
 8.8  11.1   8.1  10.1  10.6
18.7  16.4   9.8  10.0  15.2
 7.4   7.3  15.4  18.7  11.5
 9.7   7.4  15.7   5.6   5.9
13.7   7.3   8.2   3.3  20.1
 8.1   5.2   8.8   7.3  12.2
 8.4  10.2   7.2  11.3  12.0
10.8   3.1  12.8   2.9   8.8

(i) Generate a histogram of this data set. Comment on the shape of this histogram and why, from the nature of the variable in question, such a shape may not be surprising.
(ii) From the histogram of the data, what is the "most popular" time-to-publication, and what fraction of the papers took longer than this to publish?
1.14 Refer to Problem 1.13. Let each raw data entry in the data table be xi.

(i) Generate a set of 85 sample average publication times, ȳj, from 20 consecutive times as follows:

\bar{y}_1 = \frac{1}{20}\sum_{i=1}^{20} x_i \qquad (1.10)

\bar{y}_2 = \frac{1}{20}\sum_{i=2}^{21} x_i \qquad (1.11)

\bar{y}_3 = \frac{1}{20}\sum_{i=3}^{22} x_i \qquad (1.12)

\vdots

\bar{y}_j = \frac{1}{20}\sum_{i=j}^{20+(j-1)} x_i \qquad (1.13)

For values of j ≥ 66, ȳj should be obtained by replacing x86, x87, x88, . . ., which do not exist, with x1, x2, x3, . . ., respectively (i.e., for these purposes treat the given xi data like a circular array). Plot the histogram for this generated ȳj data and compare the shape of this histogram with that of the original xi data.
(ii) Repeat part (i) above, this time for zj data generated from:

z_j = \frac{1}{20}\sum_{i=j}^{20+(j-1)} \bar{y}_i \qquad (1.14)

for j = 1, 2, . . . , 85. Compare the histogram of the zj data with that of the ȳj data and comment on the effect of averaging on the shape of the data histograms.
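(For concreteness, here is one possible Python sketch of the circular 20-point averaging described in this problem, offered only as an illustration of the mechanics, not as the required solution; the function name is ours, and we assume part (ii) uses the same circular convention.)

    def circular_moving_average(data, window=20):
        # Average of data[j], ..., data[j+19], wrapping past the end,
        # per the "circular array" convention of Eqs (1.10)-(1.13).
        n = len(data)
        return [sum(data[(j + k) % n] for k in range(window)) / window
                for j in range(n)]

Applied once to the 85 raw times it yields a series like the ȳj of part (i); applied again to that result, a series like the zj of part (ii).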
1.15 The data shown in the table below is a four-year record of the number of recordable safety incidents occurring at a plant site each month.

1 0 0 0 2 2 0 0 0 1 0 1
0 1 0 1 0 0 0 0 0 0 0 1
2 2 0 1 2 0 1 2 1 1 0 0
0 1 0 0 0 0 0 0 1 0 0 1

(i) Find the average number of safety incidents per month and the associated variance. Construct a frequency table of the data and plot a histogram.
(ii) From the frequency table and the histogram, what can you say about the
chances of obtaining each of the following observations, where x represents the
number of observed safety incidents per month: x = 0, x = 1, x = 2, x = 3, x = 4
and x = 5?
(iii) Consider the postulate that a reasonable model for this phenomenon is:

f(x) = \frac{e^{-0.5}\, 0.5^x}{x!} \qquad (1.15)

where f(x) represents the theoretical probability of recording exactly x safety incidents per month. How well does this model fit the data?

(iv) Assuming that this is a reasonable model, discuss how you would use it to answer the question: If, over the most recent four-month period, the plant recorded 1, 3, 2, 3 safety incidents respectively, is there evidence that there has been a "real" increase in the number of safety incidents?
1.16 The table below shows a record of the "Before" and "After" weights (in pounds) of 20 patients enrolled in a clinically-supervised ten-week weight-loss program.

Patient #          1    2    3    4    5    6    7    8    9   10
Before Wt (lbs)  272  319  253  325  236  233  300  260  268  276
After Wt (lbs)   263  313  251  312  227  227  290  251  262  263

Patient #         11   12   13   14   15   16   17   18   19   20
Before Wt (lbs)  215  245  248  364  301  203  197  217  210  223
After Wt (lbs)   206  235  237  350  288  195  193  216  202  214

Let XB represent the "Before" weight and XA the "After" weight.

(i) Using the same bin size for each data set, obtain histograms for the XB and XA data and plot both on the same graph. Strictly on the basis of a visual inspection of these histograms, what can you say about the effectiveness of the weight-loss program in achieving its objective of assisting patients to lose weight?

(ii) Define the difference variable, D = XB - XA, and, from the given data, obtain and plot a histogram for this variable. Again, strictly from a visual inspection of this histogram, what can you say about the effectiveness of the weight-loss program?
1.17 The data shown in the following table is from an Assisted Reproductive Technologies clinic where a cohort of 100 patients under the age of 35 years (the "Younger" group), and another cohort, 35 years and older (the "Older" group), each received five embryos in an in-vitro fertilization (IVF) treatment cycle.
x = No. of live births in a delivered pregnancy
yO = Total no. of "older" patients (out of 100) with pregnancy outcome x
yY = Total no. of "younger" patients (out of 100) with pregnancy outcome x

 x    yO    yY
 0    32     8
 1    41    25
 2    21    35
 3     5    23
 4     1     8
 5     0     1

The data show x, the number of live births per delivered pregnancy, along with how many patients in each group had the pregnancy outcome x. For example, the first entry indicates that the IVF treatment was unsuccessful for 32 of the "older" patients, with the corresponding number being 8 for the "younger" patients; 41 "older" patients delivered singletons, compared with 25 for the "younger" patients; 21 "older" patients and 35 "younger" patients each delivered twins; etc. Obtain a relative frequency distribution for these data sets and plot the corresponding histograms. Determine the average number of live births per delivered pregnancy for each group and compare these values. Comment on whether or not these data sets indicate that the outcomes of the IVF treatments are different for these two groups.

Chapter 2

Random Phenomena, Variability and Uncertainty

2.1 Two Extreme Idealizations of Natural Phenomena
    2.1.1 Introduction
    2.1.2 A Chemical Engineering Illustration
          Determinism and the PFR
          Randomness and the CSTR
          Theoretical Analysis of the Ideal CSTR's Residence Time
2.2 Random Mass Phenomena
    2.2.1 Defining Characteristics
    2.2.2 Variability and Uncertainty
    2.2.3 Practical Problems of Interest
2.3 Introducing Probability
    2.3.1 Basic Concepts
    2.3.2 Interpreting Probability
          Classical (À-Priori) Probability
          Relative Frequency (À-Posteriori) Probability
          Subjective Probability
2.4 The Probabilistic Framework
2.5 Summary and Conclusions
    REVIEW QUESTIONS
    EXERCISES
    APPLICATION PROBLEMS

Through the great beneficence of Providence, what is given to be foreseen in the general sphere of masses escapes us in the confined sphere of individuals.

Joannè-Erhard Valentin-Smith (1796-1891)

When John Stuart Mill stated in his 1862 book, A System of Logic: Ratiocinative and Inductive, that "... the very events which in their own nature appear most capricious and uncertain, and which in any individual case no attainable degree of knowledge would enable us to foresee, occur, when considerable numbers are taken into account, with a degree of regularity approaching to mathematical ...", he was merely articulating, astutely for the time, the then-radical, but now well-accepted, concept that randomness in scientific observation is not a synonym for disorder; it is order of a different kind. The more familiar kind of order informs determinism: the concept that, with sufficient mechanistic knowledge, all physical phenomena are entirely predictable and thus describable by precise mathematical equations. But even classical physics, that archetypal deterministic science, had to make room for this other kind of order when quantum physicists of the 1920s discovered that fundamental particles of nature exhibit irreducible uncertainty (or chance) in their locations, movements and interactions. And today, most contemporary scientists and engineers are, by training, conditioned to accept both determinism and randomness as intrinsic aspects of the experiential world. The problem, however, is that to many, the basic characteristics of random phenomena and their order of a different kind still remain somewhat unfamiliar at a fundamental level.

This chapter is devoted to an expository examination of randomly varying phenomena. Its primary purpose is to introduce the reader to the central characteristic of order-in-the-midst-of-variability, and the sort of analysis this trait permits, before diving headlong into a formal study of probability and statistics. The premise of this chapter is that a true appreciation of the nature of randomly varying phenomena at a fundamental level is indispensable to the sort of clear understanding of probability and statistics that will protect the diligent reader from all-too-common misapplication pitfalls.

2.1 Two Extreme Idealizations of Natural Phenomena

2.1.1 Introduction

In classical physics, the distance, x (in meters), traveled in t seconds by an object launched with an initial velocity of u m/s, and which accelerates at a m/s², is known to be given by the expression:

x = ut + \frac{1}{2}at^2 \qquad (2.1)
This is a deterministic expression: it consistently and repeatably produces the same result every time identical values of the variables u, a, and t are specified. The same is true of the expression used in engineering to determine Q, the rate of heat loss from a house, say in the middle of winter, when the total exposed surface area is A m², the inside temperature is Ti K, the outside temperature is To K, and the combined heat transfer characteristics of the house walls, insulation, etc., are represented by the so-called overall heat transfer coefficient, U W/m²K, i.e.

Q = UA(T_i - T_o) \qquad (2.2)

The rate of heat loss is determined precisely and consistently for any given specific values of each entity on the right-hand side of this equation.
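(A trivial Python sketch makes the point concrete; the numerical inputs below are purely illustrative and not drawn from any particular problem.)

    # Eqs (2.1) and (2.2): identical inputs always return identical outputs.
    u, a, t = 5.0, 9.81, 2.0                    # illustrative values only
    x = u * t + 0.5 * a * t**2                  # distance traveled, Eq (2.1)
    U, A, Ti, To = 0.35, 250.0, 293.0, 263.0    # illustrative values only
    Q = U * A * (Ti - To)                       # rate of heat loss, Eq (2.2)
    print(x, Q)                                 # the same numbers, every time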
The concept of determinism, that the phenomenon in question is precisely determinable in every relevant detail, is central to much of science and engineering and has proven quite useful in analyzing real systems and in solving practical problems, whether it is computing the trajectory of rockets for launching satellites into orbit, installing appropriate insulation for homes, or designing chemical reactors. However, any assumption of strict determinism in nature is implicitly understood as a convenient idealization resulting from neglecting certain details considered non-essential to the core problem. For example, the capriciousness of the wind and its various and sundry effects have been ignored in Eqs (2.1) and (2.2): no significant wind resistance (or assistance) in the former, negligible convective heat transfer in the latter.
At the other extreme is randomness, where the relevant details of the phenomenon in question are indeterminable precisely; repeated observations under identical conditions produce different and randomly varying results; and the observed random variability is essential to the problem and therefore cannot be idealized away. Such is the case with the illustrative problems in Chapter 1 where, in one case, the yield obtained from each process may be idealized as follows:

y_{Ai} = \theta_A + \epsilon_{Ai} \qquad (2.3)

y_{Bi} = \theta_B + \epsilon_{Bi} \qquad (2.4)

with θA and θB representing the true but unknown yields obtainable from processes A and B respectively, and εAi and εBi representing the superimposed randomly varying components, the sources of the random variability evident in each observation yAi and yBi. Identical values of θA do not produce identical values of yAi in Eq (2.3); neither will identical values of θB produce identical values of yBi in Eq (2.4). In the second case of the glass process and the number of inclusions per square meter, the idealization is:

x_i = \lambda + \epsilon_i \qquad (2.5)

where λ is the true number of inclusions associated with the process and εi is the superimposed random component responsible for the observed randomness in the actual number of inclusions, xi, found on each individual glass sheet upon inspection.
These two perspectives, determinism and randomness, are thus two opposite idealizations of natural phenomena: the former applies when the deterministic aspects of the phenomenon are considered to be overwhelmingly dominant over any random components; the latter, when the random components are dominant and central to the problem. The principles behind each conceptual idealization, and the analysis technique appropriate to each, are now elucidated with a chemical engineering illustration.

2.1.2 A Chemical Engineering Illustration

Residence time, the amount of time a fluid element spends in a chemical reactor, is an important parameter in the design of chemical reactors. We wish to consider residence times in two classic reactor configurations: the plug flow reactor (PFR) and the continuous stirred tank reactor (CSTR).


FIGURE 2.1: Schematic diagram of a plug flow reactor (PFR).


Determinism and the PFR

The plug flow reactor (PFR) is a hollow tube in which reactants that are introduced at one end react as the fluid elements traverse the length of the tube and emerge at the other end. The name comes from the idealization that fluid elements move through as plugs with no longitudinal mixing (see Fig 2.1).

The PFR assumptions (idealizations) may be stated as follows:

• the reactor tube (l m long) has a uniform cross-sectional area, A m²;

• fluid elements move in plug flow with a constant velocity, v m/s, so that the velocity profile is flat;

• the flow rate through the reactor is constant at F m³/s.
Now consider that at time t = 0 we instantaneously inject a bolus of red dye of concentration C0 moles/m³ into the inlet stream. The following question is of interest in the study of residence time distributions in chemical reactor design:

How much time does each molecule of red dye spend in the reactor, if we could label them all and observe each one at the reactor exit?
Because of the plug flow idealization, each fluid element moves through the reactor with a constant velocity given by:

v = \frac{F}{A} \; \text{m/s} \qquad (2.6)

and it will take precisely

\tau = \frac{l}{v} = \frac{lA}{F} \; \text{secs} \qquad (2.7)

for each dye element to traverse the reactor. Hence, τ, the residence time for an ideal plug flow reactor (PFR), is a deterministic quantity because its value is exactly and precisely determinable from Eq (2.7), given F, A and l.
Keep in mind that the determinism that informs this analysis of the PFR residence time arises directly as a consequence of the central plug flow idealization. Any departures from such an idealization, especially the presence of significant axial dispersion (leading to a non-flat fluid velocity profile), will result in dye molecules no longer arriving at the outlet at precisely the same time.

FIGURE 2.2: Schematic diagram of a continuous stirred tank reactor (CSTR).
Randomness and the CSTR

With the continuous stirred tank reactor (CSTR), the reactant stream continuously flows into a tank that is vigorously stirred to ensure uniform mixing of its contents, while the product is continuously withdrawn from the outlet (see Fig 2.2). The assumptions (idealizations) in this case are:

• the reactor tank has a fixed, constant volume, V m³;

• the contents of the tank are perfectly mixed.

Once again, let us consider that a bolus of red dye of concentration C0 moles/m³ is instantaneously injected into the inlet stream at time t = 0; and again, ask: how much time does each molecule of red dye spend in the reactor? Unlike with the plug flow reactor, observe that it is impossible to answer this question a priori, or precisely: because of the vigorous stirring of the reactor contents, some dye molecules will exit almost instantaneously; others will stay longer, some for a very long time. In fact, it can be shown that, theoretically, 0 < θ < ∞. Hence, in this case, θ, the residence time, is a randomly varying quantity that can take on a range of values from 0 to ∞; it cannot therefore be adequately characterized by a single number. Notwithstanding, as all chemical engineers know, the random phenomenon of residence times for ideal CSTRs can, and has been, analyzed systematically (see for example, Hill, 1977¹).
¹C.G. Hill, Jr., An Introduction to Chemical Engineering Kinetics and Reactor Design, Wiley, NY, 1977, pp 388-396.

Theoretical Analysis of the Ideal CSTR's Residence Time

Even though based on chemical engineering principles, the results of the analysis we are about to discuss have fundamental implications for the general nature of the order present in the midst of random variability encountered in other applications, and for how such order provides the basis for analysis. (As an added bonus, this analysis also provides a non-probabilistic view of ideas usually considered the exclusive domain of probability.)
By carrying out a material balance around the CSTR (i.e., noting that the rate of accumulation of mass within a prescribed volume must equal the difference between the rate of input and the rate of output), it is possible to develop a mathematical model for this process as follows. If the volumetric flow rates into and out of the reactor are equal and given by F m³/s, and if C(t) represents the molar concentration of the dye in the well-mixed reactor, then, by the assumption of perfect mixing, this will also be the dye concentration at the exit of the reactor. The material balance equation is:

V\frac{dC}{dt} = FC_{in} - FC \qquad (2.8)

where Cin is the dye concentration in the inlet stream. If we define the parameter τ as

\tau = \frac{V}{F} \qquad (2.9)

and note that the introduction of a bolus of dye of concentration C0 at t = 0 implies:

C_{in} = C_0\,\delta(t) \qquad (2.10)

where δ(t) is the Dirac delta function, then Eq (2.8) becomes:

\tau\frac{dC}{dt} = -C + C_0\,\delta(t) \qquad (2.11)

a simple, linear first-order ODE whose solution is:

C(t) = \frac{C_0}{\tau}\,e^{-t/\tau} \qquad (2.12)
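(As a quick numerical check of this solution, one can integrate Eq (2.11) forward in time. The Python sketch below treats the bolus as the equivalent initial condition C(0) = C0/τ, a standard device for handling the delta-function input, and compares the Euler-integrated value with Eq (2.12); the numerical values are illustrative.)

    import math

    C0, tau, dt = 1.0, 5.0, 1.0e-4
    C, t = C0 / tau, 0.0          # bolus input <=> initial condition C(0) = C0/tau
    while t < 10.0:
        C += dt * (-C / tau)      # forward-Euler step of tau*dC/dt = -C for t > 0
        t += dt
    print(C, (C0 / tau) * math.exp(-t / tau))   # the two values agree closely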

If we now define as f(θ) the instantaneous fraction of the initial number of injected dye molecules exiting the reactor at time t = θ (those with residence time θ), i.e.

f(\theta) = \frac{C(\theta)}{C_0} \qquad (2.13)

we obtain immediately from Eq (2.12) that

f(\theta) = \frac{1}{\tau}\,e^{-\theta/\tau} \qquad (2.14)

recognizable to all chemical engineers as the familiar exponential instantaneous residence time distribution function for the ideal CSTR.

FIGURE 2.3: Instantaneous residence time distribution function for the CSTR (with τ = 5).
The reader should take good note of this expression: it shows up a few more times, and in various guises, in subsequent chapters. For now, let us observe that, even though (a) the residence time for a CSTR, θ, exhibits random variability, potentially able to take on values between 0 and ∞ (and is therefore not describable by a single value), so that (b) it is impossible to determine with absolute certainty precisely when any individual dye molecule will leave the reactor, even so, (c) the function f(θ) shown in Eq (2.14) mathematically characterizes the behavior of the entire ensemble of dye molecules, but in a way that requires some explanation:

1. It represents how the residence times of fluid particles in the well-mixed CSTR are distributed over the range of possible values 0 < θ < ∞ (see Fig 2.3).

2. This distribution of residence times is a well-defined, well-characterized function, but it is not a description of the precise amount of time a particular individual dye molecule will spend in the reactor; rather, it is a description of how many (or what fraction) of the entire collection of dye molecules will spend what amount of time in the reactor. For example, in broad terms, it indicates that a good fraction of the molecules have relatively short residence times, exiting the reactor quickly; a much smaller but non-zero fraction have relatively long residence times. It can also provide more precise statements, as follows.
3. From this expression (Eq (2.14)), we can determine the fraction of dye molecules that have remained in the reactor for an amount of time less than or equal to some time t (i.e., molecules exiting the reactor with age less than or equal to t): we do this by integrating f(θ) with respect to θ, as follows, to obtain

F(t) = \int_0^t \frac{1}{\tau}\,e^{-\theta/\tau}\,d\theta = 1 - e^{-t/\tau} \qquad (2.15)

from which we see that F(0), the fraction of dye molecules with age less than or equal to zero, is exactly zero, indicating the intuitively obvious: no matter how vigorous the mixing, each dye molecule spends at least a finite, non-zero amount of time in the reactor (no molecule exits instantaneously upon entry).
On the other hand, F(∞) = 1, since

F(\infty) = \int_0^{\infty} \frac{1}{\tau}\,e^{-\theta/\tau}\,d\theta = 1 \qquad (2.16)

again indicating the obvious: if we wait long enough, all dye molecules will eventually exit the reactor as t → ∞. In other words, the fraction of molecules exiting the reactor with age less than ∞ is exactly 1.
4. Since the fraction of molecules that will have remained in the reactor for an amount of time less than or equal to t is F(t), and the fraction that will have remained in the reactor for less than or equal to t + Δt is F(t + Δt), then the fraction with residence time in the infinitesimal interval between t and (t + Δt) is given by:

[t < \theta \le (t + \Delta t)] = F(t + \Delta t) - F(t) = \int_t^{t+\Delta t} \frac{1}{\tau}\,e^{-\theta/\tau}\,d\theta \qquad (2.17)

which, for very small Δt, simplifies to:

[t < \theta \le (t + \Delta t)] \approx f(t)\,\Delta t \qquad (2.18)

5. And finally, the average residence time may be determined from the expression in Eq (2.14) (and Eq (2.16)) as:

\bar{\theta} = \frac{\int_0^{\infty} \theta\,\frac{1}{\tau}\,e^{-\theta/\tau}\,d\theta}{\int_0^{\infty} \frac{1}{\tau}\,e^{-\theta/\tau}\,d\theta} = \tau \qquad (2.19)

where the numerator integral is evaluated via integration by parts. Observe from the definition of τ above (in Eq (2.9)) that this result makes perfect sense, strictly from the physics of the problem: particles in a stream flowing at the rate F m³/s through a well-mixed reactor of volume V m³ will spend an average of V/F = τ seconds in the reactor.
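(Items 1-5 can all be checked by brute-force simulation: drawing a large number of residence times distributed according to Eq (2.14) and tallying them. A minimal Python sketch, with τ = 5 as in Fig 2.3, is:)

    import random

    tau, N = 5.0, 100_000
    # random.expovariate takes the rate 1/tau and samples from Eq (2.14).
    theta = [random.expovariate(1.0 / tau) for _ in range(N)]
    print(sum(theta) / N)                          # close to tau, per Eq (2.19)
    print(sum(1 for s in theta if s <= tau) / N)   # close to F(tau) = 1 - exp(-1)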
We now observe, in conclusion, two important points: (i) even though at no point in the preceding discussion have we made any overt or explicit appeal to the concepts of probability, the unmistakable fingerprints of probability are evident all over (as upcoming chapters demonstrate concretely, but perhaps already recognizable to those with some familiarity with such concepts); (ii) nevertheless, this characterizing model in Eq (2.14) was made possible via first-principles knowledge of the underlying phenomenon. This is a central characteristic of random phenomena: appropriate theoretical characterizations are almost always possible in terms of ideal ensemble models of the observed variability dictated by the underlying phenomenological mechanism.

2.2 Random Mass Phenomena

2.2.1 Defining Characteristics

In such diverse areas as actuarial science, biology, chemical reactors, demography, economics, finance, genetics, human mortality, manufacturing quality assurance, polymer chemistry, etc., one repeatedly encounters a surprisingly common theme whereby phenomena which, on an individual level, appear entirely unpredictable, are well-characterized as ensembles (as demonstrated above with the residence time distribution in CSTRs). For example, as far back as 1662, in a study widely considered to be the genesis of population demographics and of modern actuarial science (by which insurance premiums are determined today), the British haberdasher John Graunt (1620-1674) had observed that the number of deaths and the age at death in London were surprisingly predictable for the entire population, even though it was impossible to predict which individual would die when, and in what manner. Similarly, while the number of monomer molecules linked together in any polymer molecule chain varies considerably, how many chains of a certain length a batch of polymer product contains can be characterized fairly predictably.
Such natural phenomena as those noted above have come to be known as Random Mass Phenomena, with the following defining characteristics:

1. Individual observations appear irregular because it is not possible to predict each one with certainty; but

2. The ensemble or aggregate of all possible outcomes is regular, well-characterized and determinable;

3. The underlying phenomenological mechanism accounting for the nature and occurrence of the specific observations determines the character of the ensemble;

4. Such a phenomenological mechanism may be known mechanistically (as was the case with the CSTR), or its manifestation may only be determined from data (as was the case with John Graunt's mortality tables of 1662).

This fortunate circumstance, aggregate predictability amidst individual irregularities, is why the primary issue with random phenomena analysis boils down to how to use ensemble descriptions and characterizations to carry out systematic analysis of the behavior of individual observations.

2.2.2 Variability and Uncertainty

While ensemble characterizations provide a means of dealing systematically with random mass phenomena, many practical problems still involve making decisions about specific, inherently unpredictable, outcomes. For example, the insurance company still has to decide what premium to charge each individual, on a person-by-person basis. When decisions must be made about specific outcomes of random mass phenomena, uncertainty is an inevitable consequence of the inherent variability. Furthermore, the extent or degree of variability directly affects the degree of uncertainty: a tighter clustering of possible outcomes implies less uncertainty, whereas a broader distribution of possible outcomes implies more uncertainty. The most useful mathematical characterization of ensembles must therefore permit not only systematic analysis, but also a rational quantification of the degree of variability inherent in the ensemble, and of the resulting uncertainty associated with each individual observation.

2.2.3 Practical Problems of Interest

Let xi represent individual observations, i = 1, 2, . . . , n, from a random mass phenomenon; let X be the actual variable of interest, different and distinct from xi, this latter being merely one out of many other possible realizations of X. For example, X can be the number of live births delivered by a patient after a round of in-vitro fertilization treatment, a randomly varying quantity; whereas xi = 2 (i.e. twins) is the specific outcome observed for a specific patient after a specific round of treatment. For now, let the aggregate description we seek be represented as f(x) (see, for example, Eq (2.14) for the CSTR residence time); what this is and how it is obtained is discussed later. In practice, only data in the form of the observations {x1, x2, . . . , xn} are available. The desired aggregate description, f(x), must be understood in its proper context as a descriptor of the (possibly infinite) collection of all possible outcomes, of which the observed data is only a sample. The fundamental problems of random phenomena analysis may now be stated formally as follows:

1. Given the data {x1, x2, . . . , xn}, what can we say about the complete f(x)?

2. Given f(x), what can we say about the specific xi values (both the observed {x1, x2, . . . , xn} and the yet unobserved)?

Embedded in these questions are the following affiliated questions that arise as a consequence: (a) How was the data set {x1, x2, . . . , xn} obtained in (1)? Will the procedure for obtaining the data affect how well we can answer question 1? (b) How was f(x) determined in (2)?

Subsequent chapters are devoted to dealing with these fundamental problems systematically and in greater detail.

2.3 Introducing Probability

2.3.1 Basic Concepts

Consider the prototypical random phenomenon for which the individual observation (or outcome) is not known with certainty a priori, but for which the complete totality of all possible observations (or outcomes) has been (or can be) compiled. Now consider a framework that assigns to each individual member of this collection of possible outcomes a real-valued number between 0 and 1 that represents the probability of its occurrence, such that:

1. an outcome that is certain to occur is assigned the number 1;

2. an outcome that is certain not to occur is assigned the number 0;

3. any other outcome falling between these two extremes is assigned a number that reflects the extent or degree of certainty (or uncertainty) associated with its occurrence.

Notice how this represents a shift in focus from the individual outcome itself to the probability of its occurrence. Using precise definitions and terminology, along with the tools of set theory, set functions and real analysis, we show in the chapters in Part II how to develop the machinery for the theory of probability, leading to the emergence of a compact functional form indicating how the probabilities of occurrence are distributed over all possible outcomes. The resulting probability distribution function becomes the primary vehicle for analyzing the behavior of random phenomena.
For example, the phenomenon of inclusions in manufactured glass sheets discussed in Chapter 1 is well-characterized by the following probability distribution function (pdf), which gives the probability of observing exactly x inclusions on a glass sheet as

f(x) = (e^{−λ} λ^x)/x!;  x = 0, 1, 2, . . .        (2.20)

a pdf with a single parameter, λ, characteristic of the manufacturing process used to produce the glass sheets. (As shown later, λ is the mean number of inclusions on a glass sheet.)


TABLE 2.1: Computed probabilities of occurrence of various numbers of inclusions, for λ = 2 in Eq (2.20)

x = No. of inclusions    f(x) = prob. of occurrence
0                        0.135
1                        0.271
2                        0.271
3                        0.180
4                        0.090
5                        0.036
...                      ...
8                        0.001
9                        0.000

Even though we do not know precisely how many inclusions will be found on the next glass sheet inspected in the QC lab, given the parameter λ we can use Eq (2.20) to make statements about the probabilities of individual occurrences. For instance, if λ = 2 for a certain process, Eq (2.20) allows us to state that the probability of finding a "perfect" glass sheet with no inclusions in the products made by this process (i.e. x = 0) is 0.135; or that the probability of finding 1 inclusion is 0.271, coincidentally the same as the probability of finding 2 inclusions; or that there is a vanishingly small probability of finding 9 or more inclusions in this production facility. The complete set of probabilities computed from Eq (2.20) is shown in Table 2.1.
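The entries in Table 2.1 are easy to reproduce; the following short Python sketch (the function name poisson_pmf is our own illustrative choice, not the text's) evaluates Eq (2.20) for λ = 2:

from math import exp, factorial

def poisson_pmf(x, lam):
    # Eq (2.20): f(x) = e^(-lam) * lam^x / x!
    return exp(-lam) * lam ** x / factorial(x)

lam = 2.0  # mean number of inclusions per glass sheet
for x in range(10):
    print(f"x = {x}: f(x) = {poisson_pmf(x, lam):.3f}")
# Prints 0.135, 0.271, 0.271, 0.180, 0.090, 0.036, ..., 0.001, 0.000,
# matching Table 2.1.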

2.3.2 Interpreting Probability

There always seems to be a certain amount of debate over the meaning, definition and interpretation of probability. This is perhaps due to a natural predisposition towards confusing a conceptual entity with how a numerical value is determined for it. For example, from a certain perspective, temperature, as a conceptual entity in Thermodynamics, is a real number assigned to an object to indicate its degree of "hotness"; it is distinct from how its value is determined (by a thermometer, thermocouple, or any other means). The same is true of mass, a quantity assigned in Mechanics to a body to indicate how much matter it contains and how heavy it will be in a gravitational field; or distance, assigned in geometry to indicate the closeness of two points in a geometric space. The practical problem of how to determine numerical values for these quantities, even though important in its own right, is a separate issue entirely.
This is how probability should be understood: it is simply a quantity that is assigned to indicate the degree of uncertainty associated with the occurrence of a particular outcome. As with temperature the conceptual quantity, how a numerical value is determined for the probability of the occurrence of a particular outcome under any specific circumstance depends on the circumstance itself. To carry the analogy with temperature a bit further: while a thermometer capable of determining temperature to within half a degree will suffice in one case, a more precise device, such as a thermocouple, may be required in another case, and an optical pyrometer for yet another case. Whatever the case, under no circumstance should the device employed to determine its numerical value usurp the role of, or become the surrogate for, temperature the quantity. This is important in properly interpreting probability, the conceptual entity: how an appropriate value is to be determined for probability, an important practical problem in its own right, should not be confused with the quantity itself.

With these ideas in mind, let us now consider several standard perspectives of probability that have evolved over the years. These are best understood as various techniques for how numerical values are determined rather than as definitions of what probability is.
Classical (À-Priori) Probability

Consider a random phenomenon for which the total number of possible outcomes is known to be N, all of which are equally likely; of these, let N_A be the number of outcomes in which A is observed (i.e. outcomes that are "favorable" to A). Then according to the classical (or à-priori) perspective, the probability of the occurrence of outcome A is defined as

P(A) = N_A/N        (2.21)

For example, in tossing a single perfect die once, the probability of observing a 3 is, according to this viewpoint, evaluated as 1/6, since the total number of possible outcomes is 6, of which only 1 is favorable to the desired observation of 3. Similarly, if B is the outcome that one observes an odd number of dots, then P(B) = 3/6 = 0.5.

Observe that according to this view, no experiments have been performed yet; the formulation is based entirely on an à-priori enumeration of N and N_A. However, this intuitively appealing perspective is not always applicable:

- What if all the outcomes are not equally likely?

- How about random phenomena whose outcomes cannot be characterized as cleanly in this fashion, say, for example, the prospect of a newly purchased refrigerator lasting for 25 years without repair? Or the prospect of snow falling on a specific April day in Wisconsin?
What Eq (2.21) provides is an intuitively appealing (and theoretically sound) means of determining an appropriate value for P(A); but it is restricted only to those circumstances where the random phenomenon in question is characterized in such a way that N and N_A are natural and easy to identify.
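Within that restricted setting, Eq (2.21) is trivially mechanized. The following minimal Python sketch (with names of our own choosing) evaluates the two die-tossing examples above by direct enumeration:

from fractions import Fraction

def classical_prob(event, sample_space):
    # Eq (2.21): P(A) = N_A / N, assuming equally likely outcomes
    return Fraction(len(event), len(sample_space))

die = {1, 2, 3, 4, 5, 6}       # all N = 6 outcomes of one toss of a perfect die
A = {3}                        # observe a 3
B = {1, 3, 5}                  # observe an odd number of dots

print(classical_prob(A, die))  # 1/6
print(classical_prob(B, die))  # 1/2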
Relative Frequency (À-Posteriori) Probability

On the opposite end of the spectrum from the à-priori perspective is the following alternative: consider an experiment that is repeated n times under identical conditions, where the outcomes involving A have been observed to occur n_A times. Then, à-posteriori, the probability of the occurrence of outcome A is defined as

P(A) = lim_{n→∞} (n_A/n)        (2.22)

The appeal of this viewpoint is not so much that it is just as intuitive as the previous one, but that it is also empirical, making no assumptions about equal likelihood of outcomes. It is based on the actual performance of experiments and the actual à-posteriori observation of the relative frequency of occurrences of the desired outcome. This perspective provides a prevalent interpretation of probability as the theoretical value of long-range relative frequencies. In fact, this is what motivates the notion of the theoretical distribution as the limiting form to which the empirical frequency distribution tends with the acquisition of increasing amounts of data.
However, this perspective also suffers from some limitations:

- How many trials, n, are sufficient for Eq (2.22) to be useful in practice?

- How about random phenomena for which the desired outcome does not lend itself to repetitive experimentation under identical conditions, say, for example, the prospect of snow falling on a specific April day in Wisconsin? Or the prospect of your favorite team winning the basketball championship next year?

Once again, these limitations arise primarily because Eq (2.22) is simply just another means of determining an appropriate value for P(A), one that happens to be valid only when the random phenomenon is such that the indicated repeated experimentation is not only possible and convenient, but for which, in practice, truncating after a "sufficiently large" number of trials to produce a finite approximation presents no conceptual dilemma. For example, after tossing a coin 500 times and obtaining 251 heads, declaring the probability of obtaining a head upon a single toss to be 0.5 presents no conceptual dilemma whatsoever.
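The limiting behavior postulated in Eq (2.22) can also be observed by simulation; in the hedged sketch below, Python's standard random module stands in for the physical coin, and the relative frequency n_A/n is seen to settle near 0.5 as n grows:

import random

random.seed(2)   # fixed seed so the illustration is reproducible
for n in (10, 100, 1_000, 10_000, 100_000):
    n_heads = sum(random.random() < 0.5 for _ in range(n))   # n fair-coin tosses
    print(f"n = {n:>6}: n_A/n = {n_heads / n:.4f}")
# The printed ratios fluctuate for small n and approach 0.5 for large n,
# illustrating the limit in Eq (2.22).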
Subjective Probability

There is yet another alternative perspective whereby P(A) is taken simply as a measure of the degree of (personal) belief associated with the postulate that A will occur, the value having been assigned subjectively by the individual concerned, akin to betting odds. Thus, for example, in rolling a perfect die, the probability of obtaining a 3 is assigned strictly on the basis of what the individual believes to be the likely odds of obtaining this outcome, without recourse to enumerating equally likely outcomes (the à-priori perspective), or performing the die roll an infinite number of times (the à-posteriori perspective).

The obvious difficulty with this perspective is its subjectivity, so that outcomes that are equally likely (on an objective basis) may end up being assigned different probabilities by different individuals. Nevertheless, for those practical applications where the outcomes cannot be enumerated, and for which the experiment cannot be repeated a large number of times, the subjective allocation of probability may be the only viable option, at least à-priori. As we show later, it is possible to combine this initial subjective declaration with subsequent limited experimentation in order to introduce the objective information contained in data, thereby determining appropriate values of the sought probabilities objectively.

2.4 The Probabilistic Framework

Beginning with the next chapter, Part II is devoted to an axiomatic treatment of probability, including basic elements of probability theory, random variables, and probability distribution functions, within the context of a comprehensive framework for systematically analyzing random phenomena.

The central conceptual elements of this framework are: (i) a formal representation of uncertain outcomes with the random variable, X; and (ii) the mathematical characterization of this random variable by the probability distribution function (pdf), f(x). How the probabilities are distributed over the entire aggregate collection of all possible outcomes, expressed in terms of the random variable, X, is contained in this pdf. The following is a procedure for problem-solving within this framework:
1. Problem Formulation: Define and formulate the problem appropriately. Examine the random phenomenon in question, determine the random variable(s), and assemble all available information about the underlying mechanisms;

2. Model Development: Identify, postulate, or develop an appropriate ideal model of the relevant random variability in the form of the probability distribution function f(x);

3. Problem Solution: Use the model to solve the relevant problem (analysis, prediction, inference, estimation, etc.);

4. Results Validation: Analyze and validate the result and, if necessary, return to any of the preceding steps as appropriate.


This problem-solving approach is illustrated throughout the rest of the book, particularly in the chapters devoted to actual case studies.

2.5 Summary and Conclusions

Understanding why, despite appearances, randomly varying phenomena can be subject to analysis of any sort at all is what has occupied our attention in this chapter. Before beginning a formal discussion of random phenomena analysis itself, it was necessary to devote some time to a closer examination of several important foundational issues that are essential to a solid understanding of randomly varying phenomena and their analysis: determinism and randomness; variability and uncertainty; probability and the probabilistic framework for solving problems involving random variability. Using idealized chemical reactors as illustration, we have presented determinism and randomness as two extreme idealizations of natural phenomena. The residence time of a dye molecule in the hollow tube of a plug flow reactor (PFR) was used to demonstrate the ideal deterministic variable whose value is fixed and determinable precisely. At the other end of the spectrum is the length of time the dye molecule spends in a vigorously stirred vessel, the ideal continuous stirred tank reactor (CSTR). This time the variable is random and hence impossible to determine precisely à-priori, but it is not haphazard. The mathematical model derived for the distribution of residence times in the CSTR (especially how it was obtained from first principles) provides a preview and a chemical engineering analog of what is to come in Chapters 8 and 9, where models are derived for a wide variety of randomly varying phenomena in similar fashion on the basis of underlying phenomenological mechanisms.

We also examined the characteristics of random mass phenomena, especially highlighting the co-existence of aggregate predictability in the midst of individual irregularities. This "order-in-the-midst-of-variability" makes possible the use of probability and probability distributions to characterize ensemble behavior mathematically. The subsequent introduction of the concept of probability, while qualitative and informal, is nonetheless important. Among other things, it provided a non-technical setting for dealing with the potentially confusing issue of how to interpret probability. In this regard, it bears reiterating that much confusion can be avoided by remembering to keep the concept of probability (as a quantity between 0 and 1 used to quantify degree of uncertainty) separate from the means by which numerical values are determined for it. It is in this latter sense that the various interpretations of probability (classical, relative frequency, and subjective) are to be understood: these are all various means of determining a specific value for the probability of a specific outcome; and, depending on the situation at hand, one approach is often more appropriate than the others.


Here are some of the main points of the chapter again:

- Randomness does not imply disorder; it is order of a different kind, whereby aggregate predictability co-exists with individual irregularity;

- Determinism and randomness are two extreme idealizations of naturally occurring phenomena, and both are equally subject to rigorous analysis;

- The mathematical framework to be employed in the rest of this book is based on probability, the concept of a random variable, X, and its mathematical characterization by the pdf, f(x).

REVIEW QUESTIONS

1. If not a synonym for disorder, then what is randomness in scientific observation?

2. What is the concept of determinism?
3. Why are the expressions in Eqs (2.1) and (2.2) considered deterministic?

4. What is an example phenomenon that had to be ignored in order to obtain the deterministic expression in Eq (2.1)? And what is an example phenomenon that had to be ignored in order to obtain the deterministic expression in Eq (2.2)?
5. What are the main characteristics of randomness as described in Subsection 2.1.1?

6. Compare and contrast determinism and randomness as two opposite idealizations of natural phenomena.

7. Which idealized phenomenon does residence time in a plug flow reactor (PFR) represent?

8. What is the central plug flow idealization in a plug flow reactor, and how will departures from such idealization affect the residence time in the reactor?

9. Which idealized phenomenon does residence time in a continuous stirred-tank reactor (CSTR) represent?

10. On what principle is the mathematical model in Eq (2.8) based?
11. What does the expression in Eq (2.15) represent?
12. What observation by John Graunt is widely considered to be the genesis of population demographics and of modern actuarial science?

13. What are the defining characteristics of random mass phenomena?


14. How does inherent variability give rise to uncertainty?

15. What are the fundamental problems of random phenomena analysis as presented in Subsection 2.2.3?

16. What is the primary mathematical vehicle introduced in Subsection 2.3.1 for analyzing the behavior of random phenomena?

17. What is the classical (à-priori) perspective of probability and when is it not applicable?

18. What is the relative frequency (à-posteriori) perspective of probability and what are its limitations?

19. What is the subjective perspective of probability and under what circumstances is it the only viable option for specifying probability in practice?

20. What are the central conceptual elements of the probabilistic framework?

21. What are the four steps in the procedure for problem-solving within the probabilistic framework?

EXERCISES

Section 2.1
2.1 Solve Eq (2.11) explicitly to confirm the result in Eq (2.12).

2.2 Plot the expression in Eq (2.15) as a function of the scaled time variable, t̄ = t/τ; determine the percentage of dye molecules with "age" less than or equal to the mean residence time, τ.

2.3 Show that

∫₀^∞ (1/τ) e^{−θ/τ} θ dθ = τ        (2.23)

and hence confirm the result in Eq (2.19).


Section 2.2
2.4 The following probability distribution functions:

f(x) = [1/(3√(2π))] e^{−x²/18};  −∞ < x < ∞        (2.24)

and

f(y) = [1/√(2π)] e^{−y²/2};  −∞ < y < ∞        (2.25)

represent how the occurrences of all the possible outcomes of the two randomly varying, continuous variables, X and Y, are distributed. Plot these two distribution functions on the same graph. Which of these variables has a higher degree of uncertainty associated with the determination of any particular outcome? Why?

2.5 When a fair coin is tossed 4 times, it is postulated that the probability of obtaining x heads is given by the probability distribution function:

f(x) = [4!/(x!(4 − x)!)] 0.5⁴        (2.26)

Determine the probability of obtaining x = 0, 1, 2, . . . , 4 heads. Intuitively, which of these outcomes would you think will be the most likely? Are the results of your computation consistent with your intuition?
Section 2.3
2.6 In tossing a fair coin once, describe the classical (à-priori), relative frequency (à-posteriori), and subjective perspectives of the probability of obtaining a head.

APPLICATION PROBLEMS

2.7 For each of the following two-reactor configurations:

(a) two plug flow reactors in series, where the length of reactor 1 is l_1 m and that of reactor 2 is l_2 m, but both have the same uniform cross-sectional area A m²;

(b) two continuous stirred tank reactors with volumes V_1 and V_2 m³;

(c) the PFR in Fig 2.1 followed by the CSTR in Fig 2.2;

given that the flow rate through each reactor ensemble is constant at F m³/s, obtain the residence time, τ, or the residence time distribution, f(θ), as appropriate. Make any assumption you deem appropriate about the concentrations C_1(t) and C_2(t) in the first and second reactors, respectively.

2.8 In the summer of 1943 during World War II, a total of 365 warships were attacked by Kamikaze pilots: 180 took evasive action, and 60 of these were hit; the remaining 185 counterattacked, of which 62 were hit. Using a relative frequency interpretation, and invoking any other assumption you deem necessary, determine the probability that any attacked warship will be hit regardless of tactical response. Also determine the probability that a warship taking evasive action will be hit, and the probability that a counterattacking warship will be hit. Compare these three probabilities and discuss what this implies regarding choosing an appropriate tactical response. (A full discussion of this problem is contained in Chapter 7.)

2.9 Two American National Football League (NFL) teams, A and B, with respective Win-Loss records 9-6 and 12-3 after 15 weeks, are preparing to face each other in the 16th and final game of the regular season.
(i) From a relative frequency perspective of probability, use the supplied information (and any other assumption you deem necessary) to compute the probability of Team A winning any generic game, and also of Team B winning any generic game.


(ii) When the two teams play each other, upon the presupposition that past record is the best indicator of a team's chances of winning a new game, determine reasonable values for P(A), the probability that team A wins the game, and P(B), the probability that team B wins, assuming that this game does not end up in a tie. Note that for this particular case,

P(A) + P(B) = 1        (2.27)

Part II

Probability

Characterizing Random Variability

Here we have the opportunity of expounding more clearly what has already been said.

René Descartes (1596–1650)

Chapter 3: Fundamentals of Probability Theory
Chapter 4: Random Variables
Chapter 5: Multidimensional Random Variables
Chapter 6: Random Variable Transformations
Chapter 7: Application Case Studies I: Probability

Chapter 3

Fundamentals of Probability Theory

3.1 Building Blocks
3.2 Operations
    3.2.1 Events, Sets and Set Operations
    3.2.2 Set Functions
    3.2.3 Probability Set Function
    3.2.4 Final considerations
3.3 Probability
    3.3.1 The Calculus of Probability
    3.3.2 Implications
3.4 Conditional Probability
    3.4.1 Illustrating the Concept
    3.4.2 Formalizing the Concept
    3.4.3 Total Probability
    3.4.4 Bayes' Rule
3.5 Independence
3.6 Summary and Conclusions
REVIEW QUESTIONS
EXERCISES
APPLICATION PROBLEMS

Before setting out to attack any definite problem
it behooves us first, without making any selection,
to assemble those truths that are obvious
as they present themselves to us
and afterwards, proceeding step by step,
to inquire whether any others can be deduced from these.

René Descartes (1596–1650)

The paradox of randomly varying phenomena, that the aggregate ensemble behavior of unpredictable, irregular, individual observations is stable and regular, provides a basis for developing a systematic analysis approach. Such an approach requires temporarily abandoning the futile task of predicting individual outcomes and instead focussing on characterizing the aggregate ensemble in a mathematically appropriate manner. The central element is a machinery for determining the mathematical probability of the occurrence of each outcome and for quantifying the uncertainty associated with any attempts at predicting the intrinsically unpredictable individual outcomes. How this probability machinery is assembled from a set of simple building blocks and mathematical operations is presented in this chapter, along with the basic concepts required for its subsequent use for systematic analysis of random phenomena. This chapter is therefore devoted to introducing probability in its basic form first, before we begin employing it in subsequent chapters to solve problems involving random phenomena.

3.1 Building Blocks

A formal mathematical theory for studying random phenomena makes use of certain words, concepts, and terminology in a more restricted technical sense than is typically implied by common usage. We begin by providing the definitions of:

- Experiments; Trials; Outcomes

- Sample space; Events

within the context of the machinery of probability theory.
1. Experiment: Any process that generates observable information about the random phenomenon in question.

This could be the familiar sort of experiment in the sciences and engineering (such as the determination of the pH of a solution, the quantification of the effectiveness of a new drug, or the determination of the effect of an additive on gasoline consumption in an automobile engine); it also includes the simple, almost artificial sort, such as tossing a coin or some dice, drawing a marble from a box, or a card from a well-shuffled deck. We will employ such simple "conceptual" experiments with some regularity because they are simple and easy to conceive mentally, but more importantly because they serve as useful models for many practical, more complex problems, allowing us to focus on the essentials and avoid getting bogged down with unnecessary and potentially distracting details. For example, in inspecting a manufactured lot for defective parts, so long as the result of interest is whether the selected and tested part is defective or not, the "real" experiment is well-modeled by the toss of an appropriate coin.

2. Outcome: The result of an experiment.

This could be as simple as an attribute, such as the color of a marble drawn from a box, or whether the part drawn from a manufactured lot is defective or not; it could be a discrete quantity such as the number of heads observed after 10 tosses of a coin, or the number of contaminants observed on a silicon wafer; it could also be a continuous quantity such as the temperature of reactants in a chemical reactor, or the concentration of arsenic in a water sample.


3. Trial: A single performance of a well-defined experiment giving rise to an outcome.

Random phenomena are characterized by the fact that each trial of the same experiment performed under identical conditions can potentially produce different outcomes.

Closely associated with the possible outcomes of an experiment, and crucial to the development of probability theory, are the concepts of the sample space and events.

4. Sample Space: The set of all possible outcomes of an experiment.

If the elements of this set are individual, distinct, countable entities, then the sample space is said to be discrete; if, on the other hand, the elements are a continuum of values, the sample space is said to be continuous.

5. Event: A set of possible outcomes that share a common attribute.

The following examples illustrate these concepts.
Example 3.1 THE BUILDING BLOCKS OF PROBABILITY
In tossing a coin 3 times and recording the number of observed heads and tails, identify the experiment, what each trial entails, the outcomes, and the sample space.

Solution:
1. Experiment: Toss a coin 3 times; record the number of observed heads (each one as an H) and tails (each one as a T);

2. Trial: Each trial involves 3 consecutive tosses of the coin;

3. Outcomes: Any one of the following is a possible outcome: HHH, HHT, HTH, THH, HTT, THT, TTH, TTT.

4. Sample space: The set defined by

Ω = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}        (3.1)

consisting of all 8 possible outcomes, is the sample space for this experiment. This is a discrete sample space because there are 8 individual, distinct and countable elements.

Example 3.2 EVENTS ASSOCIATED WITH EXAMPLE 3.1
Identify some events associated with the experiment introduced in Example 3.1.

Solution:
The set A = {HHT, HTH, THH} consists of those outcomes involving the occurrence of exactly two heads; it therefore represents the event that exactly 2 heads are observed when a coin is tossed 3 times.

The set B = {TTT} consists of the only outcome involving the occurrence of 3 tails; it therefore represents the event that 3 tails are observed.

The set C = {HHH, HHT, HTH, THH} consists of the outcomes involving the occurrence of at least 2 heads; it represents the event that at least 2 heads are observed.

Similarly, the set D = {HHH} represents the event that 3 heads are observed.
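Because this sample space is small, it can be enumerated explicitly. The following Python sketch (our own illustration, using built-in sets) constructs Ω of Eq (3.1) and the four events just identified; it also previews the mutual exclusivity property discussed next:

from itertools import product

# All 2^3 outcomes of three consecutive coin tosses, as in Eq (3.1)
Omega = {''.join(seq) for seq in product('HT', repeat=3)}

A = {w for w in Omega if w.count('H') == 2}   # exactly 2 heads
B = {'TTT'}                                   # 3 tails
C = {w for w in Omega if w.count('H') >= 2}   # at least 2 heads
D = {'HHH'}                                   # 3 heads

print(sorted(Omega))   # the 8 outcomes HHH, HHT, ..., TTT
print(B & D)           # set(): the elementary events B and D are mutually exclusive
print(A & C)           # {'HHT', 'HTH', 'THH'}: compound events can overlap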

A simple or elementary event is one that consists of one and only one outcome of the experiment, i.e. a set with only one element. Thus, in Example 3.2, set B and set D are examples of elementary events. Any other event consisting of more than one outcome is a complex or compound event. Sets A and C in Example 3.2 are compound events. (One must be careful to distinguish between the set and its elements. The set B in Example 3.2 contains one element, TTT, but the set is not the same as the element. Thus, even though the elementary event consists of a single outcome, one is not the same as the other.)

Elementary events possess an important property that is crucial to the development of probability theory:

- An experiment conducted once produces one and only one outcome;

- The elementary event consists of only one outcome;

- One and only one elementary event can occur for every experimental trial;

Therefore:

Simple (elementary) events are mutually exclusive.

In Example 3.2, sets B and D represent elementary events; observe that if one occurs, the other one cannot. Compound events do not have this property. In this same example, observe that if, after a trial, the outcome is HTH (a tail sandwiched between two heads), event A has occurred (we have observed precisely 2 heads), but so has event C, which requires observing 2 or more heads. In the language of sets, the element HTH belongs to both set A and set C.

An elementary event therefore consists of a single outcome and cannot be decomposed into a simpler event; a compound event, on the other hand, consists of a collection of more than one outcome and can therefore be composed from several simple events.

3.2 Operations

If rational analysis of random phenomena depends on working with the aggregate ensemble of all possible outcomes, the next step in the assembly of the analytical machinery is a means of operating on the component building blocks identified above. First the outcomes, already represented as events, must be firmly rooted in the mathematical soil of sets so that established basic set operations can be used to operate on events. The same manipulations of standard algebra, and the algebra of sets, can then be used to obtain algebraic relationships between the events that comprise the aggregate ensemble of the random phenomenon in question. The final step is the definition of functions, and accompanying operational rules, that allow us to perform functional analysis on the events.

3.2.1 Events, Sets and Set Operations

We earlier defined the sample space Ω as a set whose elements are all the possible outcomes of an experiment. Events are also sets, but they consist of only certain elements of Ω that share a common attribute. Thus,

Events are subsets of the sample space Ω.

Of all the subsets of Ω, there are two special ones with important connotations: ∅, the empty set consisting of no elements at all, and Ω itself. In the language of events, the former represents the "impossible" event, while the latter represents the "certain" event.

Since they are sets, events are amenable to analysis using precisely the same algebra of set operations (union, intersection and complement) which we now briefly review.
1. Union: A ∪ B represents the set of elements that are either in A or in B. In general,

A_1 ∪ A_2 ∪ A_3 ∪ . . . ∪ A_k = ⋃_{i=1}^{k} A_i        (3.2)

is the set of elements that are in at least one of the k sets, {A_i}_{i=1}^{k}.

2. Intersection: A ∩ B represents the set of elements that are in both A and B. In general,

A_1 ∩ A_2 ∩ A_3 ∩ . . . ∩ A_k = ⋂_{i=1}^{k} A_i        (3.3)

is the set of elements that are common to all the k sets, {A_i}_{i=1}^{k}.
To discuss the third set operation requires two special sets: the universal set (or "universe"), typically designated Ω, and the null (or empty) set, typically designated ∅. The universal set consists of all possible elements of interest, while the null set contains no elements. (We have just recently introduced such sets above, but in the specific context of the sample space of an experiment; the current discussion is general and not restricted to the analysis of randomly varying phenomena and their associated sample spaces.)

These sets have the special properties that for any set A,

A ∪ ∅ = A;        (3.4)
A ∩ ∅ = ∅;        (3.5)

and

A ∪ Ω = Ω;        (3.6)
A ∩ Ω = A        (3.7)

3. Complement: A*, the complement of set A, is always defined with respect to the universal set Ω; it consists of all the elements of Ω that are not in A. The following are basic relationships associated with the complement operation:

∅* = Ω;  Ω* = ∅        (3.8)
(A*)* = A;        (3.9)
(A ∪ B)* = A* ∩ B*        (3.10)
(A ∩ B)* = A* ∪ B*        (3.11)

with the last two expressions known as DeMorgan's Laws.


The rules of set algebra (similar to those of standard algebra) are as follows:

Commutative law:
A ∪ B = B ∪ A
A ∩ B = B ∩ A

Associative law:
(A ∪ B) ∪ C = A ∪ (B ∪ C)
(A ∩ B) ∩ C = A ∩ (B ∩ C)

Distributive law:
(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C)
(A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C)
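These identities are easy to spot-check numerically; the minimal Python sketch below (the sets Omega, A, B and C are arbitrary choices of ours) verifies DeMorgan's Laws and one distributive law with built-in set operations:

Omega = set(range(10))   # a small universal set
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}
C = {3, 4, 6}

def complement(S):
    return Omega - S

assert complement(A | B) == complement(A) & complement(B)   # Eq (3.10)
assert complement(A & B) == complement(A) | complement(B)   # Eq (3.11)
assert (A | B) & C == (A & C) | (B & C)                     # distributive law
print("DeMorgan's Laws and the distributive law hold for this example.")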


TABLE 3.1: Subsets and Events

Subset    Event
Ω         Certain event
∅         Impossible event
A*        Non-occurrence of event A
A ∪ B     Event A or B
A ∩ B     Events A and B

Table 3.1 presents some information about the nature of subsets of Ω interpreted in the language of events.

Note in particular that if A ∩ B = ∅, A and B are said to be disjoint sets (with no elements in common); in the language of events, this implies that event A occurring together with event B is impossible. Under these circumstances, events A and B are said to be mutually exclusive.
Example 3.3 PRACTICAL ILLUSTRATION OF SETS AND EVENTS
Samples from various batches of a polymer resin manufactured at a plant site are tested in a quality control laboratory before release for sale. The result of the tests allows the manufacturer to classify the product into the following 3 categories:

1. Meets or exceeds quality requirement: assign #1; approve for sale as 1st quality.

2. Barely misses quality requirement: assign #2; approve for sale as 2nd grade at a lower price.

3. Fails completely to meet quality requirement: assign #3; reject as poor grade and send back to be incinerated.

Identify the experiment, outcome, trial, sample space and the events associated with this practical problem.

Solution:
1. Experiment: Take a sample of polymer resin and carry out the prescribed product quality test.

2. Trial: Each trial involves taking a representative sample from each polymer resin batch and testing it as prescribed.

3. Outcomes: The assignment of a number 1, 2, or 3, depending on how the result of the test compares to the product quality requirements.

4. Sample space: The set Ω = {1, 2, 3} containing all possible outcomes.

5. Events: The subsets of the sample space are identified as follows: E_0 = ∅; E_1 = {1}; E_2 = {2}; E_3 = {3}; E_4 = {1, 2}; E_5 = {1, 3}; E_6 = {2, 3}; E_7 = {1, 2, 3}. Note that there are 8 in all. In general, a set with n distinct elements will have 2^n subsets.

Note that this "real" experiment is identical in spirit to the "conceptual" experiment in which 3 identical ping-pong balls inscribed with the numbers 1, 2, and 3 are placed in a box, and each trial involves drawing one out and recording the inscribed number found on the chosen ball. Employing the artificial surrogate may sometimes be a useful device to enable us to focus on the essential components of the problem.
Example 3.4 INTERPRETING EVENTS OF EXAMPLE 3.3
Provide a practical interpretation of the events identified in the quality assurance problem of Example 3.3 above.

Solution:
E_1 = {1} is the event that the batch is of 1st grade;
E_2 = {2} is the event that the batch is of 2nd grade;
E_3 = {3} is the event that the batch is rejected as poor grade.
These are elementary events; they are mutually exclusive.

E_4 = {1, 2} is the event that the batch is either 1st grade or 2nd grade;
E_5 = {1, 3} is the event that the batch is either 1st grade or rejected;
E_6 = {2, 3} is the event that the batch is either 2nd grade or rejected.
These events are not elementary and are not mutually exclusive. For instance, if a sample analysis indicates the batch is 1st grade, then the events E_1, E_4 and E_5 have all occurred.

E_7 = {1, 2, 3} = Ω is the event that the batch is either 1st grade, or 2nd grade, or rejected;
E_0 = ∅ is the event that the batch is neither 1st grade, nor 2nd grade, nor rejected.
Event E_7 is certain to happen: the outcome of the experiment has to be one of these three classifications; there is no other alternative. Event E_0, on the other hand, is impossible, for the same reason.
Example 3.5 COMPOUND EVENTS FROM ELEMENTARY EVENTS
Show how the compound events in Examples 3.3 and 3.4 can be composed from (or decomposed into) elementary events.

Solution:
The compound events E_4, E_5, E_6 and E_7 are related to the elementary events E_1, E_2 and E_3 as follows:

E_4 = E_1 ∪ E_2        (3.12)
E_5 = E_1 ∪ E_3        (3.13)
E_6 = E_2 ∪ E_3        (3.14)
E_7 = E_1 ∪ E_2 ∪ E_3        (3.15)


TABLE 3.2: Class list and attributes

Name      Sex (M or F)   Age (in years)   Amount in wallet (to the nearest $)   Height (in inches)
Allison   F              21               $17.00                                66
Ben       M              23               $15.00                                72
Chrissy   F              23               $26.00                                65
Daoud     M              25               $35.00                                67
Evan      M              22               $27.00                                73
Fouad     M              20               $15.00                                69
Gopalan   M              21               $29.00                                68
Helmut    M              19               $13.00                                71
Ioannis   M              25               $32.00                                70
Jim       M              24               $53.00                                74
Katie     F              22               $41.00                                70
Larry     M              24               $28.00                                72
Moe       M              21               $18.00                                71
Nathan    M              22               $6.00                                 68
Olu       M              26               $23.00                                72

3.2.2 Set Functions

A function F(·), defined on the subsets of Ω such that it assigns one and only one real number to each subset of Ω, is known as a set function. By this definition, no one subset can be assigned more than one number by a set function. The following examples illustrate the concept.

Example 3.6 SET FUNCTIONS DEFINED ON THE SET OF STUDENTS IN A CLASSROOM
Table 3.2 shows a list of attributes associated with 15 students in attendance on a particular day in a 600-level course offered at the University of Delaware. Let set A be the subset of female students and B, the subset of male students. Obtain the real number assigned by the following set functions:

1. N(A), the total number of female students in class;

2. N(Ω), the total number of students in class;

3. M(B), the sum total amount of money carried by the male students;

4. H̄(A), the average height (in inches) of female students;

5. Y⁺(B), the maximum age, in years, of male students.

Solution:
1. N(A) = 3;

2. N(Ω) = 15;

3. M(B) = $294.00;

4. H̄(A) = 67 ins.

5. Y⁺(B) = 26 years.

FIGURE 3.1: Venn diagram for Example 3.7

A set function Q is said to be additive if for every pair of disjoint subsets A and B of Ω,

Q(A ∪ B) = Q(A) + Q(B)        (3.16)

For example, the set function N(·) in Example 3.6 is an additive set function. Observe that the sets A and B in this example are disjoint; furthermore, Ω = A ∪ B. Now, N(Ω) = N(A ∪ B) = 15, while N(A) = 3 and N(B) = 12. Thus for this example,

N(A ∪ B) = N(A) + N(B)        (3.17)

However, H̄(·) is not an additive set function, and neither is Y⁺(·).

In general, when two sets are not disjoint, i.e. when A ∩ B ≠ ∅, so that the intersection is non-empty, it is easy to show (see exercise at the end of the chapter) that if Q(·) is an additive set function,

Q(A ∪ B) = Q(A) + Q(B) − Q(A ∩ B)        (3.18)

Example 3.7 ADDITIVE SET FUNCTION ON NON-DISJOINT SETS
An old batch of spare parts contains 40 parts, of which 3 are defective; a newly manufactured batch of 60 parts was added to make up a consolidated batch of 100 parts, of which a total of 9 are defective. Find the total number of parts that are either defective or from the old batch.

Solution:
If A is the set of defective parts and B is the set of parts from the old batch, and if N(·) is the number of parts in a set, then we seek N(A ∪ B). The Venn diagram in Fig 3.1 shows the distribution of elements in each set. From Eq (3.18),

N(A ∪ B) = N(A) + N(B) − N(A ∩ B) = 9 + 40 − 3 = 46        (3.19)

so that there are 46 parts that are either defective or from the old batch.
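The same bookkeeping can be mimicked with explicit sets. In the Python sketch below, the particular labels assigned to the defective parts are our own illustrative assumption (only the counts matter for the result):

# B: the 40 parts from the old batch, labeled 1-40;
# A: the 9 defective parts (3 from the old batch, 6 from the new batch of 60).
B = set(range(1, 41))
A = {1, 2, 3} | {95, 96, 97, 98, 99, 100}

def N(S):
    return len(S)

# Eqs (3.18)/(3.19): N(A U B) = N(A) + N(B) - N(A n B)
print(N(A | B), N(A) + N(B) - N(A & B))   # 46 46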

3.2.3 Probability Set Function

Let P(·) be an additive set function defined on all subsets of Ω, the sample space of all the possible outcomes of an experiment, such that:

1. P(A) ≥ 0 for every A ⊂ Ω;

2. P(Ω) = 1;

3. P(A ∪ B) = P(A) + P(B) for all mutually exclusive events A and B;

then P(·) is a probability set function.

Remarkably, these three simple rules (axioms), due to Kolmogorov, are sufficient to develop the mathematical theory of probability. The following are important properties of P(·) arising from these axioms:

1. To each event A, it assigns a non-negative number, P(A), its probability;

2. To the certain event Ω, it assigns unit probability;

3. The probability that either one or the other of two mutually exclusive events A, B will occur is the sum of the probabilities that each event will occur.

The following corollaries are important consequences of the foregoing three axioms:
Corollary 1. P(A*) = 1 − P(A).
The probability of non-occurrence of A is 1 minus the probability of its occurrence. Equivalently, the probability of the occurrence of an event and the probability of its non-occurrence add up to 1. This follows from the fact that

Ω = A ∪ A*;        (3.20)

that A and A* are disjoint sets; that P(·) is an additive set function; and that P(Ω) = 1.

Corollary 2. P(∅) = 0.
The probability of an impossible event occurring is zero. This follows from the fact that ∅ = Ω* and from Corollary 1 above.

Corollary 3. A ⊂ B ⇒ P(A) ≤ P(B).
If A is a subset of B, then the probability of occurrence of A is less than, or equal to, the probability of the occurrence of B. This follows from the fact that under these conditions, B can be represented as the union of 2 disjoint sets:

B = A ∪ (B ∩ A*)        (3.21)

and from the additivity of P(·),

P(B) = P(A) + P(B ∩ A*)        (3.22)

so that from the non-negativity of P(·), we obtain

P(B) ≥ P(A)        (3.23)

Corollary 4. 0 ≤ P(A) ≤ 1 for all A ⊂ Ω.
The probability of any realistic event occurring is bounded between zero and 1. This follows directly from the first 2 axioms and from Corollary 3 above.

Corollary 5. P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for any pair of subsets A and B.
This follows directly from the additivity of P(·) and the result presented earlier in Eq (3.18).

3.2.4 Final considerations

Thus far, in assembling the machinery for dealing with random phenomena by characterizing the aggregate ensemble of all possible outcomes, we have encountered the sample space Ω, whose elements are all the possible outcomes of an experiment; we have presented events as collections of these outcomes (and hence subsets of Ω); and finally P(·), the probability set function defined on subsets of Ω, allows the axiomatic definition of the probability of an event. What we need next is a method for actually obtaining any particular probability P(A) once the event A has been defined. Before we can do this, however, for completeness, a set of final considerations is in order.

Even though as presented, events are subsets of Ω, not all subsets of Ω are events. There are all sorts of subtle mathematical reasons for this, including the (somewhat unsettling) case in which Ω consists of infinitely many elements, as is the case when the outcome is a continuous entity and can therefore take on values on the real line. In this case, clearly, Ω is the set of all real numbers. A careful treatment of these issues requires the introduction of Borel fields (see, for example, Kingman and Taylor, 1966, Chapter 11¹). This is necessary because, as the reader may have anticipated, the calculus of probability requires making use of set operations, unions and intersections, as well as sequences and limits of events. As a result, it is important that sets resulting from such operations are themselves events. This is strictly true of Borel fields.

Nevertheless, for all practical purposes, and most practical applications, it is often not necessary to distinguish between the subsets of Ω and genuine events. For the reader willing to accept on faith the end result (the probability distribution function presented fully in Chapters 4 and 5), a lack of detailed knowledge of such subtle, but important, fine points will not constitute a hindrance to the appropriate use of the tool.

¹ Kingman, J.F.C. and Taylor, S.J., Introduction to the Theory of Measure and Probability, Cambridge University Press, 1966.

3.3 Probability

We are now in a position to discuss how to use the machinery we have assembled above to determine the probability of any particular event A.

3.3.1 The Calculus of Probability

Once the sample space Ω for any random experiment has been specified and the events (subsets of the sample space) identified, the following is the procedure for determining the probability of any event A, based on the important property that elementary events are mutually exclusive:

- Assign probabilities to all the elementary events in Ω;

- Determine the probability of any compound event from the probabilities of the elementary events making up the compound event of interest.

The procedure is particularly straightforward to illustrate for discrete sample spaces with a countable number of elements. For example, if Ω = {d_1, d_2, . . . , d_N} consists of N outcomes, then there are N elementary events, E_i = {d_i}. To each of these elementary events, we assign the probability p_i (we will discuss shortly how such assignments are made) subject to the constraint that ∑_{i=1}^{N} p_i = 1. From here, if

A = {d_1, d_2, d_4}        (3.24)

and if P(A) represents the probability of event A occurring, then,

P(A) = p_1 + p_2 + p_4        (3.25)

and for

B = {d_3, d_5, . . . , d_N}        (3.26)

then

P(B) = 1 − p_1 − p_2 − p_4        (3.27)

The following examples illustrate how probabilities p_i may be assigned to elementary events.

Example 3.8 ASSIGNMENTS FOR EQUIPROBABLE OUTCOMES
The experiment of tossing a coin 3 times and recording the observed number of heads and tails was considered in Examples 3.1 and 3.2. There the sample space was obtained in Eq (3.1) as:

Ω = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT},        (3.28)

a set with 8 elements that comprise all the possible outcomes of the experiment. Several events associated with this experiment were identified in Example 3.2.

If there is no reason for any one of the 8 possible outcomes to be any more likely to occur than any other one, the outcomes are said to be equiprobable and we assign a probability of 1/8 to each one. This gives rise to the following equiprobable assignment of probability to the 8 elementary events:

P(E_1) = P{HHH} = 1/8
P(E_2) = P{HHT} = 1/8
P(E_3) = P{HTH} = 1/8
. . .
P(E_7) = P{TTH} = 1/8
P(E_8) = P{TTT} = 1/8        (3.29)

Note that

∑_{i=1}^{8} p_i = ∑_{i=1}^{8} P(E_i) = 1        (3.30)

And now, because the event

A = {HHT, HTH, THH}        (3.31)

identified in Example 3.2 (the event that exactly 2 heads are observed) consists of the three elementary events E_2, E_3 and E_4, so that

A = E_2 ∪ E_3 ∪ E_4,        (3.32)

and because these sets are disjoint, we have that

P(A) = P(E_2) + P(E_3) + P(E_4) = 3/8        (3.33)

Similarly, for the events B, C and D identified in Example 3.2, we have, respectively, P(C) = 4/8 = 0.5 and P(B) = P(D) = 1/8.
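The calculus just illustrated is mechanical enough to automate. The following Python sketch (an illustration of ours, not part of the text) assigns probability 1/8 to each elementary event and recovers the event probabilities of Example 3.8 by summation:

from fractions import Fraction
from itertools import product

Omega = [''.join(seq) for seq in product('HT', repeat=3)]
p = {w: Fraction(1, 8) for w in Omega}   # equiprobable assignment, Eq (3.29)

def P(event):
    # The probability of any event is the sum of the probabilities
    # of the (mutually exclusive) elementary events that compose it.
    return sum(p[w] for w in event)

A = {w for w in Omega if w.count('H') == 2}   # exactly 2 heads
C = {w for w in Omega if w.count('H') >= 2}   # at least 2 heads
print(P(A), P(C), P({'TTT'}), P({'HHH'}))     # 3/8, 1/2 (= 4/8), 1/8, 1/8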

Other means of probability assignment are possible, as illustrated by the following example.

Example 3.9 ALTERNATIVE ASSIGNMENTS FROM À-PRIORI KNOWLEDGE
Consider the manufacturing example discussed in Examples 3.3 and 3.4. Suppose that historically 75% of manufactured batches have been of 1st grade, 15% of grade 2, and the rest rejected. Assuming that nothing has changed in the manufacturing process, use this information to assign probabilities to the elementary events identified in Example 3.3, and determine the probabilities for all the possible events associated with this problem.

Solution:
Recall from Examples 3.3 and 3.4 that the sample space in this case is Ω = {1, 2, 3}, containing all 3 possible outcomes; the 3 elementary events are E_1 = {1}; E_2 = {2}; E_3 = {3}; the other events (the remaining subsets of the sample space) had been previously identified as: E_0 = ∅; E_4 = {1, 2}; E_5 = {1, 3}; E_6 = {2, 3}; E_7 = {1, 2, 3}.

From the provided information, observe that it is entirely reasonable to assign probabilities to the elementary events as follows:

P(E_1) = 0.75        (3.34)
P(E_2) = 0.15        (3.35)
P(E_3) = 0.10        (3.36)

Note that these probabilities sum to 1 as required. From here we may now compute the probabilities for the other events:

P(E_4) = P(E_1) + P(E_2) = 0.90        (3.37)
P(E_5) = P(E_1) + P(E_3) = 0.85        (3.38)
P(E_6) = P(E_2) + P(E_3) = 0.25        (3.39)

For completeness, we note that P(E_0) = 0; and P(E_7) = 1.

3.3.2 Implications

It is worth spending a few moments to reflect on the results obtained from this last example.

The premise is that the manufacturing process is subject to many sources of variability, so that despite having an objective of maintaining consistent product quality, its product may still fall into any one of the three quality grade levels in an unpredictable manner. Nevertheless, even though the particular grade (outcome) of any particular tested sample (experiment) is uncertain and unpredictable, this example shows us how we can determine the probability of the occurrence of the entire collection of all possible events. First, the more obvious elementary events: for example, the probability that a sample will be grade 1 is 0.75. Even the less obvious complex events have also been characterized. For example, if we are interested in the probability of making any money at all on what is currently being manufactured, this is the event E_4 (producing saleable grade 1 or 2 material); the answer is 0.9. The probability of not making grade 1 material is 0.25 (the non-occurrence of event E_1 or, equivalently, the event E_6).

With this example, what we have actually done is to construct a model of how the probability of the occurrence of events is distributed over the entire collection of all possible events. In subsequent chapters, we make extensive use of the mechanism illustrated here in developing probability models for complex random phenomena, proceeding from the probabilities of elementary events and employing the calculus of probability to obtain the required probability distribution expressions.

3.4 Conditional Probability

3.4.1 Illustrating the Concept

Consider a chemical engineering thermodynamics class consisting of 50 total students, of which 38 are undergraduates and the rest are graduate students. Of the 12 graduate students, 8 are chemistry students; of the 38 undergraduates, 10 are chemistry students. We may define the following sets:

- Ω, the (universal) set of all students (50 elements);

- G, the set of graduate students (12 elements);

- C, the set of chemistry students (18 elements).

Note that the set G ∩ C, the set of graduate chemistry students, contains 8 elements. (See Fig 3.2.)

FIGURE 3.2: Venn diagram of students in a thermodynamics class

We are interested in the following problem: select a student at random; given that the choice results in a chemistry student, what is the probability that she/he is a graduate student? This is a problem of finding the probability of the occurrence of an event conditioned upon the prior occurrence of another one.

In this particular case, the total number of students in the chemistry group is 18, of which 8 are graduates. The required probability is thus precisely that of choosing one of the 8 graduate students out of all the possible 18 chemistry students; and, assuming equiprobable outcomes, this probability is 8/18. (Note also, from the definition of the sets above, that P(C) = 18/50 and P(G ∩ C) = 8/50.)

We may now formalize the just-illustrated concept as follows.

3.4.2 Formalizing the Concept

For two sets A and B, the conditional probability of A given B, denoted P(A|B), is defined as

P(A|B) = P(A ∩ B)/P(B)        (3.40)

where P(B) > 0.

Observe how the set B now plays the role that Ω played in unconditional probability (see Fig 3.3); in other words, the process of conditioning restricts the set of relevant outcomes to B. In this sense, P(A) is really P(A|Ω), which, according to Eq (3.40), may be written as

P(A|Ω) = P(A ∩ Ω)/P(Ω) = P(A)/1        (3.41)

FIGURE 3.3: The role of the conditioning set B in conditional probability

Returning now to the previous illustration, we see that the required quantity is P(G|C), and by definition,

P(G|C) = P(G ∩ C)/P(C) = (8/50)/(18/50) = 8/18        (3.42)

as obtained previously. The unconditional probability P(G), by contrast, is 12/50.
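As a quick numerical companion to Eq (3.42), the short sketch below reproduces the computation with exact fractions:

from fractions import Fraction

# 50 students in all; 18 chemistry students, 8 of whom are graduate students.
P_C = Fraction(18, 50)          # P(C)
P_G_and_C = Fraction(8, 50)     # P(G n C)

# Eq (3.40): P(G|C) = P(G n C) / P(C)
print(P_G_and_C / P_C)          # 4/9, i.e. 8/18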


The conditional probability P (A|B) possesses all the required properties
of a probability set function dened on subsets of B:

74

Random Phenomena
B
A

A B*

AB

FIGURE 3.4: Representing set A as a union of 2 disjoint sets


1. 0 < P (A|B) 1;
2. P (B|B) = 1;
3. P (A1 A2 |B) = P (A1 |B) + P (A2 |B) for disjoint A1 and A2 .
The following identities are easily derived from the definition given above for P(A|B):

P(A ∩ B) = P(B)P(A|B);  P(B) > 0        (3.43)
P(A ∩ B) = P(A)P(B|A);  P(A) > 0        (3.44)

Conditional probability is a particularly important concept in science and engineering applications because we often have available to us some à-priori knowledge about a phenomenon; the required probabilities then become conditioned upon the available information.

3.4.3 Total Probability

It is possible to obtain total probabilities when only conditional probabilities are available. We now present some very important results relating conditional probabilities to total probability.

Consider events A and B, not necessarily disjoint. From the Venn diagram in Fig 3.4, we may write A as the union of 2 disjoint sets as follows:

A = (A ∩ B) ∪ (A ∩ B*)        (3.45)

FIGURE 3.4: Representing set A as a union of 2 disjoint sets

In words, this expression states that the points in A are made up of two groups: the points in A that are also in B, and the points in A that are not in B. And because the two sets are disjoint, so that the events they represent are mutually exclusive, we have:

P(A) = P(A ∩ B) + P(A ∩ B*)        (3.46)

and from the definition of conditional probability, we obtain:

P(A) = P(A|B)P(B) + P(A|B*)P(B*)        (3.47)

or, alternatively,

P(A) = P(A|B)P(B) + P(A|B*)[1 − P(B)]        (3.48)

This powerful result states that the (unconditional, or total) probability of an event A is a weighted average of two partial (or conditional) probabilities: the probability conditioned on the occurrence of B and the probability conditioned upon the non-occurrence of B; the weights, naturally, are the respective probabilities of the conditioning events.

This may be generalized as follows. First we partition Ω into a union of k disjoint sets:

Ω = B_1 ∪ B_2 ∪ B_3 ∪ . . . ∪ B_k = ⋃_{i=1}^{k} B_i        (3.49)

For any A that is an arbitrary subset of Ω, observe that

A = (A ∩ B_1) ∪ (A ∩ B_2) ∪ (A ∩ B_3) ∪ . . . ∪ (A ∩ B_k)        (3.50)

which is a partitioning of the set A as a union of k disjoint sets (see Fig 3.5).

FIGURE 3.5: Partitioned sets for generalizing the total probability result

As a result,

P(A) = P(A ∩ B_1) + P(A ∩ B_2) + . . . + P(A ∩ B_k)        (3.51)

but since

P(A ∩ B_i) = P(A|B_i)P(B_i)        (3.52)

we immediately obtain

P(A) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + . . . + P(A|B_k)P(B_k)        (3.53)

Thus:

P(A) = ∑_{i=1}^{k} P(A|B_i)P(B_i)        (3.54)


an expression that is sometimes referred to as the Theorem of total probability used to compute total probability P (A) from P (A|Bi ) and P (Bi ).
The following example provides an illustration.
Example 3.10 TOTAL PROBABILITY
A company manufactures light bulbs of 3 different types (T1, T2, T3), some of which are defective right from the factory. From experience with the manufacturing process, it is known that the fraction of defective Type 1 bulbs is 0.1; Types 2 and 3 have respective defective fractions of 1/15 and 0.2.
A batch of 200 bulbs was sent to a quality control laboratory for testing: 100 Type 1, 75 Type 2, and 25 Type 3. What is the probability of finding a defective bulb?
Solution:
The supplied information may be summarized as follows: prior conditional probabilities of defectiveness,

P(D|T1) = 0.1;  P(D|T2) = 1/15;  P(D|T3) = 0.2;     (3.55)

and the distribution of numbers of bulb types in the test batch:

N(T1) = 100;  N(T2) = 75;  N(T3) = 25.     (3.56)

Assuming equiprobable outcomes, this number distribution immediately implies the following:

P(T1) = 100/200 = 0.5;  P(T2) = 0.375;  P(T3) = 0.125     (3.57)

From the expression for total probability in Eq (3.53), we have:

P(D) = P(D|T1)P(T1) + P(D|T2)P(T2) + P(D|T3)P(T3) = 0.1     (3.58)
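The arithmetic behind Eq (3.58) is easy to mechanize. The following short Python sketch (an added illustration, not part of the original text) reproduces the total probability computation of Example 3.10:

    # Total probability for Example 3.10 (a minimal sketch)
    p_D_given_T = {"T1": 0.1, "T2": 1.0/15.0, "T3": 0.2}   # P(D|Ti)
    p_T = {"T1": 100/200, "T2": 75/200, "T3": 25/200}      # P(Ti)

    # Theorem of total probability: P(D) = sum over i of P(D|Ti)P(Ti)
    p_D = sum(p_D_given_T[t] * p_T[t] for t in p_T)
    print(p_D)   # 0.1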

3.4.4 Bayes' Rule

A question of practical importance in many applications is:

Given P(A|Bi) and P(Bi), how can we obtain P(Bi|A)?

In other words, how can we "reverse" probabilities?


The total probability expression we have just derived provides a way to answer this question. Note from the definition of conditional probability that:

P(Bi|A) = P(Bi ∩ A) / P(A)     (3.59)


but

P(Bi ∩ A) = P(A ∩ Bi) = P(A|Bi)P(Bi)     (3.60)

which, when substituted into (3.59), gives rise to a very important result:

P(Bi|A) = P(A|Bi)P(Bi) / ∑_{j=1}^{k} P(A|Bj)P(Bj)     (3.61)

This famous result, due to the Revd. Thomas Bayes (1763), is known as Bayes' Rule, and we will encounter it again in subsequent chapters. For now, it is an expression that can be used to compute the (unknown) a posteriori probability P(Bi|A) of events Bi from the a priori probabilities P(Bi) and the (known) conditional probabilities P(A|Bi). It indicates that the unknown a posteriori probability is proportional to the product of the a priori probability and the known conditional probability we wish to reverse; the constant of proportionality is the reciprocal of the total probability of event A.
This result is the basis of an alternative approach to data analysis (discussed in Section 14.6 of Chapter 14) wherein available prior information is incorporated in a systematic fashion into the analysis of experimental data.
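To make the "probability reversal" concrete, the brief Python sketch below (an added illustration, not part of the original text) applies Eq (3.61) to the light-bulb data of Example 3.10 to obtain P(Ti|D), the probability that a bulb found to be defective is of type Ti:

    # Bayes' Rule applied to Example 3.10 (a minimal sketch)
    p_D_given_T = {"T1": 0.1, "T2": 1.0/15.0, "T3": 0.2}   # known conditionals P(D|Ti)
    p_T = {"T1": 0.5, "T2": 0.375, "T3": 0.125}            # a priori probabilities P(Ti)

    p_D = sum(p_D_given_T[t] * p_T[t] for t in p_T)        # total probability, P(D) = 0.1

    # a posteriori probabilities: P(Ti|D) = P(D|Ti)P(Ti)/P(D)
    p_T_given_D = {t: p_D_given_T[t] * p_T[t] / p_D for t in p_T}
    print(p_T_given_D)   # {'T1': 0.5, 'T2': 0.25, 'T3': 0.25}

Note that even though Type 3 bulbs have the highest defective fraction, a defective bulb drawn from this particular batch is most likely to be of Type 1, simply because Type 1 dominates the batch.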

3.5 Independence

For two events A and B, the conditional probability P(A|B) was defined earlier in Section 3.4. In general, this conditional probability will be different from the unconditional probability P(A), indicating that the knowledge that B has occurred affects the probability of the occurrence of A.
However, when the occurrence of B has no effect on the occurrence of A, then the events A and B are said to be independent, and

P(A|B) = P(A)     (3.62)

so that the conditional and unconditional probabilities are identical. This will occur when

P(A ∩ B)/P(B) = P(A)     (3.63)

so that

P(A ∩ B) = P(A)P(B)     (3.64)

Thus, when events A and B are independent, the probability of the two events happening concurrently is the product of the probabilities of each one occurring by itself. Note that the expression in Eq (3.64) is symmetric in A and B, so that if A is independent of B, then B is also independent of A.
This is another in the collection of very important results used in the development of probability models. We already encountered the first one: that when two events A and B are mutually exclusive, P(A or B) = P(A ∪ B) = P(A) + P(B). Under these circumstances, P(A ∩ B) = 0, since the event A occurring together with event B is impossible when A and B are mutually exclusive. Eq (3.64) is the complementary result: that when two events are independent, P(A and B) = P(A ∩ B) = P(A)P(B).
Extended to three events, the result states that the events A, B, C are independent if all of the following conditions hold:

P(A ∩ B) = P(A)P(B)     (3.65)
P(B ∩ C) = P(B)P(C)     (3.66)
P(A ∩ C) = P(A)P(C)     (3.67)
P(A ∩ B ∩ C) = P(A)P(B)P(C)     (3.68)

implying more than just pairwise independence.
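The distinction between pairwise and mutual independence is easy to demonstrate numerically. The Python sketch below (an added illustration; the two-coin construction is a standard textbook device, not from the original text) exhibits three events satisfying Eqs (3.65)-(3.67) but violating Eq (3.68):

    from itertools import product
    from fractions import Fraction

    # Sample space for two fair coin tosses; each outcome has probability 1/4
    omega = list(product("HT", repeat=2))
    P = lambda E: Fraction(sum(1 for w in omega if E(w)), len(omega))

    A = lambda w: w[0] == "H"        # first toss is a head
    B = lambda w: w[1] == "H"        # second toss is a head
    C = lambda w: w[0] != w[1]       # the two tosses differ

    print(P(lambda w: A(w) and B(w)) == P(A) * P(B))   # True: Eq (3.65) holds
    print(P(lambda w: B(w) and C(w)) == P(B) * P(C))   # True: Eq (3.66) holds
    print(P(lambda w: A(w) and C(w)) == P(A) * P(C))   # True: Eq (3.67) holds
    print(P(lambda w: A(w) and B(w) and C(w)) == P(A) * P(B) * P(C))   # False: Eq (3.68) fails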

3.6 Summary and Conclusions

This chapter has been primarily concerned with assembling the machinery of probability from the building blocks of events in the sample space Ω, the collection of all possible randomly varying outcomes of an experiment. We have seen how the probability of an event A arises naturally from the probability set function, an additive set function defined on the set Ω that satisfies the three axioms of Kolmogorov.
Having established the concept of probability and how the probability of any subset of Ω can be computed, a straightforward extension to special events restricted to conditioning sets in Ω led to the related concept of conditional probability. The idea of total probability, the result known as Bayes' rule, and especially the concept of independence all arise naturally from conditional probability and have profound consequences for random phenomena analysis that cannot be fully appreciated until much later.
We note in closing that the presentation of probability in this chapter (especially as a tool for solving problems involving randomly varying phenomena) is still quite rudimentary because the development is not quite complete yet. The final step in the development of the probability machinery, undertaken primarily in the next chapter, requires the introduction of the random variable, X, from which the analysis tool, the probability distribution function, f(x), emerges and is fully characterized.
Here are some of the main points of the chapter again:
• Events, as subsets of the sample space, Ω, can be elementary (simple) or compound (complex); if elementary, then they are mutually exclusive; if compound, then they can be composed from several simple events.


• Once probabilities have been assigned to all elementary events in Ω, then P(A), the probability of any other subset A of Ω, can be determined on the basis of the probability set function P(·) defined on all subsets of Ω according to the three axioms of Kolmogorov:
1. P(A) ≥ 0 for every A ⊂ Ω;
2. P(Ω) = 1;
3. P(A ∪ B) = P(A) + P(B) for all mutually exclusive events A and B.
• Conditional Probability: For any two events A and B in Ω, the conditional probability P(A|B) is given by

P(A|B) = P(A ∩ B) / P(B)
• Total Probability: Given conditional (partial) probabilities P(A|Bi), and P(Bi) for each conditioning set, the unconditional (total) probability of A is given by

P(A) = ∑_{i=1}^{k} P(A|Bi)P(Bi)
• Mutual Exclusivity: Two events A and B are mutually exclusive if

P(A ∪ B) = P(A) + P(B),

in which case P(A ∩ B) = 0.
• Independence: Two events A and B are independent if

P(A ∩ B) = P(A)P(B)

REVIEW QUESTIONS
1. What are the five basic building blocks of probability theory as presented in Section 3.1? Define each one.
2. What is a simple (or elementary) event and how is it different from a complex (or compound) event?
3. Why are elementary events mutually exclusive?
4. What is the relationship between events and the sample space?
5. In the language of events, what does the empty set, ∅, represent? What does the entire sample space, Ω, represent?


6. Given two sets A and B, in the language of events, what do the following sets represent: A′; A ∪ B; and A ∩ B?
7. What does it mean that two events A and B are mutually exclusive?
8. What is a set function in general and what is an additive set function in particular?
9. What are the three fundamental properties of a probability set function (also known as Kolmogorov's axioms)?
10. How is the probability of any event A determined from the elementary events in Ω?
11. For any two sets A and B, what is the definition of P(A|B), the conditional probability of A given B? If the two sets are disjoint such that A ∩ B = ∅, in words, what does P(A|B) mean in this case?
12. How does one obtain total probability from partial (i.e., conditional) probabilities?
13. What is Bayes' rule and what is it used for?
14. Given P(A|Bi) and P(Bi), how does one reverse the probability to determine P(Bi|A)?
15. What does it mean for two events A and B to be independent?
16. What is P(A ∩ B) when two events A and B are (i) mutually exclusive, and (ii) independent?

EXERCISES
Section 3.1
3.1 When two dice—one black with white dots, the other white with black dots—are tossed once, simultaneously, and the number of dots shown on each die's top face after coming to rest are recorded as an ordered pair (nB, nW), where nB is the number on the black die, and nW the number on the white die,
(i) identify the experiment, what constitutes a trial, the outcomes, and the sample space.
(ii) If the sum of the numbers on the two dice is S, i.e.,

S = nB + nW,     (3.69)

enumerate all the simple events associated with the observation S = 7.


3.2 In an opinion poll, 20 individuals selected at random from a group of college students are asked to indicate which of three options—approve, disapprove, indifferent—best matches their individual opinions of a new campus policy. Let n0 be the number of indifferent students, n1 the number that approve, and n2 the number that disapprove, so that the outcome of one such opinion sample is the ordered triplet (n0, n1, n2). Write mathematical expressions in terms of the numbers n0, n1, and n2 for the following events:
(i) A = {Unanimous support for the policy}; and A′, the complement of A.
(ii) B = {More students disapprove than approve}; and B′.
(iii) C = {More students are indifferent than approve};
(iv) D = {The majority of students are indifferent}.
Section 3.2
3.3 Given the following two sets A and B:

A = {x : x = 1, 3, 5, 7, . . .}     (3.70)
B = {x : x = 0, 2, 4, 6, . . .}     (3.71)

find A ∪ B and A ∩ B.

3.4 Let Ak = {x : 1/(k + 1) ≤ x ≤ 1} for k = 1, 2, 3, . . .. Find the set B defined by:

B = A1 ∪ A2 ∪ A3 ∪ . . . = ⋃_{i=1}^{∞} Ai     (3.72)

3.5 For sets A, B, C, subsets of the universal set Ω, establish the following identities:

(A ∪ B)′ = A′ ∩ B′     (3.73)
(A ∩ B)′ = A′ ∪ B′     (3.74)
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)     (3.75)
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)     (3.76)

3.6 For every pair of sets A, B, subsets of the sample space Ω upon which the probability set function P(·) has been defined, prove that:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)     (3.77)

3.7 In a certain engineering research and development company, apart from the support staff, which number 25, all other employees are either engineers or statisticians or both. The total number of employees (including the support staff) is 100. Of these, 50 are engineers, and 40 are statisticians; the number of employees that are both engineers and statisticians is not given. Find the probability that an employee chosen at random is not one of those classified as being both an engineer and a statistician.
Section 3.3
3.8 For every set A, let the set function Q(·) be defined as follows:

Q(A) = ∑_{x∈A} f(x)     (3.78)

where

f(x) = (1/3)(2/3)^x;  x = 0, 1, 2, . . .     (3.79)


If A1 = {x : x = 0, 1, 2, 3} and A2 = {x : x = 0, 1, 2, 3, . . .}, find Q(A1) and Q(A2).


3.9 Let the sample space for a certain experiment be Ω = {ω : 0 < ω < ∞}. Let A represent the event A = {ω : 4 < ω < ∞}. If the probability set function P(A) is defined for any subset A of the sample space according to:

P(A) = ∫_A e^{−x} dx     (3.80)

evaluate P(A), P(A′), and P(A ∪ A′).
3.10 For the experiment of rolling two dice—one black with white dots, the other white with black dots—once, simultaneously, presented in Exercise 3.1, first obtain Ω, the sample space, and, by assigning equal probability to each of the outcomes, determine the probability of the following events:
(i) A = {nB + nW = 7}, i.e., the sum is 7;
(ii) B = {nB < nW};
(iii) B′, the complement of B;
(iv) C = {nB = nW}, i.e., the two dice show the same number;
(v) D = {nB + nW = 5 or 9}.
3.11 A black velvet bag contains three red balls and three green balls. Each experiment involves drawing two balls at once, simultaneously, and recording their colors, R for red, and G for green.
(i) Obtain the sample space, assuming that balls of the same color are indistinguishable.
(ii) Upon assigning equal probability to each element in the sample space, determine the probability of drawing two balls of different colors.
(iii) If the balls are distinguishable and numbered from 1 to 6, and if the two balls are drawn sequentially, not simultaneously, now obtain the sample space and from this determine the probability of drawing two balls of different colors.
3.12 An experiment is performed by selecting a card from an ordinary deck of 52 playing cards. The outcome, ω, is the type of card chosen, classified as: "Ace," "King," "Queen," "Jack," and "others." The random variable X(ω) assigns the number 4 to the outcome ω if ω is an "Ace"; X(ω) = 3 if the outcome is a "King"; X(ω) = 2 if the outcome is a "Queen," and X(ω) = 1 if the outcome is a "Jack"; X(ω) = 0 for all other outcomes.
(i) What is the space V of this random variable?
(ii) If the probability set function P(ω) defined on the subsets of the original sample space Ω assigns a probability 1/52 to each of these outcomes, describe the induced probability set function PX(A) induced on all the subsets of the space V by this random variable.
(iii) Describe a physical (scientific or engineering) problem for which the above would be a good surrogate model.
3.13 Obtain the sample space, Ω, for the experiment involving tossing a fair coin 4 times. Upon assigning equal probability to each outcome, determine the probabilities of obtaining 0, 1, 2, 3, or 4 heads. Confirm that your result is consistent with the postulate that the probability model for this phenomenon is given by the probability distribution function:

f(x) = n!/(x!(n − x)!) p^x (1 − p)^{n−x}     (3.81)

where f(x) is the probability of obtaining x heads in n = 4 tosses, and p = 1/2 is the probability of obtaining a head in a single toss of the coin. (See Chapter 8.)
3.14 In the fall of 2007, k students born in 1989 attended an all-freshman introductory general engineering class at the University of Delaware. Confirm that if p is the probability that at least two of the students have the same birthday, then:

1 − p = 365! / [(365 − k)! (365)^k]     (3.82)

Show that for a class with 23 or more students born in 1989, the probability of at least 2 students sharing the same birthday is more than 1/2, i.e., if k ≥ 23 then p > 1/2.
Sections 3.4 and 3.5
3.15 Six simple events, with probabilities P(E1) = 0.11; P(E2) = P(E5) = 0.20; P(E3) = 0.25; P(E4) = 0.09; P(E6) = 0.15, constitute the entire set of outcomes of an experiment. The following events are of interest:

A = {E1, E2}; B = {E2, E3, E4}; C = {E5, E6}; D = {E1, E2, E5}

Determine the following probabilities:
(i) P(A), P(B), P(C), P(D);
(ii) P(A ∪ B), P(A ∩ B); P(A ∪ D), P(A ∩ D); P(B ∪ C), P(B ∩ C);
(iii) P(B|A), P(A|B); P(B|C), P(D|C).
Which of the events A, B, C and D are mutually exclusive?
3.16 Assuming that giving birth to a boy or a girl is equally likely, and further, that no multiple births have occurred, first determine the probability of a family having three boys in a row. Now consider the conjecture (based on empirical data) that, for a family that has already had two boys in a row, the probability of having a third boy is 0.8. Under these conditions, what is now the probability of a family having three boys in a row?
3.17 As a follow-up to the concept of independence of two events A and B, event A is said to be "attracted" to event B if

P(A|B) > P(A)     (3.83)

Event A is said to be "repelled" by event B if

P(A|B) < P(A)     (3.84)

(Of course, when P(A|B) = P(A), the two events have been previously identified as independent.) Establish the result that if B attracts A, then: (i) A attracts B (mutual attraction); and (ii) B′ repels A.

3.18 Show that if A and B are independent, then A′ and B′ are also independent.

3.19 Show that for two events A and B, P(A ∩ B|A ∪ B) ≤ P(A ∩ B|A). State the condition for equality.
3.20 An exchange student from Switzerland, who is male, has been assigned to be your partner in an introductory psychology class. As part of a class assignment, he responds to your question about his family by stating only that he comes from a family of two children, without specifying whether he is the older or the younger. What is the probability that his sibling is female? Assume equal probability of having a boy or a girl. Why does this result seem counterintuitive at first?
3.21 A system consisting of two components A and B that are connected in series functions if both of them function. If P(A), the probability that component A functions, is 0.99, and the probability that component B functions is 0.90, find the probability that this series system functions, assuming that whether one component functions or not is independent of the status of the other component. If these components are connected in parallel, the system fails (i.e., will not function) only if both components fail. Assuming independence, determine the probability that the parallel system functions. Which probability is higher, and why is it reasonable to expect a higher probability from the system in question?
3.22 The functioning status of a complex system that consists of several components arranged in series and parallel and with cross-links (i.e., whether the system functions or not) can be determined from the status of a "keystone" component, Ck. If the probability that the keystone component for a particular system functions is given as P(Ck) = 0.9, the probability that the system functions when the keystone functions, P(S|Ck), is given as 0.9, and the complementary probability that the system functions when the keystone does not function, P(S|C′k), is given as 0.8, find the unconditional probability, P(S), that the system functions.

APPLICATION PROBLEMS
3.23 Patients suffering from manic depression and other similar disorders are sometimes treated with lithium, but the dosage must be monitored carefully because lithium toxicity, which is often fatal, can be difficult to diagnose. A new assay used to determine lithium concentration in blood samples is being promoted as a reliable way to diagnose lithium toxicity because the assay result is purported to correlate very strongly with toxicity.
A careful study of the relationship between this blood assay and lithium toxicity in 150 patients yielded results summarized in Table 3.3. Here A+ indicates high lithium concentrations in the blood assay and A− indicates low lithium concentration; L+ indicates confirmed lithium toxicity and L− indicates no lithium toxicity.
(i) From these data, compute the following probabilities regarding the lithium toxicity status of a patient chosen at random:

TABLE 3.3: Lithium toxicity study results

         Lithium Toxicity
Assay    L+      L−      Total
A+       30      17       47
A−       21      82      103
Total    51      92      150

1. P(L+), the probability that the patient has lithium toxicity (regardless of the blood assay result);
2. P(L+|A+), the conditional probability that the patient has lithium toxicity given that the blood assay result indicates high lithium concentration. What does this value indicate about the potential benefit of having this assay result available?
3. P(L+|A−), the conditional probability that the patient has lithium toxicity given that the blood assay result indicates low lithium concentration. What does this value indicate about the potential for missed diagnoses?
(ii) Compute the following probabilities regarding the blood lithium assay:
1. P(A+), the (total) probability of observing high lithium blood concentration (regardless of actual lithium toxicity status);
2. P(A+|L+), the conditional probability that the blood assay result indicates high lithium concentration given that the patient indeed has lithium toxicity. Why do you think that this quantity is referred to as the "sensitivity" of the assay, and what does the computed value indicate about the sensitivity of the particular assay in this study?
3. From information about P(L+) (as the prior probability of lithium toxicity) along with the just computed values of P(A+) and P(A+|L+) as the relevant assay results, now use Bayes' Rule to compute P(L+|A+) as the posterior probability of lithium toxicity after obtaining assay data, even though it has already been computed directly in (i) above.
3.24 An experimental crystallizer produces five different polymorphs of the same crystal via mechanisms that are currently not well-understood. Types 1, 2 and 3 are approved for pharmaceutical application A; Types 2, 3 and 4 for a different application B; Type 5 is mostly unstable and has no known application. How much of each type is made in any batch varies randomly, but with the current operating procedure, 30% of the total product made by the crystallizer in a month is of Type 1; 20% is of Type 2, with the same percentage of Types 3 and 4; and 10% is of Type 5. Assuming that the polymorphs can be separated without loss,
(i) Determine the probability of making product in a month that can be used for application A;
(ii) Given a batch ready to be shipped for application B, what is the probability that any crystal selected at random is of Type 2? What is the probability that it is of Type 3 or Type 4? State any assumptions you may need to make.


(iii) What is the probability that an order change to one for application A can be filled from a batch ready to be shipped for application B?
(iv) What is the converse probability that an order change to one for application B can be filled given a batch that is ready to be shipped for application A?
3.25 A test for a relatively rare disease involves taking from the patient an appropriate tissue sample which is then assessed for abnormality. A few sources of error are associated with this test. First, there is a small, but non-zero probability, α_s, that the tissue sampling procedure will miss abnormal cells, primarily because these cells (at least in the earlier stages), being relatively few in number, are randomly distributed in the tissue and tend not to cluster. In addition, during the examination of the tissue sample itself, there is a probability, α_f, of failing to identify an abnormality when present; and a probability, α_m, of misclassifying a perfectly normal cell as abnormal.
If the proportion of the population with this disease who are subjected to this test is θ_D,
(i) In terms of the given parameters, determine the probability that the test result is correct. (Hint: first compute the probability that the test result is incorrect, keeping in mind that the test may identify an abnormal cell incorrectly as normal, or a normal cell as abnormal.)
(ii) Determine the probability of a false positive (i.e., returning an abnormality result when none exists).
(iii) Determine the probability of a false negative (i.e., failing to identify an abnormality that is present).
3.26 Repeat Problem 3.25 for the specific values of α_s = 0.1; α_f = 0.05; α_m = 0.1, for a population in which 2% have the disease. A program sponsored by the Center for Disease Control (CDC) is to be aimed at reducing the number of false positives and/or false negatives by reducing one of the three probabilities α_s, α_f, and α_m. Which of these parameters would you recommend and why?
3.27 A manufacturer of flat-screen TVs purchases pre-cut glass sheets from three different manufacturers, M1, M2 and M3, whose products are characterized in the TV manufacturer's incoming material quality control lab as "premier grade," Q1, "acceptable grade," Q2, and "marginal grade," Q3, on the basis of objective, measurable quality criteria, such as inclusions, warp, etc. Incoming glass sheets deemed unacceptable are rejected and returned to the manufacturer. An incoming batch of 425 accepted sheets has been classified by an automatic classifying system as shown in the table below.
               Quality
Manufacturer   Premier Q1   Acceptable Q2   Marginal Q3   Total
M1                110            25             15         150
M2                150            33              2         185
M3                 76            13              1          90

If a sheet is selected at random from this batch,
(i) Determine the probability that it is of premier grade; also determine the probability that it is not of marginal grade.


(ii) Determine the probability that it is of premier grade given that it is from manufacturer M1; also determine the probability that it is of premier grade given that it is from either manufacturer M2 or M3.
(iii) Determine the probability that it is from manufacturer M3 given that it is of marginal grade; also determine the probability that it is from manufacturer M2 given that it is of acceptable grade.
3.28 In a 1984 report², the IRS published the information shown in the following table regarding 89.9 million federal tax returns it received, the income bracket of the filers, and the percentage audited.

Income Bracket        Number of filers (millions)   Percent Audited
Below $10,000                    31.4                    0.34
$10,000–$24,999                  30.7                    0.92
$25,000–$49,999                  22.2                    2.05
$50,000 and above                 5.5                    4.00

(i) Determine the probability that a tax filer selected at random from this population would be audited.
(ii) Determine the probability that a tax filer selected at random is in the $25,000–$49,999 income bracket and was audited.
(iii) If we know that a tax filer selected at random was audited, determine the probability that this person belongs in the $50,000 and above income bracket.

² Annual Report of Commissioner and Chief Counsel, Internal Revenue Service, U.S. Department of Treasury, 1984, p. 60.


Chapter 4
Random Variables and Distributions

4.1 Introduction and Definition ...................................... 90
    4.1.1 Mathematical Concept of the Random Variable ............... 90
    4.1.2 Practical Considerations ................................... 93
    4.1.3 Types of Random Variables .................................. 94
4.2 Distributions ..................................................... 95
    4.2.1 Discrete Random Variables .................................. 95
    4.2.2 Continuous Random Variables ................................ 98
    4.2.3 The Probability Distribution Function ...................... 100
4.3 Mathematical Expectation .......................................... 102
    4.3.1 Motivating the Definition .................................. 102
    4.3.2 Definition and Properties .................................. 104
4.4 Characterizing Distributions ...................................... 107
    4.4.1 Moments of a Distribution .................................. 107
    4.4.2 Moment Generating Function ................................. 113
    4.4.3 Characteristic Function .................................... 115
    4.4.4 Additional Distributional Characteristics .................. 116
    4.4.5 Entropy .................................................... 119
    4.4.6 Probability Bounds ......................................... 119
4.5 Special Derived Probability Functions ............................. 122
    4.5.1 Survival Function .......................................... 122
    4.5.2 Hazard Function ............................................ 123
    4.5.3 Cumulative Hazard Function ................................. 124
4.6 Summary and Conclusions ........................................... 124
    REVIEW QUESTIONS ................................................. 126
    EXERCISES ........................................................ 129
    APPLICATION PROBLEMS ............................................. 133

An idea, in the highest sense of that word, cannot be conveyed but by a symbol.
S. T. Coleridge (1772-1834)

Even though the machinery of probability as presented thus far can already be used to solve some practical problems, its development is far from complete. In particular, with a sample space of raw outcomes that can be anything from attributes and numbers, to letters and other sundry objects, this most basic form of probability will be quite tedious and inefficient in dealing with general random phenomena. This chapter and the next one are devoted to completing the development of the machinery of probability with the introduction of the concept of the random variable, from which arises the probability distribution function, an efficient mathematical form for representing the ensemble behavior of general random phenomena. The emergence, properties and characteristics of the probability distribution function are discussed extensively in this chapter for single-dimensional random variables; the discussion is generalized to multi-dimensional random variables in the next chapter.

4.1 Introduction and Definition

4.1.1 Mathematical Concept of the Random Variable

In general, the sample space Ω presented thus far may be quite tedious to describe and inefficient to analyze mathematically if its elements are not numbers. To facilitate mathematical analysis, it is desirable to find a means of converting this sample space into one with real numbers. This is achieved via the vehicle of the random variable, defined as follows:

Definition: Given a random experiment with a sample space Ω, let there be a function X, which assigns to each element ω ∈ Ω one and only one real number X(ω) = x. This function, X, is called a random variable.

Upon the introduction of this entity, X, the following happens (see Fig 4.1):
1. Ω is mapped onto V, i.e.

V = {x : X(ω) = x, ω ∈ Ω}     (4.1)

so that V is the set of all values x generated from X(ω) = x for all elements ω in the sample space Ω;
2. The probability set function encountered before, P, defined on Ω, gives rise to another probability set function, PX, defined on V and induced by X. PX is therefore often referred to as an induced probability set function.
The role of PX in V is identical to that of P in Ω. Thus, for any arbitrary subset A of V, PX(A) is the probability of event A occurring.
The primary question of practical importance may now be stated as follows: How does one find PX(A) in the new setting created by the introduction of the random variable X, given the original sample space Ω, and the original probability set function P defined on it?
The answer is to go back to what we know, i.e., to find that set A∗



FIGURE 4.1: The original sample space, Ω, and the corresponding space V induced by the random variable X
which corresponds to the set of values of ω in Ω that are mapped by X into A, i.e.

A∗ = {ω : ω ∈ Ω and X(ω) ∈ A}     (4.2)

Such a set A∗ is called the "pre-image" of A, that set on the original sample space from which A is obtained when X is applied on its elements (see Fig 4.1). We now simply define

PX(A) = P(A∗)     (4.3)

since, by definition of A∗,

P{X(ω) ∈ A} = P{ω ∈ A∗}     (4.4)

from where we see how X induces PX(·) from the known P(·). It is easy to show that the induced PX is an authentic probability set function in the spirit of Kolmogorov's axioms.
Remarks:
1. The random variable is X; the value it takes is the real number x. The one is a completely different entity from the other.
2. The expression P(X = x) will be used to indicate the probability that the application of the random variable X results in an outcome with assigned value x; or, more simply, the probability that the random variable X "takes on" a particular value x. As such, "X = x" should not be confused with the familiar arithmetic statement of equality or equivalence.
3. In many instances, the starting point is the space V and not the tedious sample space Ω, with PX(·) already defined so that there is no further need for reference to a P(·) defined on Ω.


Let us illustrate these concepts with some examples.

Example 4.1 RANDOM VARIABLE AND INDUCED PROBABILITY FUNCTION FOR COIN TOSS EXPERIMENT
The experiment in Example 3.1 in Chapter 3 involved tossing a coin 3 times and recording the number of observed heads and tails. From the sample space obtained there, define a random variable X as the total number of tails obtained in the 3 tosses. (1) Obtain the new space V and, (2) if A is the event that X = 2, determine the probability of this event's occurrence.
Solution:
(1) Recall from Example 3.1 that the sample space is given by

Ω = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}     (4.5)

consisting of all possible 8 outcomes, represented respectively as ωi; i = 1, 2, . . . , 8, i.e.

Ω = {ω1, ω2, ω3, ω4, ω5, ω6, ω7, ω8}.     (4.6)

This is clearly one of the "tedious" types, not as conveniently amenable to mathematical manipulation. And now, by definition of X, we see that X(ω1) = 0; X(ω2) = X(ω3) = X(ω4) = 1; X(ω5) = X(ω6) = X(ω7) = 2; X(ω8) = 3, from where we now obtain the space V as:

V = {0, 1, 2, 3}     (4.7)

since these are all the possible values that X can take.
(2) To obtain PX(A), first we find A∗, the pre-image of A in Ω. In this case,

A∗ = {ω5, ω6, ω7}     (4.8)

so that upon recalling the probability set function P(·) generated in Chapter 3 on the assumption of equiprobable outcomes, we obtain P(A∗) = 3/8, hence,

PX(A) = P(A∗) = 3/8     (4.9)
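The mapping from Ω to V, and the probability it induces, can be mimicked directly on a computer. The short Python sketch below (an added illustration, not part of the original text) enumerates Ω for the 3-coin toss, applies X(ω) = total number of tails, and recovers PX(X = 2) = 3/8:

    from itertools import product
    from fractions import Fraction

    omega = ["".join(w) for w in product("HT", repeat=3)]   # the 8 equiprobable outcomes
    X = lambda w: w.count("T")                              # random variable: total tails

    # Pre-image A* of the event A = {X = 2}; induced probability PX(A) = P(A*)
    A_star = [w for w in omega if X(w) == 2]
    print(A_star)                              # ['HTT', 'THT', 'TTH']
    print(Fraction(len(A_star), len(omega)))   # 3/8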

The next two examples illustrate sample spaces that occur naturally in the form of V.

Example 4.2 SAMPLE SPACE FOR SINGLE DIE TOSS EXPERIMENT
Consider an experiment in which a single die is thrown and the outcome is the number that shows up on the die's top face when it comes to rest. Obtain the sample space of all possible outcomes.
Solution:
The required sample space is the set {1, 2, 3, 4, 5, 6}, since this set of numbers is an exhaustive collection of all the possible outcomes of this experiment. Observe that this is a set of real numbers, so that it is already in the form of V. We can therefore define a probability set function directly on it, with no further need to obtain a separate V and an induced PX(·).


Strictly speaking, this last example did involve an implicit application of a random variable, if we acknowledge that the primitive outcomes for the die toss experiment are actually dots on the top face of the die. However, by pre-specifying the outcome as a count of the dots shown on the resting top face, we simply skipped a step and went straight to the result of the application of the random variable transforming the dots to the count. The next example also involves similar die tosses, but in this case, the application of the random variable is explicit, following the implicit one that automatically produces numbers as the de-facto outcomes.
Example 4.3 SAMPLE SPACE AND RANDOM VARIABLE FOR DOUBLE DICE TOSS EXPERIMENT
Consider an experiment in which two dice are thrown at the same time, and the outcome is an ordered pair of the numbers that show up on each die's top face after coming to rest. Assume that we are careful to specify and distinguish a "first" die (black die with white spots) from the "second" (white die with black spots). (1) Obtain the sample space of all possible outcomes. (2) Define a random variable X as the sum of the numbers that show up on the two dice; obtain the new space V arising as a result of the application of this random variable.
Solution:
(1) The original sample space, Ω, of the raw outcomes, is given by

Ω = {(1, 1), (1, 2), . . . , (1, 6); (2, 1), (2, 2), . . . ; . . . ; (6, 6)}     (4.10)

a set of the 36 ordered pairs of all possible outcomes (n1, n2), where n1 is the number showing up on the face of the first die, and n2 is the number on the second die. (Had it not been possible to distinguish a "first" die from the "second," outcomes such as (2,1) and (1,2) could not have been distinguished, and Ω would contain only 21 elements: the 6 diagonal elements and one set of the 15 off-diagonal elements of the 6 × 6 matrix of ordered pairs.)
The elements of this set are clearly real numbers, the 2-dimensional kind, and are already amenable to mathematical manipulation. The definition of a random variable X in this case is therefore not for purposes of converting Ω to a more mathematically convenient form; the random variable definition is a reflection of what aspect of the experiment is of interest to us.
(2) By the definition of X, we see that the required space V is:

V = {2, 3, 4, . . . , 12}     (4.11)

a set containing 11 elements, a collection of all the possible values that X can take in this case.

As an exercise (see Exercise 4.7), the reader should compute the probability PX(A) of the event A that X = 7, assuming equiprobable outcomes for each die toss.

4.1.2 Practical Considerations

Rigor and precision are intrinsic to mathematics and mathematical analysis; without the former, the latter simply cannot exist. Such is the case with the mathematical concept of the random variable as we have just presented it: rigor demands that X be specified in this manner, as a function through whose agency each element of the sample space of an experiment becomes associated with an unambiguous numerical value. As illustrated in Fig 4.1, X therefore appears as a mapping from one space, Ω, that can contain all sorts of raw objects, into one that is more conducive to mathematical analysis, V, containing only real numbers. Such a formal definition of the random variable tends to appear stiff, and almost sterile; and those encountering it for the first time may be unsure of what it really means in practice.
As a practical matter, the random variable may be considered (informally) as an experimental outcome whose numerical value is subject to random variations with each exact replicate performance (trial) of the experiment. Thus, for example, with the three coin-toss experiment discussed earlier, by specifying the outcome of interest as the total number of tails observed, we see right away that the implied random variable can take on numerical values 0, 1, 2, or 3, even though the raw outcomes will consist of T's and H's; also, what value the random variable takes is subject to random variation each time the experiment is performed. In the same manner, we see that in attempting to determine the temperature of an equilibrium mixture of ice and water, the observed temperature measurement in °C takes on numerical values that vary randomly around the number 0.

4.1.3 Types of Random Variables

A random variable can be either discrete or continuous, as determined by the nature of the space V. For a discrete random variable, the space V consists of isolated points, isolated in the sense that, on the real line, every neighborhood of each point contains no other point of V. For instance, in Example 4.1 above, the random variable X can only take values 0, 1, 2, or 3; it is therefore a discrete random variable.
On the other hand, the space V associated with a continuous random variable consists of an interval of the real line, or, in higher dimensions, a set of intervals. For example, let Ω be defined as:

Ω = {ω : −1 ≤ ω ≤ 1}     (4.12)

If we define a random variable X as:

X(ω) = 1 − |ω|,     (4.13)

observe that the random variable space V in this case is given by:

V = {x : 0 ≤ x ≤ 1}.     (4.14)


This is an example of a continuous random variable.
Random variables can also be defined in higher dimensions. For example, given a sample space Ω with a probability set function P(·) defined on its subsets, a two-dimensional random variable is a function defined on Ω which assigns one and only one ordered number pair (X1(ω), X2(ω)) to each element ω ∈ Ω. Associated with this random variable is a space V and a probability set function PX induced by X = (X1, X2), where V is defined as:

V = {(x1, x2) : X1(ω) = x1, X2(ω) = x2; ω ∈ Ω}     (4.15)

The following is a simple example of a two-dimensional random variable.

Example 4.4 A 2-DIMENSIONAL RANDOM VARIABLE AND ITS SAMPLE SPACE
Revisit Example 4.1 and the problem discussed therein involving tossing a coin 3 times and recording the number of observed heads and tails; define the following 2-dimensional random variable: X1 = total number of tails; X2 = total number of heads. Obtain the sample space V associated with this random variable.
Solution:
The required sample space in this case is:

V = {(0, 3), (1, 2), (2, 1), (3, 0)}.     (4.16)

Note that the two component random variables X1 and X2 are not independent since their sum, X1 + X2, by virtue of the experiment, is constrained to equal 3 always.

What is noted briefly here for two dimensions can be generalized to n dimensions, and the next chapter is devoted entirely to a discussion of multi-dimensional random variables.

4.2 Distributions

4.2.1 Discrete Random Variables

Let us return once more to Example 4.1 and, this time, for each element of V, compute P(X = x), and denote this by f(x); i.e.

f(x) = P(X = x)     (4.17)

Observe that P(X = 0) = P(A∗0), where A∗0 = {ω1} is the corresponding pre-image, so that

f(0) = P(X = 0) = 1/8     (4.18)

Similarly, we obtain P(X = 1) = P(A∗1) where A∗1 = {ω2, ω3, ω4}, so that:

f(1) = P(X = 1) = 3/8     (4.19)

Likewise,

f(2) = P(X = 2) = 3/8     (4.20)
f(3) = P(X = 3) = 1/8     (4.21)

This function, f(x), indicates how the probabilities are distributed over the entire random variable space.
Of importance also is a different, but related, function, F(x), defined as:

F(x) = P(X ≤ x)     (4.22)

the probability that the random variable X takes on values less than or equal to x. For the specific example under consideration, we have: F(0) = P(X ≤ 0) = 1/8. As for F(1) = P(X ≤ 1), since the event A = {X ≤ 1} consists of two mutually exclusive elementary events A0 = {X = 0} and A1 = {X = 1}, it then follows that:

F(1) = P(X ≤ 1) = P(X = 0) + P(X = 1) = 1/8 + 3/8 = 4/8     (4.23)

By similar arguments, we obtain:

F(2) = P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2) = 7/8     (4.24)
F(3) = P(X ≤ 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 8/8     (4.25)

These results are tabulated in Table 4.1.

TABLE 4.1: f(x) and F(x) for the three coin-toss experiment of Example 4.1

x    f(x)    F(x)
0    1/8     1/8
1    3/8     4/8
2    3/8     7/8
3    1/8     8/8
The function, f(x), is referred to as the probability distribution function (pdf), or sometimes as the probability mass function; F(x) is known as the cumulative distribution function, or sometimes simply as the distribution function.
Note, once again, that X can assume only a finite number of discrete values, in this case, 0, 1, 2, or 3; it is therefore a discrete random variable, and both f(x) and F(x) are discrete functions. As shown in Fig 4.2, f(x) is characterized by non-zero spikes at values of x = 0, 1, 2 and 3, and F(x) by the indicated staircase form.

FIGURE 4.2: Probability distribution function, f(x), and cumulative distribution function, F(x), for the 3-coin toss experiment of Example 4.1

Let x0 = 0, x1 = 1, x2 = 2, x3 = 3; then

P(X = xi) = f(xi) for i = 0, 1, 2, 3     (4.26)

with f(xi) given explicitly as:

f(xi) = { 1/8,  x0 = 0
          3/8,  x1 = 1
          3/8,  x2 = 2
          1/8,  x3 = 3 }     (4.27)

and the two functions in Table 4.1 are related explicitly according to the following expression:

F(xi) = ∑_{j=0}^{i} f(xj)     (4.28)

We may now also note the following about the function f(xi):
1. f(xi) > 0 for each xi;
2. ∑_{i=0}^{3} f(xi) = 1.

These ideas may now be generalized beyond the specific example used above.
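Eq (4.28) states that the cdf is simply a running sum of the pdf. The Python fragment below (an added sketch, not part of the original text) reproduces the F(x) column of Table 4.1 from the f(x) column:

    from fractions import Fraction
    from itertools import accumulate

    f = [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]   # f(x) for x = 0, 1, 2, 3

    # Eq (4.28): F(xi) = f(x0) + f(x1) + ... + f(xi)
    F = list(accumulate(f))
    print(F)   # [1/8, 1/2, 7/8, 1], i.e. the 1/8, 4/8, 7/8, 8/8 column of Table 4.1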

Definition: Let there exist a sample space Ω (along with a probability set function, P, defined on its subsets), and a random variable X, with an attendant random variable space V; a function f defined on V such that:
1. f(x) ≥ 0, for all x ∈ V;
2. ∑_{x∈V} f(x) = 1;
3. PX(A) = ∑_{x∈A} f(x), for A ⊂ V (and when A contains the single element xi, PX(X = xi) = f(xi));
is called a probability distribution function of the random variable X.

Upon comparing these formal statements regarding f(x) to the 3 axioms of Kolmogorov (regarding the probability set function P defined on Ω) given earlier in Chapter 3, we readily see that these are the same concepts extended from Ω to V for the random variable X.

4.2.2 Continuous Random Variables

For the continuous random variable X, because it takes on a continuum of values, not discrete points as with the discrete counterpart, the concepts presented above are modified as follows, primarily by replacing sums with integrals:

Definition: The function f defined on the space V (whose elements consist of segments of the real line) such that:
1. f(x) ≥ 0, for all x ∈ V;
2. f has at most a finite number of discontinuities in every finite interval;
3. The (Riemann) integral ∫_V f(x) dx = 1;
4. PX(A) = ∫_A f(x) dx, for A ⊂ V;
is called a probability density function of the continuous random variable X.

(The second point above, unnecessary for the discrete case, is a mathematical fine point needed to safeguard against pathological situations where the probability measure becomes undefined; it is hardly ever an issue in most practical applications.)
In this case, the expression for the cumulative distribution function, F(x), corresponding to that in Eq (4.28), is:

F(xi) = P(X ≤ xi) = ∫_{−∞}^{xi} f(x) dx     (4.29)

from where we may now observe that when F(x) possesses a derivative,

dF(x)/dx = f(x)     (4.30)

This f(x) is the continuous counterpart of the discrete f(x) encountered earlier; but rather than express the probability that X takes on a particular point value xi (as in the discrete case), the continuous f(x) expresses a measure of the probability that X lies in the infinitesimal interval between xi and xi + dx. Observe, from item 4 in the definition given above, that:

P(xi ≤ X ≤ xi + dx) = ∫_{xi}^{xi+dx} f(x) dx ≈ f(xi) dx     (4.31)

for a very small interval size dx.


In general, because the event A = {X ≤ x + dx} can be decomposed into 2 mutually exclusive events B = {X ≤ x} and C = {x ≤ X ≤ x + dx}, so that:

{X ≤ x + dx} = {X ≤ x} ∪ {x ≤ X ≤ x + dx},     (4.32)

we see that:

P(X ≤ x + dx) = P(X ≤ x) + P(x ≤ X ≤ x + dx)
F(x + dx) = F(x) + P(x ≤ X ≤ x + dx)     (4.33)

and therefore:

P(x ≤ X ≤ x + dx) = F(x + dx) − F(x)     (4.34)

which, upon introducing Eq (4.31) for the LHS, dividing by dx, and taking limits as dx → 0, yields:

lim_{dx→0} [F(x + dx) − F(x)]/dx = dF(x)/dx = f(x)     (4.35)

establishing Eq (4.30).
In general, we can use Eq (4.29) to establish that, for any arbitrary b ≥ a,

P(a ≤ X ≤ b) = ∫_a^b f(x) dx = F(b) − F(a).     (4.36)
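Eq (4.36) is straightforward to verify numerically for any specific density. The Python sketch below (an added illustration; the density, the value τ = 2, and the interval are all assumed for the purpose of the check) uses the exponential density f(x) = (1/τ)e^{−x/τ}, which reappears as Eq (4.41) below and whose cdf is F(x) = 1 − e^{−x/τ}, and compares a crude numerical integral of f against F(b) − F(a):

    import math

    tau, a, b = 2.0, 1.0, 3.0                  # assumed parameter and interval
    f = lambda x: math.exp(-x / tau) / tau     # assumed density
    F = lambda x: 1.0 - math.exp(-x / tau)     # its cdf

    # Midpoint-rule approximation of the integral in Eq (4.36)
    n = 10000
    h = (b - a) / n
    integral = sum(f(a + (i + 0.5) * h) for i in range(n)) * h
    print(integral, F(b) - F(a))               # both approximately 0.3834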


For the sake of completeness, we note that F(x), the cumulative distribution function, is actually the more fundamental function for determining probabilities. This is because, regardless of whether X is continuous or discrete, F(·) can be used to determine all desired probabilities. Observe from the foregoing discussion that the expression

P(a1 < X ≤ a2) = F(a2) − F(a1)     (4.37)

will hold true whether X is continuous or discrete.
From now on in this book, we will simply talk of the "probability distribution function" (or pdf for short) for all random variables X (continuous or discrete) and mean by this, "probability distribution function" if X is discrete, and "probability density function" if X is continuous, and expect that the context will make clear what we mean.

4.2.3 The Probability Distribution Function

We have now seen that the pdf f(x) (or equivalently, the cdf F(x)) is the function that indicates how the probabilities of occurrence of various outcomes and events arising from the random phenomenon in question are distributed over the entire space of the associated random variable X.
Let us return once more to the three coin-toss example: we understand that the random phenomenon in question is such that we cannot predict, a priori, the specific outcome of each experiment; but from the ensemble aggregate of all possible outcomes, we have been able to characterize, with f(x), the behavior of an associated random variable of interest, X, the total number of tails obtained in the experiment. (Note that other random variables could also be defined for this experiment: for example, the total number of heads, or the number of tosses until the appearance of the first head, etc.) What Table 4.1 provides is a complete description of the probability of occurrence for the entire collection of all possible events associated with this random variable, a description that can now be used to analyze the particular random phenomenon of the total number of tails observed when a coin is tossed three times.
For instance, the pdf f(x) indicates that, even though we cannot predict a specific outcome precisely, we now know that after each experiment, observing no tails (X = 0) is just as likely as observing all tails (X = 3), each with a probability of 1/8. Also, observing two tails is just as likely as observing one tail, each with a probability of 3/8, so that this latter group of events is three times as likely as the former group of events. Note the symmetry of the distribution of probabilities indicated by f(x) for this particular random phenomenon.
It turns out that these specific results can be generalized for the class of random phenomena to which the three coin-toss example belongs, a class characterized by the following features:


1. each experiment involves n identical trials (e.g., coin tosses, or number of fertilized embryos transferred in an in-vitro fertilization (IVF) treatment cycle, etc.), and each trial can produce only two mutually exclusive outcomes: S ("success") or F ("failure");
2. the probability of "success" in each trial is p; and,
3. the outcome of interest, X, is the total number of "successes" observed (e.g., tails in coin tosses, live births in IVF, etc.).
As we show a bit later, the pdf characterizing this family of random phenomena is given by:

f(x) = n!/(x!(n − x)!) p^x (1 − p)^{n−x};  x = 0, 1, 2, . . . , n     (4.38)

The results in Table 4.1 are obtained for the special case n = 3; p = 0.5.
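As a quick check (an added sketch, not part of the original text), evaluating Eq (4.38) in Python for n = 3 and p = 0.5 reproduces the f(x) column of Table 4.1:

    from math import factorial

    def f(x, n=3, p=0.5):
        # Eq (4.38): f(x) = n!/(x!(n-x)!) p^x (1-p)^(n-x)
        return factorial(n) / (factorial(x) * factorial(n - x)) * p**x * (1 - p)**(n - x)

    print([f(x) for x in range(4)])   # [0.125, 0.375, 0.375, 0.125], i.e. 1/8, 3/8, 3/8, 1/8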
Such functions as these provide convenient and compact mathematical representations of the desired ensemble behavior of random variables; they constitute the centerpiece of the probabilistic framework, the fundamental tool used for analyzing random phenomena.
We have, in fact, already encountered in earlier chapters several actual pdfs for some real-world random variables. For example, we had stated in Chapter 1 (thus far without justification) that the continuous random variable representing the yield obtained from the example manufacturing processes has the pdf:

f(x) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)};  −∞ < x < ∞     (4.39)

We are able to use this pdf to compute the probabilities of obtaining yields in various intervals on the real line for the two contemplated processes, once the parameters μ and σ are specified for each process.
We had also stated in Chapter 1 that, for the (discrete) random variable X representing the number of inclusions found on the manufactured glass sheet, the pdf is:

f(x) = e^{−λ} λ^x / x!;  x = 0, 1, 2, . . .     (4.40)

from which, again, given a specific value for the parameter λ, we are able to compute the probabilities of finding any given number of inclusions on any selected glass sheet. And in Chapter 2, we showed, using chemical engineering principles, that the pdf for the (continuous) random variable X, representing the residence time in an ideal CSTR, is given by:

f(x) = (1/τ) e^{−x/τ};  0 < x < ∞     (4.41)

an expression that is used in practice for certain aspects of chemical reactor design.


These pdfs are all ideal models of the random variability associated with each of the random variables in question; they make possible rigorous and precise mathematical analyses of the ensemble behavior of the respective random phenomena. Such mathematical representations are systematically derived for actual, specific real-world phenomena of practical importance in Part III, where the resulting pdfs are also discussed and analyzed extensively.
The rest of this chapter is devoted to taking a deeper look at the fundamental characteristics and general properties of the pdf, f(x), for single-dimensional random variables; the next chapter is devoted to a parallel treatment for multi-dimensional random variables.

4.3 Mathematical Expectation

We begin our investigations into the fundamental characteristics of a random variable, X, and its pdf, f(x), with one of the most important: the mathematical expectation or "expected value." As will soon become clear, the concept of expectations of random variables (or functions of random variables) is of significant practical importance; but before giving a formal definition, we first provide a motivation and an illustration of the concept.

4.3.1 Motivating the Definition

Consider a game where each turn involves a player drawing a ball at random from a black velvet bag containing 9 balls, identical in every way except that 5 are red, 3 are blue and one is green. The player receives $1.00 for drawing a red ball, $4.00 for a blue ball, and $10.00 for the green ball, but each turn at the game costs $4.00 to play. The question is: Is this game worth playing?
The primary issue, of course, is the random variation in the color of the drawn ball each time the game is played. Even though simple and somewhat artificial, this example provides a perfect illustration of how to solve problems involving random phenomena using the probabilistic framework.
To arrive at a rational decision regarding whether to play this game or not, we proceed as follows, noting first the following characteristics of the phenomenon in question:
Experiment: Draw a ball at random from a bag containing 9 balls composed as given above; note the color of the drawn ball, then replace the ball;
Outcome: The color of the drawn ball: R = Red; B = Blue; G = Green.
Probabilistic Model Development


TABLE 4.2: The pdf f(x) for the ball-drawing game

x     f(x)
1     5/9
4     3/9
10    1/9

From the problem definition, we see that the sample space is given by:

Ω = {R, R, R, R, R, B, B, B, G}     (4.42)

The random variable, X, is clearly the monetary value assigned to the outcome of each draw; i.e., in terms of the formal definition, X assigns the real number 1 to R, 4 to B, and 10 to G. (Informally, we could just as easily say that X is the amount of money received upon each draw.) The random variable space V is therefore given by:

V = {1, 4, 10}     (4.43)

And now, since there is no reason to think otherwise, we assume that each outcome is equally probable, in which case the probability distribution for the random variable X is obtained as follows:

PX(X = 1) = P(R) = 5/9     (4.44)
PX(X = 4) = P(B) = 3/9     (4.45)
PX(X = 10) = P(G) = 1/9     (4.46)

so that f(x), the pdf for this discrete random variable, is as shown in Table 4.2, or, mathematically, as:

f(xi) = { 5/9,  x1 = 1
          3/9,  x2 = 4
          1/9,  x3 = 10
          0,    otherwise }     (4.47)

This is an ideal model of the random phenomenon underlying this game; it will now be used to analyze the problem and to decide rationally whether to play the game or not.
Using the Model
We begin by observing that this is a case where it is possible to repeat the experiment a large number of times; in fact, this is precisely what the person setting up the game wants each player to do: play the game repeatedly! Thus, if the game is played a very large number of times, say n, it is reasonable from the model to expect 5n/9 red ball draws, 3n/9 blue ball draws, and n/9 green ball draws; the corresponding financial returns will be $(1 × 5n/9), $(4 × 3n/9), and $(10 × n/9), respectively, in each case.
Observe now that after n turns at the game, we would expect the total financial returns in dollars, say Rn, to be:

Rn = (1 × 5n/9) + (4 × 3n/9) + (10 × n/9) = 3n     (4.48)

These results are summarized in Table 4.3.

TABLE 4.3: Summary analysis for the ball-drawing game

Ball     Expected # of times      Financial returns   Expected financial returns
Color    drawn (after n trials)   per draw            (after n trials)
Red      5n/9                     $1                  $5n/9
Blue     3n/9                     $4                  $12n/9
Green    n/9                      $10                 $10n/9
Total                                                 $3n

In the meantime, the total cost Cn, the amount of money, in dollars, paid out to play the game, would have been 4n. On the basis of these calculations, therefore, the expected net gain (in dollars) after n trials, Gn, is given by

Gn = Rn − Cn = −n     (4.49)

indicating a net loss of $n, so that the rational decision is not to play the game. (The house always wins!)
Eq (4.48) implies that the expected return per draw will be:

Rn/n = (1 × 5/9) + (4 × 3/9) + (10 × 1/9) = 3,     (4.50)

a sum of all possible values of the random variable X, weighted by their corresponding probabilities, i.e., from Eq (4.47):

Rn/n = ∑_{i=1}^{3} xi f(xi)     (4.51)

This quantity is known as the expected value, or the mathematical expectation, of the random variable X: a weighted average of the values taken by X, with the respective probabilities of obtaining each value as the weights.
We are now in a position to provide a formal definition of the mathematical expectation.

4.3.2 Definition and Properties

The expected value, or mathematical expectation, of a random variable, denoted by E(X), is defined for a discrete random variable as:

E(X) = Σᵢ xᵢ f(xᵢ)    (4.52)

and for a continuous random variable,

E(X) = ∫_−∞^∞ x f(x) dx    (4.53)

provided that the following conditions hold:

Σᵢ |xᵢ| f(xᵢ) < ∞    (4.54)

(known as absolute convergence) for discrete X, and

∫_−∞^∞ |x| f(x) dx < ∞    (4.55)

(absolute integrability) for continuous X. If these conditions are not satisfied, then E(X) does not exist for X.
If the pdf is interpreted as an assignment of weights to point values of a
discrete random variable, or intervals of a continuous random variable, then
observe that E(X) is that value of the random variable that is the center of
gravity of the distribution.
Some important points to note about the mathematical expectation:
1. E(X) is not a random variable; it is an exactly defined real number;

2. When X has units, E(X) has the same units as X;

3. E(X) is often called the mean value of the random variable (or equivalently, of its distribution f(x)), represented as μ(X) or simply μ; thus:

E(X) = μ(X)    (4.56)

Example 4.5 EXPECTED VALUE OF TWO DISCRETE RANDOM VARIABLES
(1) Find the expected value of the random variable, X, the total number of tails observed in the three coin-toss experiment, whose pdf f(x) is given in Table 4.1 and in Eq (4.27).
(2) Find the expected value of the random variable, X, the financial returns on the ball-draw game, whose pdf f(x) is given in Eq (4.47).
Solution:
(1) From the definition of E(X), we have in this case,

E(X) = (0 × 1/8 + 1 × 3/8 + 2 × 3/8 + 3 × 1/8) = 1.5    (4.57)

indicating that with this experiment, the expected, or average, number of tails per three-toss trial is 1.5, which makes perfect sense.
(2) The expected financial return for the ball-draw game is obtained formally from Eq (4.47) as:

E(X) = (1 × 5/9 + 4 × 3/9 + 10 × 1/9) = 3.0    (4.58)

as we had obtained earlier.
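Both computations are one-liners once the pdf is stored as value/probability pairs; the following minimal Python sketch (an illustration, not from the text) applies the defining sum of Eq (4.52) to the two pdfs of this example.

    # Sketch: E(X) = sum of x*f(x) over the random variable space.
    coin_pdf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}   # tails in three tosses (Table 4.1)
    ball_pdf = {1: 5/9, 4: 3/9, 10: 1/9}          # ball-draw payoffs (Eq 4.47)

    def expectation(pdf):
        return sum(x * p for x, p in pdf.items())

    print(expectation(coin_pdf))   # 1.5
    print(expectation(ball_pdf))   # 3.0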


Example 4.6 EXPECTED VALUE OF TWO CONTINUOUS RANDOM VARIABLES
(1) Find the expected value of the random variable, X, whose pdf f(x) is given by:

f(x) = { x/2,  0 < x < 2;   0, otherwise }    (4.59)

(2) Find the expected value of the random variable, X, the residence time in a CSTR, whose pdf f(x) is given in Eq (4.41).
Solution:
(1) First, we observe that Eq (4.59) is a legitimate pdf because

∫₀² f(x) dx = ∫₀² (1/2)x dx = (1/4)x² |₀² = 1    (4.60)

and, by definition,

E(X) = ∫₀² x f(x) dx = (1/2) ∫₀² x² dx = (1/6)x³ |₀² = 4/3    (4.61)
(2) In the case of the residence time,

E(X) = ∫₀^∞ x (1/τ) e^(−x/τ) dx = (1/τ) ∫₀^∞ x e^(−x/τ) dx    (4.62)

since the random variable X, residence time, takes no negative values. Upon integrating the RHS by parts, we obtain,

E(X) = −x e^(−x/τ) |₀^∞ + ∫₀^∞ e^(−x/τ) dx = −τ e^(−x/τ) |₀^∞ = τ    (4.63)

indicating that the expected, or average, residence time is the reactor parameter τ, providing justification for why this parameter is known in chemical reactor design as the mean residence time.
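Both integrals are easily verified numerically; the sketch below (assuming scipy is available, and using an illustrative value τ = 30) evaluates Eq (4.53) for the two pdfs of this example.

    import numpy as np
    from scipy.integrate import quad

    # (1) f(x) = x/2 on 0 < x < 2
    E1, _ = quad(lambda x: x * (x / 2), 0, 2)
    print(E1)   # 1.3333... = 4/3

    # (2) f(x) = (1/tau) exp(-x/tau) on x > 0; tau = 30 is an assumed value
    tau = 30.0
    E2, _ = quad(lambda x: x * np.exp(-x / tau) / tau, 0, np.inf)
    print(E2)   # 30.0 = tau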

An important property of the mathematical expectation of a random variable X is that for any function of this random variable, say G(X),

E[G(X)] = { Σᵢ G(xᵢ) f(xᵢ),  for discrete X;   ∫_−∞^∞ G(x) f(x) dx,  for continuous X }    (4.64)

provided that the conditions of absolute convergence and absolute integrability stated earlier for X in Eqs (4.54) and (4.55), respectively, hold for G(X).


In particular, if G(X) is a linear function, say for example,

G(X) = c₁X + c₂    (4.65)

where c₁ and c₂ are constants, then from Eq (4.64) above, in the discrete case, we have that:

E(c₁X + c₂) = Σᵢ (c₁xᵢ + c₂) f(xᵢ) = c₁ Σᵢ xᵢ f(xᵢ) + c₂ Σᵢ f(xᵢ) = c₁E(X) + c₂    (4.66)

so that:

E(c₁X + c₂) = c₁E(X) + c₂    (4.67)

Thus, treated like an operator, E(·) is a linear operator. Similar arguments follow for the continuous case, replacing sums with appropriate integrals (see end-of-chapter Exercise 4.12).

4.4 Characterizing Distributions

One of the primary utilities of the result in Eq (4.64) is for obtaining certain useful characteristics of the pdf f(x) by investigating the expectations of special cases of G(X).

4.4.1 Moments of a Distribution

Consider first the case where G(X) in Eq (4.64) is given as:

G(X) = Xᵏ    (4.68)

for any integer k. The expectation of this function is known as the kth (ordinary) moment of the random variable X (or, equivalently, the kth (ordinary) moment of the pdf, f(x)), defined by:

mₖ = E[Xᵏ]    (4.69)

First (Ordinary) Moment: Mean

Observe that m₀ = 1 always, for all random variables X, and, provided that E[|X|ᵏ] < ∞, the other k moments exist; in particular, the first moment

m₁ = E(X) = μ    (4.70)

Thus, the expected value of X, E(X), is also the same as the first (ordinary) moment of X (or, equivalently, of the pdf f(x)).


Central Moments

Next, consider the case where G(X) in Eq (4.64) is given as:

G(X) = (X − a)ᵏ    (4.71)

for any constant value a and integer k. The expectation of this function is known as the kth moment of the random variable X about the point a (or, equivalently, the kth moment of the pdf, f(x), about the point a). Of particular interest are the moments about the mean value μ, defined by:

μₖ = E[(X − μ)ᵏ]    (4.72)

known as the central moments of the random variable X (or of the pdf, f(x)). Observe from here that μ₀ = 1 and μ₁ = 0, always, regardless of X or μ; these therefore provide no particularly useful information regarding the characteristics of any particular X. However, provided that the conditions of absolute convergence and absolute integrability hold, the higher central moments exist and do in fact provide very useful information about the random variable X and its distribution.
Second Central Moment: Variance

Observe from above that the quantity

μ₂ = E[(X − μ)²]    (4.73)

is the lowest central moment of the random variable X that contains any meaningful information about the average deviation of a random variable from its mean value. It is called the variance of X and is sometimes represented as σ²(X). Thus,

μ₂ = E[(X − μ)²] = Var(X) = σ²(X)    (4.74)

Note that

σ²(X) = E[(X − μ)²] = E(X² − 2μX + μ²)    (4.75)

so that by the linearity of the E[·] operator, we obtain:

σ²(X) = E(X²) − μ² = E(X²) − [E(X)]²    (4.76)

or, in terms of the ordinary moments,

σ²(X) = m₂ − m₁²    (4.77)

It is easy to verify the following important properties of Var(X):

1. For constant b,

Var(b) = 0    (4.78)

2. For constants a and b,

Var(aX + b) = a² Var(X)    (4.79)

The positive square root of μ₂ is called the standard deviation of X, and is naturally represented by σ; it has the same units as X. The ratio of the standard deviation to the mean value of a random variable, known as the coefficient of variation, Cᵥ, i.e.,

Cᵥ = σ/μ    (4.80)

provides a dimensionless measure of the relative amount of variability displayed by the random variable.
Third Central Moment: Skewness

The third central moment,

μ₃ = E[(X − μ)³]    (4.81)

is called the skewness of the random variable; it provides information about the relative difference that exists between negative and positive deviations from the mean. It is therefore a measure of asymmetry. The dimensionless quantity

γ₃ = μ₃/σ³    (4.82)

known as the coefficient of skewness, is often the more commonly used measure, precisely because it is dimensionless. For a perfectly symmetric distribution, negative deviations from the mean exactly counterbalance positive deviations, and both μ₃ and γ₃ vanish.

When there are more values of X to the left of the mean than to the right (i.e., when negative deviations from the mean dominate), μ₃ < 0 (as is γ₃), and the distribution is said to "skew left" or is negatively skewed. Such distributions will have long left tails, as illustrated in Fig 4.3. An example random variable with this characteristic is the gasoline-mileage (in miles per gallon) of cars in the US. While many cars get relatively high gas-mileage, there remain a few classes of cars (SUVs, Hummers, etc.) with gas-mileage much worse than the ensemble average. It is this latter class that contributes to the long left tail.
On the other hand, when there are more values of X to the right of the mean than to the left, so that positive deviations from the mean dominate, both μ₃ and γ₃ are positive, and the distribution is said to "skew right" or is positively skewed. As one would expect, such distributions will have long right tails (see Fig 4.4). An example of this class of random variables is the household income/net-worth in the US. While the vast majority of household incomes/net-worths are moderate, the few truly super-rich, whose incomes/net-worths are a few orders of magnitude larger than the ensemble average, contribute to the long right tail.


FIGURE 4.3: Distribution of a negatively skewed random variable

FIGURE 4.4: Distribution of a positively skewed random variable


FIGURE 4.5: Distributions with reference kurtosis (solid line) and mild kurtosis (dashed line)



Fourth Central Moment: Kurtosis

The fourth central moment,

μ₄ = E[(X − μ)⁴]    (4.83)

is called the kurtosis of the random variable. Sometimes, it is the dimensionless version,

γ₄ = μ₄/σ⁴,    (4.84)

technically known as the coefficient of kurtosis, that is simply called the kurtosis. Either quantity is a measure of how peaked or flat a probability distribution is. A high-kurtosis random variable has a distribution with a sharper peak and thicker tails; the low-kurtosis random variable, on the other hand, has a distribution with a more rounded, flatter peak, with broader shoulders. For reasons discussed later, the value γ₄ = 3 is the accepted "normal" reference for kurtosis, so that distributions for which γ₄ < 3 are said to be platykurtic (mildly peaked) while those for which γ₄ > 3 are said to be leptokurtic (sharply peaked). Figures 4.5 and 4.6 show a reference distribution with kurtosis γ₄ = 3, in the solid lines, compared to a distribution with mild kurtosis (actually γ₄ = 1.8) (dashed line in Fig 4.5), and a distribution with high kurtosis (dashed line in Fig 4.6).
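These four characteristics are straightforward to compute for any discrete pdf; the following Python sketch (an illustration, not from the text) applies the definitions above to the ball-draw pdf of Eq (4.47), whose long right tail at x = 10 shows up as a positive coefficient of skewness.

    # Sketch: mean, variance, skewness and kurtosis of a discrete pdf.
    pdf = {1: 5/9, 4: 3/9, 10: 1/9}

    mu  = sum(x * p for x, p in pdf.items())             # first moment (Eq 4.70)
    mu2 = sum((x - mu)**2 * p for x, p in pdf.items())   # variance (Eq 4.73)
    mu3 = sum((x - mu)**3 * p for x, p in pdf.items())   # skewness (Eq 4.81)
    mu4 = sum((x - mu)**4 * p for x, p in pdf.items())   # kurtosis (Eq 4.83)
    sigma = mu2 ** 0.5

    print(mu, mu2)          # 3.0, 8.0
    print(mu3 / sigma**3)   # gamma_3 = 1.50 > 0: positively skewed
    print(mu4 / sigma**4)   # gamma_4, the coefficient of kurtosis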
Practical Applications


FIGURE 4.6: Distributions with reference kurtosis (solid line) and high kurtosis (dashed line)

Of course, it is possible to compute as many moments (ordinary or central) of a distribution as we wish, and we shall shortly present a general expression from which one can generate all such moments; but the four specifically singled out above have been the most useful, in practice, for characterizing random variables and their distributions. They tell us much about the random variable we are dealing with.
The first (ordinary) moment, m₁ or μ, tells us about the location of the "center of gravity" (centroid) of the random variable, its mean value; and, as we show later, it is a popular candidate for the single value most representative of the ensemble. The second central moment, μ₂ or σ², the variance, tells us how tightly clustered or broadly dispersed the random variable is around its mean. The third central moment, μ₃, the skewness, tells us whether lower extreme values of the random variable are farther to the left of the centroid (the ensemble average) than the higher extreme values are to the right (as is the case with automobile gas-mileage in the US), or vice versa, with higher extreme values significantly farther to the right of the centroid than the lower extreme values (as is the case with household incomes/net worth in the US).

Just as the third central moment tells us how much of the average deviation from the mean is due to infrequent extreme values, the fourth central moment, μ₄ (the kurtosis), tells us how much of the variance is due to infrequent extreme deviations. With sharper peaks and thicker tails, extreme values in the tails contribute more to the variance, and the kurtosis is high (as in Fig 4.6); with flatter peaks and very little in terms of tails (as in Fig 4.5), there will be more contributions to the variance from central values, which naturally show modest deviations from the mean, and very little contribution from the extreme values; the kurtosis will therefore be lower.


Finally, we note that moments of a random variable are not merely interesting theoretical characteristics; they have significant practical applications. For example, polymers, being macromolecules with non-uniform molecular weights (because random events occurring during the manufacturing process ensure that polymer molecules grow to varying sizes), are primarily characterized by their molecular weight distributions (MWDs). Not surprisingly, therefore, the performance of a polymeric material depends critically on its MWD: for instance, with most elastomers, a narrow distribution (very low second central moment) is associated with poor processing but superior mechanical properties.

MWDs are so important in polymer chemistry and engineering that a wide variety of analytical techniques have been developed for experimental determination of the MWD and the following special molecular weight averages that are in common use:

1. Mn, the number average molecular weight, is the ratio of the first (ordinary) moment to the zeroth ordinary moment. (In polymer applications, the MWD, unlike a pdf f(x), is not normalized to sum or integrate to 1. The zeroth moment of the MWD is therefore not 1; it is the total number of molecules present in the sample of interest.)

2. Mw, the weight average molecular weight, is the ratio of the second moment to the first moment; and

3. Mz, the so-called z average molecular weight, is the ratio of the third moment to the second.

One other important practical characteristic of the polymeric material is its polydispersity index, PDI, the ratio of Mw to Mn. A measure of the breadth of the MWD, it is always > 1; it is approximately 2 for most linear polymers, and for highly branched polymers it can be as high as 20 or even higher.
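Because these averages are simple moment ratios, they are easy to compute from any tabulated MWD. The Python sketch below uses a small hypothetical distribution (the molecular weights and counts are invented purely for illustration) to show the mechanics.

    import numpy as np

    # Sketch: molecular weight averages from an unnormalized MWD.
    # M[i] = molecular weight of species i; n[i] = number of molecules (assumed data).
    M = np.array([1.0e4, 5.0e4, 1.0e5, 5.0e5])
    n = np.array([400.0, 300.0, 200.0, 100.0])

    def moment(k):
        return np.sum(n * M**k)   # k-th (unnormalized) moment of the MWD

    Mn = moment(1) / moment(0)    # number average
    Mw = moment(2) / moment(1)    # weight average
    Mz = moment(3) / moment(2)    # z average
    print(Mn, Mw, Mz, Mw / Mn)    # the last ratio is the PDI, > 1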
What is true of polymers is also true of particulate products such as granulated sugar, or fertilizer granules sold in bags. These products are made up of particles of non-uniform sizes and are characterized by their particle size distributions. The behavior of these products, whether it is their flow characteristics or how they dissolve in solution, is determined by the moments of these distributions.

4.4.2 Moment Generating Function

When G(X) in Eq (4.64) is given as:

G(X) = e^(tX)    (4.85)

the expectation of this function, when it exists, is the function:

M_X(t) = { Σᵢ e^(txᵢ) f(xᵢ),  for discrete X;   ∫_−∞^∞ e^(tx) f(x) dx,  for continuous X }    (4.86)


a function of the real-valued variable, t, known as the moment generating function (MGF) of X. M_X(t) is so called because all the (ordinary) moments of X can be generated from it as follows:

By definition,

M_X(t) = E[e^(tX)]    (4.87)

and by differentiating with respect to t, we obtain,

M′_X(t) = (d/dt) E[e^(tX)] = E[(d/dt) e^(tX)] = E[X e^(tX)]    (4.88)

(The indicated swapping of the order of the differentiation and expectation operators is allowed under conditions that essentially imply the existence of the moments.) From here we easily obtain, for t = 0, that:

M′_X(0) = E(X) = m₁    (4.89)

the first (ordinary) moment. Similarly, by differentiating once more, we obtain:

M″_X(t) = (d/dt) E[X e^(tX)] = E[X² e^(tX)]    (4.90)

so that, for t = 0,

M″_X(0) = E[X²] = m₂    (4.91)

and in general, after n such differentiations, we obtain

M_X^(n)(0) = E[Xⁿ] = mₙ    (4.92)

Now, it is also possible to establish this result by considering the following Taylor series expansion about the point t = 0:

e^(tX) = 1 + Xt + (X²/2!)t² + (X³/3!)t³ + ···    (4.93)

Clearly, this infinite series converges only under certain conditions. For those random variables, X, for which the series does not converge, M_X(t) does not exist; but when it exists, this series converges, and by repeated differentiation of Eq (4.93) with respect to t, followed by taking expectations, we are then able to establish the result in Eq (4.92).
The following are some important properties of the MGF.

1. Uniqueness: The MGF, M_X(t), does not exist for all random variables, X; but when it exists, it uniquely determines the distribution, so that if two random variables have the same MGF, they have the same distribution. Conversely, random variables with different MGFs have different distributions.


2. Linear Transformations: If two random variables Y and X are related according to the linear expression:

Y = aX + b    (4.94)

for constant a and b, then:

M_Y(t) = e^(bt) M_X(at)    (4.95)

3. Independent Sums: For independent random variables X and Y with respective MGFs M_X(t) and M_Y(t), the MGF of their sum Z = X + Y is:

M_Z(t) = M_(X+Y)(t) = M_X(t) M_Y(t)    (4.96)
Example 4.7 MOMENT GENERATING FUNCTION OF A CONTINUOUS RANDOM VARIABLE
Find the MGF, M_X(t), for the random variable, X, the residence time in a CSTR, whose pdf is given in Eq (4.41).
Solution:
In this case, the required M_X(t) is given by:

M_X(t) = E[e^(tX)] = ∫₀^∞ e^(tx) (1/τ) e^(−x/τ) dx = (1/τ) ∫₀^∞ e^(−(1−τt)x/τ) dx    (4.97)

Upon integrating the RHS appropriately, we obtain,

M_X(t) = [−1/(1 − τt)] e^(−(1−τt)x/τ) |₀^∞    (4.98)
       = 1/(1 − τt)    (4.99)

From here, one easily obtains: m₁ = τ; m₂ = 2τ²; . . . ; mₖ = k!τᵏ.
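These moments can be recovered mechanically from Eq (4.99) by repeated differentiation at t = 0, as prescribed by Eq (4.92); the following sketch does so symbolically (assuming sympy is available).

    import sympy as sym

    # Sketch: generate moments from the MGF M(t) = 1/(1 - tau*t).
    t, tau = sym.symbols('t tau', positive=True)
    M = 1 / (1 - tau * t)

    for k in (1, 2, 3):
        m_k = sym.diff(M, t, k).subs(t, 0)
        print(k, sym.simplify(m_k))   # tau, 2*tau**2, 6*tau**3: m_k = k! * tau**k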

4.4.3 Characteristic Function

As alluded to above, the MGF does not exist for all random variables, a fact that sometimes limits its usefulness. However, a similarly defined function, the characteristic function, shares all the properties of the MGF but does not suffer from this primary limitation: it exists for all random variables.

When G(X) in Eq (4.64) is given as:

G(X) = e^(jtX)    (4.100)

where j is the complex variable √(−1), then the function of the real-valued variable t defined as,

φ_X(t) = E[e^(jtX)]    (4.101)


i.e.,

φ_X(t) = { Σᵢ e^(jtxᵢ) f(xᵢ),  for discrete X;   ∫_−∞^∞ e^(jtx) f(x) dx,  for continuous X }    (4.102)

is known as the characteristic function (CF) of the random variable X.


Because of the denition of the complex exponential, whereby
ejtX = cos(tX) + j sin(tX)
observe that

(4.103)

 jtX 
e  = cos2 (tX) + sin2 (tX) = 1

(4.104)
 jtX 


= 1 < , always, regardless of X, with the direct impliso that E e
cation that X (t) = E(ejtX ) always exists for all random variables. Thus,
anything one would have typically used the MGF for (e.g., for deriving limit
theorems in advanced courses in probability), one can always substitute the
CF when the MGF does not exist.
The reader familiar with Laplace transforms and Fourier transforms will probably have noticed the similarities between the former and the MGF (see Eq (4.86)), and between the latter and the CF (see Eq (4.102)). Furthermore, the relationship between these two probability functions is also reminiscent of the relationship between the two transforms: not all functions have Laplace transforms; the Fourier transform, on the other hand, does not suffer such limitations.

We now state, without proof, that given the expression for the characteristic function in Eq (4.102), there is a corresponding inversion formula whereby f(x) is recovered from φ_X(t), given as follows:

f(x) = { lim_(b→∞) (1/2b) ∫_−b^b e^(−jtx) φ_X(t) dt,  for discrete X;   (1/2π) ∫_−∞^∞ e^(−jtx) φ_X(t) dt,  for continuous X }    (4.105)

In fact, the two sets of equations, Eqs (4.102) and (4.105), are formal Fourier transform pairs, precisely as in other engineering applications of the theory of Fourier transforms. These transform pairs are extremely useful in obtaining the pdfs of functions of random variables, most especially sums of random variables. As with classic engineering applications of the Fourier (and Laplace) transforms, the characteristic functions of the functions of independent random variables in question are obtained first, being easier to obtain directly than the pdfs; the inversion formula is subsequently invoked to recover the desired pdfs. This strategy is employed at appropriate places in upcoming chapters.

4.4.4 Additional Distributional Characteristics

Apart from the mean, variance and other higher moments noted above, there are other characteristic attributes of importance.

FIGURE 4.7: The pdf of a continuous random variable X with a mode at x* = 1


Mode

The mode, x*, of a distribution is that value of the random variable for which the pdf achieves a (local) maximum. For a discrete random variable, it is the value of X that possesses the maximum probability (the "most popular" value); i.e.,

arg max_x {P(X = x)} = x*    (4.106)

For a continuous random variable with a differentiable pdf, it is the value of x for which

df(x)/dx = 0;   d²f(x)/dx² < 0    (4.107)

as shown in Fig 4.7. A pdf having only one such maximum value is said to be unimodal; if more than one such maximum value exists, the distribution is said to be multimodal.
Median

The median of a distribution is that mid-point value xₘ for which the cumulative distribution is exactly 1/2; i.e.,

F(xₘ) = P(X ≤ xₘ) = P(X ≥ xₘ) = 0.5    (4.108)

For a continuous random variable, xₘ is the value for which

∫_−∞^(xₘ) f(x) dx = ∫_(xₘ)^∞ f(x) dx = 0.5    (4.109)

(For the discrete random variable, replace the integrals above with appropriate sums.)


FIGURE 4.8: The cdf of a continuous random variable X showing the lower and upper quartiles and the median
Observe therefore that the median, xₘ, divides the total range of the random variable into two parts with equal probability.
For a symmetric unimodal distribution, the mean, mode and median coincide; they are different for asymmetric (skewed) distributions.
Quartiles

The concept of a median, which divides the cdf at the 50% point, can be extended to other values indicative of other fractional "sectionings off" of the cdf. Thus, by referring to the median as x0.5, or x50, we are able to define, in the same spirit, the following values of the random variable, x0.25 and x0.75 (or, in terms of percentages, x25 and x75, respectively) as follows:

F(x0.25) = 0.25    (4.110)

that value of X below which a quarter of the population resides; and

F(x0.75) = 0.75    (4.111)

the value of X below which lies three quarters of the population. These values are known respectively as the lower and upper quartiles of the distribution because, along with the median x0.5, these values divide the population into four quarters, each part with equal probability.
These concepts are illustrated in Fig 4.8, where the lower quartile is located at x = 1.02, the median at x = 1.58, and the upper quartile at x = 2.14. Thus, for this particular example, P(X < 1.02) = 0.25; P(1.02 < X < 1.58) = 0.25; P(1.58 < X < 2.14) = 0.25; and P(X > 2.14) = 0.25.


There is nothing restricting us to dividing the population in halves (median) or in quarters (quartiles); in general, for any 0 < q < 1, the qth quantile is defined as that value x_q of the random variable for which

F(x_q) = ∫_−∞^(x_q) f(x) dx = q    (4.112)

for a continuous random variable (with the integral replaced by the appropriate sum for the discrete random variable).
This quantity is sometimes defined instead in terms of percentiles, in which case the qth quantile is simply the 100qth percentile. Thus, the median is equivalently the half quantile, the 50th percentile, or the second quartile.
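For distributions whose cdf can be inverted in closed form, quantiles follow immediately from Eq (4.112). For the CSTR residence time, for instance, F(x) = 1 − e^(−x/τ) (see Eq (4.129) below), so that x_q = −τ ln(1 − q); the sketch below evaluates the three quartiles for an illustrative τ = 30.

    import math

    # Sketch: quantiles of the residence-time distribution via the inverse cdf.
    tau = 30.0   # assumed mean residence time, for illustration
    for q in (0.25, 0.50, 0.75):
        x_q = -tau * math.log(1.0 - q)
        print(q, round(x_q, 2))   # lower quartile, median, upper quartile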

4.4.5 Entropy

A concept to be explored more completely in Chapter 10 is concerned with quantifying the information content contained in the statement, "X = x," i.e., that the (discrete) random variable X has been observed to take on the specific value x. Whatever this information content is, it will clearly be related to the pdf, f(x); in fact, it has been shown to be defined as:

I[f(x)] = −log₂ f(x)    (4.113)

Now, when G(X) in Eq (4.64) is defined as:

G(X) = −log₂ f(x)    (4.114)

then the expectation in this case is the function H(x), defined as:

H(x) = { −Σᵢ f(xᵢ) log₂ f(xᵢ),  for discrete X;   −∫_−∞^∞ f(x) log₂ f(x) dx,  for continuous X }    (4.115)

known as the entropy of the random variable, or its mean information content. Chapter 10 explores how to use the concept of information and entropy to develop appropriate probability models for practical problems in science and engineering.
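For a discrete pdf, Eq (4.115) is a one-line computation; the sketch below (an illustration, not from the text) evaluates the entropy of the ball-draw pdf of Eq (4.47) and, for comparison, of an equiprobable three-outcome pdf, which has the larger mean information content.

    import math

    # Sketch: entropy of a discrete pdf, in bits (Eq 4.115).
    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([5/9, 3/9, 1/9]))    # about 1.35 bits
    print(entropy([1/3, 1/3, 1/3]))    # log2(3) = 1.58 bits, the maximum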

4.4.6 Probability Bounds

We now know that the pdf f(x) of a random variable contains all the information about it to enable us to compute the probabilities of occurrence of various outcomes of interest. As valuable as this is, there are times when all we need are bounds on probabilities, not exact values. We now discuss some of the most important results regarding bounds on probabilities that can be determined for any general random variable, X, without specific reference to any particular pdf. These results are very useful in analyzing the behavior of random phenomena and have practical implications in determining values of unknown population parameters.

We begin with a general lemma from which we then derive two important results.

Lemma: Given a random variable X (with a pdf f(x)), and G(X), a function of this random variable such that G(X) > 0, then for an arbitrary constant, c > 0,

P(G(X) ≥ c) ≤ E[G(X)]/c    (4.116)

There are several different ways of proving this result; one of the most direct is shown below.

Proof: By definition,

E[G(X)] = ∫_−∞^∞ G(x) f(x) dx    (4.117)

If we now divide the real line −∞ < x < ∞ into two mutually exclusive regions, A = {x : G(x) ≥ c} and B = {x : G(x) < c}, i.e., A is that region on the real line where G(x) ≥ c, and B is what is left, then Eq (4.117) becomes:

E[G(X)] = ∫_A G(x) f(x) dx + ∫_B G(x) f(x) dx    (4.118)

and since G(X) is non-negative, the second integral is ≥ 0, so that

E[G(X)] ≥ ∫_A G(x) f(x) dx ≥ ∫_A c f(x) dx    (4.119)

where the last inequality arises because, for all x ∈ A (the region over which we are integrating), G(x) ≥ c; with the net result that:

E[G(X)] ≥ cP(G(X) ≥ c)    (4.120)

because the last integral is, by definition, cP(A). From here, we now obtain

P[G(X) ≥ c] ≤ E[G(X)]/c    (4.121)

as required.


This remarkable result holds for all random variables, X, and for any non-negative function of the random variable, G(X). Two specific cases of G(X) give rise to results of special interest.
Markov's Inequality

When G(X) = X, Eq (4.116) immediately becomes:

P(X ≥ c) ≤ E(X)/c    (4.122)

a result known as Markov's inequality. It allows us to place bounds on probabilities when only the mean value of a random variable is known. For example, if the average number of inclusions on glass sheets manufactured at a specific site is known to be 2, then according to Markov's inequality, the probability of finding a glass sheet containing 5 or more inclusions at this manufacturing site can never exceed 2/5. Thus, if glass sheets containing 5 or more inclusions are considered unsaleable, without reference to any specific probability model of the random phenomenon in question, the plant manager concerned about making unsaleable product can, by appealing to Markov's inequality, be sure that things will never be worse than 2 in 5 unsaleable products.

It is truly remarkable, of course, that such statements can be made at all; but in fact, this inequality is actually quite conservative. As one would expect, with an appropriate probability model, one can be even more precise. (Table 2.1 in Chapter 2 in fact shows that the actual probability of obtaining 5 or more inclusions on glass sheets manufactured at this site is 0.053, nowhere close to the upper limit of 0.4 given by Markov's inequality.)
Chebychev's Inequality

Now let G(X) = (X − μ)², and c = k²σ², where μ is the mean value of X, and σ² is the variance, i.e., σ² = E[(X − μ)²]. In this case, Eq (4.116) becomes

P[(X − μ)² ≥ k²σ²] ≤ 1/k²    (4.123)

which may be simplified to:

P(|X − μ| ≥ kσ) ≤ 1/k²,    (4.124)

a result known as Chebychev's inequality. The implication is that 1/k² is an upper bound for the probability that any random variable will take on values that deviate from the mean by more than k standard deviations. This is still a rather weak inequality in the sense that in most cases, the indicated probability is far less than 1/k². Nevertheless, the added information of a known σ helps sharpen the bounds a bit, when compared to Markov's inequality. For example, if we now add to the glass sheets inclusions information the fact that the variance is 2 (so that σ = √2), then the desired probability P(X ≥ 5) now translates to P(|X − 2| ≥ 3), since μ = 2. In this case, therefore, kσ = 3, and from Chebychev's inequality, we obtain:

P(|X − 2| ≥ 3) ≤ σ²/3² = 2/9    (4.125)

an upper bound which, even though still conservative, is nevertheless much sharper than the 2/5 obtained earlier from Markov's inequality.
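The sketch below collects the two bounds and the actual probability for the glass-sheet inclusions example; the "actual" figure assumes, consistent with the 0.053 quoted from Table 2.1, a Poisson model with mean 2 (and hence variance 2).

    import math

    # Sketch: Markov and Chebychev bounds vs. the actual P(X >= 5).
    mu, var, c = 2.0, 2.0, 5.0
    markov    = mu / c                       # Eq (4.122): 0.4
    chebychev = var / (c - mu) ** 2          # Eq (4.124) with k*sigma = 3: 2/9
    actual    = 1 - sum(math.exp(-mu) * mu**k / math.factorial(k)
                        for k in range(5))   # assumes a Poisson(2) model
    print(markov, chebychev, round(actual, 3))   # 0.4, 0.222..., 0.053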
Chebychev's inequality plays a significant role in Chapter 8 in establishing a fundamental result relating relative frequencies in repeatable experiments to the probabilities of occurrence of events.

4.5 Special Derived Probability Functions

In studying phenomena involving lifetimes (of humans and other living organisms, or equipment, or, for that matter, social movements), or more generally in studying the elapsed time until the occurrence of specific events (studies that encompass the related problem of reliability of equipment and systems), the application of probability theory obviously still involves the use of the pdf f(x) and the cdf F(x), but in specialized forms unique to such problems. The following is a discussion of special probability functions, derived from f(x) and F(x), that have been customized for such applications. As a result, these special probability functions are exclusively for random variables that are (a) continuous, and (b) non-negative; they do not exist for random variables that do not satisfy these conditions.

4.5.1 Survival Function

The survival function, S(x), is the probability that the random variable X exceeds the specific value x; in lifetime applications, this translates to the probability that the object of study survives beyond the value x, i.e.,

S(x) = P(X > x)    (4.126)

From the definition of the cdf, F(x), we see immediately that

S(x) = 1 − F(x)    (4.127)

so that where F(x) is a monotonically increasing function of x that starts at 0 and ends at 1, S(x) is the exact mirror image, monotonically decreasing from 1 to 0.
Example 4.8 SURVIVAL FUNCTION OF A CONTINUOUS RANDOM VARIABLE
Find the survival function, S(x), for the random variable, X, the residence time in a CSTR, whose pdf is given in Eq (4.41). This function directly provides the probability that any particular dye molecule survives in the CSTR beyond a time x.
Solution:
Observe first that this random variable is continuous and non-negative, so that the desired S(x) does in fact exist. The required S(x) is given by

S(x) = ∫ₓ^∞ (1/τ) e^(−x/τ) dx = e^(−x/τ)    (4.128)

We could equally well have arrived at the result by noting that the cdf F(x) for this random variable is given by:

F(x) = 1 − e^(−x/τ)    (4.129)

Note from Eq (4.128) that with increasing x (residence time), survival becomes smaller; i.e., the probability of still finding a dye molecule in the reactor after a time x has elapsed diminishes exponentially with x.

4.5.2 Hazard Function

In reliability and life-testing studies, it is useful to have a means of directly computing the probability of failure in the intervals beyond the current time, x, for entities that have survived thus far; i.e., probabilities of failure conditioned on survival until x. The hazard function, h(x), defined as follows:

h(x) = f(x)/S(x) = f(x)/[1 − F(x)]    (4.130)

provides just such a function. It does for future failure what f(x) does for lifetimes in general. Recall that, by definition, because X is continuous, f(x) provides the (unconditional) probability of a lifetime in the infinitesimal interval {xᵢ < X < xᵢ + dx} as f(xᵢ)dx; in the same manner, the probability of failure occurring in that same interval, given that the object of study survived until the beginning of the current time interval, xᵢ, is given by h(xᵢ)dx. In general,

h(x)dx = f(x)dx/S(x) = P(x < X < x + dx)/P(X > x)    (4.131)

so that, from the definition of conditional probability given in Chapter 3, h(x)dx is seen as equivalent to P(x < X < x + dx | X > x). h(x) is therefore sometimes referred to as the "death rate" or "failure rate" at x of those surviving until x (i.e., of those "at risk" at x); it describes how the risk of failure changes with age.
Example 4.9 HAZARD FUNCTION OF A CONTINUOUS RANDOM VARIABLE
Find the hazard function, h(x), for the random variable, X, the residence time in a CSTR.
Solution:
From the given pdf and the survival function obtained in Example 4.8 above, the required function h(x) is given by,

h(x) = [(1/τ) e^(−x/τ)] / e^(−x/τ) = 1/τ    (4.132)

a constant, with the interesting implication that the probability that a dye molecule exits the reactor immediately after time x, given that it had stayed in the reactor until then, is independent of x. Thus molecules that have survived in the reactor until x have the same chance of exiting the reactor immediately after this time as the chance of exiting at any other time in the future; no more, no less. Such a random variable is said to be "memoryless"; how long it lasts beyond the current time does not depend on its current age.

4.5.3 Cumulative Hazard Function

Analogous to the cdf, F(x), the cumulative hazard function, H(x), is defined as:

H(x) = ∫₀ˣ h(u) du    (4.133)

It can be shown that H(x) is related to the more well-known F(x) according to

F(x) = 1 − e^(−H(x))    (4.134)

and that the relationship between S(x) and H(x) is given by:

S(x) = e^(−H(x))    (4.135)

or, conversely,

H(x) = −log[S(x)]    (4.136)
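These relationships are easy to confirm for the CSTR residence time, for which h(x) = 1/τ (Example 4.9) and hence H(x) = x/τ; the short sketch below (with an illustrative τ = 30) checks Eqs (4.135) and (4.136) numerically.

    import numpy as np

    # Sketch: verify S(x) = exp(-H(x)) and H(x) = -log S(x) for the CSTR.
    tau = 30.0
    x = np.linspace(0.0, 150.0, 6)
    H = x / tau                          # cumulative hazard, Eq (4.133)
    S = np.exp(-x / tau)                 # survival function, Eq (4.128)
    print(np.allclose(S, np.exp(-H)))    # True
    print(np.allclose(H, -np.log(S)))    # True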

4.6 Summary and Conclusions

We are now in a position to look back at this chapter and observe, with some perspective, how the introduction of the seemingly innocuous random variable, X, has profoundly affected the analysis of randomly varying phenomena, in a manner analogous to how the introduction of the "unknown quantity," x, transformed algebra and the solution of algebraic problems. We have seen how the random variable, X, maps the sometimes awkward and tedious sample space, Ω, into a space of real numbers; how this in turn leads to the emergence of f(x), the probability distribution function (pdf); and how f(x) has essentially supplanted and replaced the probability set function, P(A), the probability analysis tool in place at the end of Chapter 3.
The full significance of the role of f(x) in random phenomena analysis may not be completely obvious now, but it will become more so as we progress in our studies. So far, we have used it to characterize the random variable in terms of its mathematical expectation, and the expectation of various other functions of the random variable. And this has led, among other things, to our first encounter with the mean, variance, skewness and kurtosis of a random variable, important descriptors of data that we are sure to encounter again later (in Chapter 12 and beyond).

Despite initial appearances, every single topic discussed in this chapter finds useful application in later chapters. In the meantime, we have taken pains to try and breathe some practical life into many of these typically dry and formal definitions and mathematical functions. But if some, especially the moment generating function, the characteristic function, and entropy, still appear to be of dubious practical consequence, such lingering doubts will be dispelled completely by Chapters 6, 8, 9 and 10. Similarly, the probability bounds (especially Chebychev's inequality) will be employed in Chapter 8, and the special functions of Section 4.5 will be used extensively in their more natural setting in Chapter 23.

The task of building an efficient machinery for random phenomena analysis, which began in Chapter 3, is now almost complete. But before the generic pdf, f(x), introduced and characterized in this chapter, begins to take on specific, distinct personalities for various random phenomena, some residual issues remain to be addressed in order to complete the development of the probability machinery. Specifically, the discussion in this chapter will be extended to higher dimensions in Chapter 5, and the characteristics of functions of random variables will be explored in Chapter 6. Chapter 7 is devoted to two application case studies that put the complete set of discussions in Part II in perspective.
Here are some of the main points of the chapter again.

• Formally, the random variable, X (discrete or continuous), assigns to each element ω ∈ Ω one and only one real number, X(ω) = x, thereby mapping Ω onto a new space, V; informally, it is an experimental outcome whose numerical value is subject to random variations with each exact replicate trial of the experiment.

• The introduction of the random variable, X, leads directly to the emergence of f(x), the probability distribution function; it represents how the probabilities of occurrence of all the possible outcomes of the random experiment of interest are distributed over the entire random variable space, and is a direct extension of P(A).


• The cumulative distribution function (cdf), F(x), is P(X ≤ x); if discrete, F(xᵢ) = Σ_(j=0)^i f(xⱼ); if continuous, F(x) = ∫_−∞ˣ f(x)dx, so that, if differentiable, dF(x)/dx = f(x).
= f (x).

The mathematical expectation of a random variable, E(X), is dened


as;
 
discrete;
i xi f (xi );
E(X) =
x(f
(x)dx;
continuous


It exists only when i |xi |f (xi ) < (absolute convergence for discrete

random variables) or |x|(f (x)dx < (absolute integrability for


continuous random variables).

E[G(X)] provides various characterizations of the random variables, X,


for various functions G(X):
G(X) = (X )k yields the k th moment of X;
G(X) = etX and G(X) = ejtX respectively yield the moment generating function (MGF), and the characteristic function (CF), of
X;
G(X) = log2 f (x) yields the entropy of X.
The mean, indicates the central location or center of gravity of the
random variable while the variance, skewness and kurtosis indicate the
shape of the distribution in relation to the mean. Additional characterization is provided by the mode, where the distribution is maximum
and by the median, which divides the distribution into two equal probability halves; the quartiles, which divide the distribution into four equal
probability quarters, or more generally, the percentiles, which divide the
distribution into 100 equal probability portions.
• Lifetimes and related phenomena are more conveniently studied with special probability functions, which include:
  – The survival function, S(x), the probability that X exceeds the value x; by definition, it is related to F(x) according to S(x) = 1 − F(x);
  – The hazard function, h(x), which does for future failure probabilities what f(x) does for lifetime probabilities; and
  – The cumulative hazard function, H(x), which is to the hazard function, h(x), what the cdf, F(x), is to the pdf, f(x).


REVIEW QUESTIONS
1. Why is the raw sample space, Ω, often tedious to describe and inefficient to analyze mathematically?
2. Through what means is the general sample space converted into a space with real numbers?
3. Formally, what is a random variable?
4. What two mathematical transformations occur as a consequence of the formal introduction of the random variable, X?
5. How is the induced probability set function, P_X, related to the probability set function, P, defined on Ω?
6. What is the pre-image of the set A?
7. What is the relationship between the random variable, X, and the associated real number, x? What does the expression P(X = x) indicate?
8. When does the sample space, Ω, naturally occur in the form of the random variable space, V?
9. Informally, what is a random variable?
10. What is the difference between a discrete random variable and a continuous one?
11. What is the pdf, f(x), and what does it represent for the random variable, X?
12. What is the relationship between the pdf, f(xᵢ), and the cdf, F(xᵢ), for a discrete random variable, X?
13. What is the relationship between the pdf, f(x), and the cdf, F(x), for a continuous random variable, X?
14. Define mathematically the expected value, E(X), for a discrete random variable and for a continuous one.
15. What conditions must be satisfied for E(X) to exist?
16. Is E(X) a random variable, and does it have units?
17. What is the relationship between the expected value, E(X), and the mean value, μ, of a random variable (or equivalently, of its distribution)?
18. Distinguish between ordinary moments and central moments of a random variable.
19. What are the common names by which the second, third and fourth central moments of a random variable are known?
20. What is Cᵥ, the coefficient of variation of a random variable?
21. What is the distinguishing characteristic of a skewed distribution (positive or negative)?
22. Give an example each of a negatively skewed and a positively skewed randomly varying phenomenon.
23. What do the mean, variance, skewness, and kurtosis tell us about the distribution of the random variable in question?
24. What do Mn, Mw, and Mz represent for a polymer material?
25. What is the polydispersity index of a polymer and what does it indicate about the molecular weight distribution?
26. Define the moment generating function (MGF) of a random variable, X. Why is it called by this name?
27. What is the uniqueness property of the MGF?
28. Define the characteristic function of a random variable, X. What distinguishes it from the MGF?
29. How are the MGF and characteristic function (CF) of a random variable related to the Laplace and Fourier transforms?
30. Define the mode, median, quartiles and percentiles of a random variable.
31. Within the context of this chapter, what is entropy?
32. Define Markov's inequality. It allows us to place probability bounds when what is known about the random variable?
33. Define Chebychev's inequality.
34. Which probability bound is sharper, the one provided by Markov's inequality or the one provided by Chebychev's?
35. What are the defining characteristics of those random variables for which the special probability functions, the survival and hazard functions, are applicable? These functions are used predominantly in studying what types of phenomena?
36. Define the survival function, S(x). How is it related to the cdf, F(x)?
37. Define the hazard function, h(x). How is it related to the pdf, f(x)?
38. Define the cumulative hazard function, H(x). How is it related to the cdf, F(x), and the survival function, S(x)?

EXERCISES
Section 4.1
4.1 Consider a family that plans to have a total of three children; assuming that they will not have any twins, generate the sample space, Ω, for the possible outcomes. By defining the random variable X as the total number of female children born to this family, obtain the corresponding random variable space, V. Given that this particular family is genetically predisposed to having boys, with a probability p = 0.75 of giving birth to a boy, obtain the probability that this family will have three boys and compare it to the probability of having other combinations.
4.2 Revisit Example 4.1 in the text; this time, instead of tossing a coin three times, it is tossed 4 times. Generate the sample space, Ω; and, using the same definition of X as the total number of tails, obtain the random variable space, V, and compute anew the probability of A, the event that X = 2.
4.3 Given the spaces Ω and V for the double dice toss experiment in Example 4.3 in the text,
(i) Compute the probability of the event A that X = 7;
(ii) If B is the event that X = 6, and C the event that X = 10 or X = 11, compute P(B) and P(C).
Section 4.2
4.4 Revisit Example 4.3 in the text on the double dice toss experiment and obtain
the complete pdf f (x) for the entire random variable space. Also obtain the cdf,
F (x). Plot both distribution functions.
4.5 Given the following probability distribution function for a discrete random variable, X:

    x      1      2      3      4      5
    f(x)   0.10   0.25   0.30   0.25   0.10

(i) Obtain the cdf, F(x).
(ii) Obtain P(X ≤ 3); P(X < 3); P(X > 3); P(2 ≤ X ≤ 4).
4.6 A particular discrete random variable, X, has the cdf

F(x) = (x/n)ᵏ;  x = 1, 2, . . . , n    (4.137)

where k and n are constants characteristic of the underlying random phenomenon. Determine f(x), the pdf for this random variable, and, for the specific values k = 2, n = 8, compute and plot f(x) and F(x).


4.7 The random variable, X, has the following pdf:

f(x) = { cx,  0 < x < 1;   0, otherwise }    (4.138)

(i) First obtain the value of the constant, c, required for this to be a legitimate pdf, and then obtain an expression for the cdf, F(x).
(ii) Obtain P(X ≤ 1/2) and P(X ≥ 1/2).
(iii) Obtain the value xₘ such that

P(X ≤ xₘ) = P(X ≥ xₘ)    (4.139)
4.8 The distribution of residence times in an ideal CSTR is given in Eq (4.41). Determine, for a reactor with average residence time τ = 30 mins, the probability that a reactant molecule (i) spends less than 30 mins in the reactor; (ii) spends more than 30 mins in the reactor; (iii) spends less than (30 ln 2) mins in the reactor; and (iv) spends more than (30 ln 2) mins in the reactor.
Section 4.3
4.9 Determine E(X) for the discrete random variable in Exercise 4.5 and for the continuous random variable in Exercise 4.7; and establish that E(X) for the residence time distribution in Eq (4.41) is τ, thereby justifying why this parameter is known as the mean residence time.
4.10 (Adapted from Stirzaker, 2003¹) Show that E(X) exists for the discrete random variable, X, with the pdf:

f(x) = 4/[x(x + 1)(x + 2)];  x = 1, 2, . . .    (4.140)

while E(X) does not exist for the discrete random variable with the pdf

f(x) = 1/[x(x + 1)];  x = 1, 2, . . .    (4.141)
4.11 Establish that E(X) = 1/p for a random variable X whose pdf is

f(x) = p(1 − p)^(x−1);  x = 1, 2, 3, . . .    (4.142)

by differentiating with respect to p both sides of the expression:

Σ_(x=1)^∞ p(1 − p)^(x−1) = 1    (4.143)

4.12 From the definition of the mathematical expectation function, E(·), establish that for the random variable, X, discrete or continuous:

E[k₁g₁(X) + k₂g₂(X)] = k₁E[g₁(X)] + k₂E[g₂(X)],    (4.144)

and that, given E(X) = μ,

E[(X − μ)³] = E(X³) − 3μσ² − μ³    (4.145)

where σ² is the variance, defined by σ² = Var(X) = E[(X − μ)²].

¹D. Stirzaker (2003). Elementary Probability, 2nd Ed., Cambridge University Press, p. 120.



Section 4.4
4.13 For two random variables X and Y, and a third random variable defined as

Z = X − Y    (4.146)

show, from the definition of the expectation function, that regardless of whether the random variables are continuous or discrete,

E(Z) = E(X) − E(Y);  i.e., μ_Z = μ_X − μ_Y    (4.147)

and that

Var(Z) = Var(X) + Var(Y)    (4.148)

when E[(X − μ_X)(Y − μ_Y)] = 0 (i.e., when X and Y are independent; see Chapter 5).
4.14 Given that the pdf of a certain discrete random variable X is:

f(x) = λˣ e^(−λ)/x!;  x = 0, 1, 2, . . .    (4.149)

establish the following results:

Σ_(x=0)^∞ f(x) = 1    (4.150)
E(X) = λ    (4.151)
Var(X) = λ    (4.152)

4.15 Obtain the variance and skewness of the discrete random variable in Exercise 4.5 and of the continuous random variable in Exercise 4.7. Which random variable's distribution is skewed, and which is symmetric?
4.16 From the formal definitions of the moment generating function, establish Eqs (4.95) and (4.96).
4.17 Given the pdf for the residence time for two identical CSTRs in series as

f(x) = (1/τ²) x e^(−x/τ)    (4.153)

(i) obtain the MGF for this pdf and compare it with that derived in Example 4.7 in the text. From this comparison, what would you conjecture to be the MGF for the distribution of residence times for n identical CSTRs in series?
(ii) Obtain the characteristic function for the pdf in Eq (4.41) for the single CSTR and also for the pdf in Eq (4.153) for two CSTRs. Compare the two characteristic functions and conjecture what the corresponding characteristic function will be for the distribution of residence times for n identical CSTRs in series.
4.18 Given that M(t) is the moment generating function of a random variable, define the psi-function, ψ(t), as:

ψ(t) = ln M(t)    (4.154)

(i) Prove that ψ′(0) = μ and ψ″(0) = σ², where each prime indicates differentiation with respect to t; E(X) = μ is the mean of the random variable, and σ² is the variance, defined by σ² = Var(X) = E[(X − μ)²].
(ii) Given the pdf of a discrete random variable X as:

f(x) = λˣ e^(−λ)/x!;  x = 0, 1, 2, . . .

obtain its ψ(t) function and show, using the results in (i) above, that the mean and variance of this pdf are identical.
4.19 The pdf for the yield data discussed in Chapter 1 was postulated as

f(y) = [1/(σ√(2π))] e^(−(y−μ)²/(2σ²));  −∞ < y < ∞    (4.155)

If we are given that μ is the mean, first establish that the mode is also μ, and then use the fact that the distribution is perfectly symmetric about μ to establish that the median is also μ, hence confirming that for this distribution, the mean, mode and median coincide.
4.20 Given the pdf:

f(x) = (1/π) · 1/(1 + x²);  −∞ < x < ∞    (4.156)

find the mode and the median and show that they coincide. For extra credit: establish that μ = E(X) does not exist.

4.21 Compute the median and the other quartiles for the random variable whose pdf is given as:

f(x) = { x/2,  0 < x < 2;   0, otherwise }    (4.157)
4.22 Given the binary random variable, X, that takes the value 1 with probability p and the value 0 with probability (1 − p), so that its pdf is given by

f(x) = { 1 − p,  x = 0;   p,  x = 1;   0, elsewhere }    (4.158)

obtain an expression for the entropy H(X) and show that it is maximized when p = 0.5, taking on the value H*(X) = 1 at this point.
Section 4.5
4.23 First show that the cumulative hazard function, H(x), for the random variable, X, the residence time in a CSTR, is the linear function,

H(x) = ηx    (4.159)

(where η = 1/τ). Next, for a related random variable, Y, whose cumulative hazard function is given by

H(y) = (ηy)^ζ    (4.160)

where ζ is a constant parameter, show that the corresponding survival function is

S(y) = e^(−(ηy)^ζ)    (4.161)

and from here obtain the pdf, f(y), for this random variable.
4.24 Given the pdf for the residence time for two identical CSTRs in series in Exercise 4.17, Eq (4.153), determine the survival function, S(x), and the hazard function,
h(x). Compare them to the corresponding results obtained for the single CSTR in
Example 4.8 and Example 4.9 in the text.

APPLICATION PROBLEMS
4.25 Before an automobile parts manufacturer takes full delivery of polymer resins made by a supplier in a reactive extrusion process, a sample is processed and the performance is tested for "Toughness." The batch is either accepted (if the processed sample's Toughness equals or exceeds 140 J/m³) or it is rejected. As a result of process and raw material variability, the acceptance/rejection status of each batch varies randomly. If the supplier sends four batches weekly to the parts manufacturer, and each batch is made independently on the extrusion process, so that the ultimate fate of one batch is independent of the fate of any other batch, define X as the random variable representing the number of acceptable batches a week and answer the following questions:
(i) Obtain the sample space, Ω, and the corresponding random variable space, V.
(ii) First, assume equal probability of acceptance and rejection, and obtain the pdf, f(x), for the entire sample space. If, for long-term profitability, it is necessary that at least 3 batches be acceptable per week, what is the probability that the supplier will remain profitable?
4.26 Revisit Problem 4.25 above and consider that, after an extensive process and control system improvement project, the probability of acceptance of a single batch is improved to 0.8; obtain the new pdf, f(x). If the revenue from a single acceptable batch is $20,000, but every rejected batch costs the supplier $8,000 in retrieval and incineration fees, which will be deducted from the revenue, what is the expected net revenue per week under the current circumstances?
4.27 A gas station situated on a back country road has only one gasoline pump and one attendant and, on average, receives λ = 3 (cars/hour). The average rate at which this lone attendant services the cars is μ (cars/hour). It can be shown that the total number of cars at this gas station at any time (i.e., the one currently being served, and those waiting in line to be served) is the random variable X with the following pdf:

f(x) = (1 − λ/μ)(λ/μ)ˣ;  x = 0, 1, 2, . . .    (4.162)

(i) Show that so long as λ < μ, the probability that the line at the gas station is infinitely long is zero.
(ii) Find the value of μ required so that the expected value of the total number of cars at the station is 2.



(iii) Using the value of μ obtained in (ii), find the probability that there are more than two cars at the station, and also the probability that there are no cars.
4.28 The distribution of income of families in the US in 1979 (in actual dollars, uncorrected for inflation) is shown in the table below:

    Income level, x (× $10³)    Percent of population with income level, x
    0–5                          4
    5–10                         13
    10–15                        17
    15–20                        20
    20–25                        16
    25–30                        12
    30–35                        7
    35–40                        4
    40–45                        3
    45–50                        2
    50–55                        1
    > 55                         1
(i) Plot the data histogram and comment on the shape.


(ii) Using the center of the interval to represent each income group, determine the mean, median, and mode, along with the variance and skewness for this data set. Comment on how consistent the numerical values computed for these characteristics are with the shape of the histogram.
(iii) If the 1979 population is broadly classified according to income into "Lower Class" for the income range (in thousands of dollars) 0–15, "Middle Class" for the income range 15–50, and "Upper Class" for the income range > 50, what is the probability that two people selected at random and sequentially to participate in a survey from the Census Bureau (in preparation for the 1980 census) are (a) both from the "Lower Class," (b) both from the "Middle Class," (c) one from the "Middle Class" and one from the "Upper Class," and (d) both from the "Upper Class"?
(iv) If, in 1979, engineers with at least 3 years of college education (excluding graduate students) constituted approximately 1% of the population (2.2 million out of 223 million) and spanned the income range from 20–55, determine the probability that an individual selected at random from the population is in the middle class given that he/she is an engineer. Determine the converse: that the person selected at random is an engineer given that he/she is in the middle class.
4.29 Life-testing results on a first-generation microprocessor-based (computer-controlled) toaster indicate that X, the life-span (in years) of the central control chip, is a random variable that is reasonably well-modeled by the pdf:

f(x) = (1/τ) e^(−x/τ);  x > 0    (4.163)

with τ = 6.25. A malfunctioning chip will have to be replaced to restore proper toaster operation.
(i) The warranty for the chip is to be set at x_w years (in whole integers) such that no more than 15% would have to be replaced before the warranty period expires. Find x_w.
(ii) In planning for the second-generation toaster, design engineers wish to set a target value to aim for (τ = τ₂) such that 85% of the second-generation chips survive beyond 3 years. Determine τ₂ and interpret your results in terms of the implied fold increase in mean life-span from the first to the second generation of chips.
4.30 The probability of a single transferred embryo resulting in a live birth in an in-vitro fertilization treatment, p, is given as 0.5 for a younger patient and 0.2 for an older patient. When n = 5 embryos are transferred in a single treatment, it is also known that if X is the total number of live births resulting from this treatment, then E(X) = 2.5 for the younger patient and E(X) = 1 for the older patient, with the associated variance, Var(X) = 1.25 for the younger and Var(X) = 0.8 for the older.
(i) Use Markov's inequality and Chebyshev's inequality to obtain bounds on the probability of each patient giving birth to quadruplets or quintuplets at the end of the treatment.
(ii) These bounds are known to be quite conservative, but to determine just how conservative, compute the actual probabilities of the stated events for each patient given that an appropriate pdf for X is

    f(x) = [5!/(x!(5 − x)!)] p^x (1 − p)^(5−x)                    (4.164)

where p is as given above. Compare the actual probabilities with the Markov and Chebyshev bounds and identify which bound is sharper.
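A sketch for automating the comparison in part (ii): here "quadruplets or quintuplets" is taken as the event X ≥ 4, so Markov's inequality gives P(X ≥ 4) ≤ E(X)/4, while Chebyshev's inequality gives P(X − μ ≥ 4 − μ) ≤ P(|X − μ| ≥ 4 − μ) ≤ Var(X)/(4 − μ)². The exact probabilities come from the binomial pdf of Eq (4.164):

    from scipy.stats import binom

    for label, p, mu, var in [("younger", 0.5, 2.5, 1.25), ("older", 0.2, 1.0, 0.8)]:
        markov = mu / 4                      # P(X >= 4) <= E(X)/4
        chebyshev = var / (4 - mu) ** 2      # Chebyshev bound on P(|X - mu| >= 4 - mu)
        exact = 1 - binom.cdf(3, 5, p)       # P(X >= 4): quadruplets or quintuplets
        print(label, round(markov, 3), round(chebyshev, 3), round(exact, 3))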
4.31 The following data table, obtained from the United States Life Tables 1969-71 (published in 1973 by the National Center for Health Statistics), shows the probability of survival until the age of 65 for individuals of the given age².

    Age, y    Prob of survival to age 65
    0              0.72
    10             0.74
    20             0.74
    30             0.75
    35             0.76
    40             0.77
    45             0.79
    50             0.81
    55             0.85
    60             0.90

The data should be interpreted as follows: the probability that all newborns, and children up to the age of ten, survive until 65 years of age is 0.72; for those older than 10 and up to 20 years, the probability of survival to 65 years is 0.74, and so on. Assuming that the data is still valid in 1975, a community cooperative wishes to set up a life insurance program that year whereby each participant pays a relatively small annual premium and, in the event of death before 65 years, a one-time death gratuity payment is made to the participant's designated beneficiary. If the participant survives beyond 65 years, nothing is paid. If the cooperative is to realize a fixed, modest expected revenue, $R_E = $30, per year, per participant, over the duration of his/her participation (mostly to cover administrative and other costs), provide answers to the following questions:
(i) For a policy based on a fixed annual premium of $90 for all participants, and an age-dependent payout, determine values of the published payout, as a function of y, for a person of age y that dies before age 65, for all values of y listed in this table.
(ii) For a policy based instead on a fixed death payout of $8,000, and age-dependent annual premiums, determine values of the published annual premium, as a function of y, to be collected each year from a participant of age y.
(iii) If it becomes necessary to increase the expected revenue by 50% as a result of increased administrative and overhead costs, determine the effect on each of the policies in (i) and (ii) above.
(iv) If by 1990, the probabilities of survival have increased across the board by 0.05, determine the effect on each of the policies in (i) and (ii).

² More up-to-date versions, available, for example, in National Vital Statistics Reports, Vol. 56, No. 9, December 28, 2007, contain far more detailed information.

Chapter 5
Multidimensional Random Variables

5.1 Introduction and Definitions
    5.1.1 Perspectives
    5.1.2 2-Dimensional (Bivariate) Random Variables
    5.1.3 Higher-Dimensional (Multivariate) Random Variables
5.2 Distributions of Several Random Variables
    5.2.1 Joint Distributions
    5.2.2 Marginal Distributions
    5.2.3 Conditional Distributions
    5.2.4 General Extensions
5.3 Distributional Characteristics of Jointly Distributed Random Variables
    5.3.1 Expectations
          Marginal Expectations
          Conditional Expectations
    5.3.2 Covariance and Correlation
    5.3.3 Independence
5.4 Summary and Conclusions
REVIEW QUESTIONS
EXERCISES
APPLICATION PROBLEMS

Servant of God, well done,
well hast thou fought the better fight,
who single hast maintained,
against revolted multitudes the cause of truth,
in word mightier than they in arms.
John Milton (1608-1674)

When the outcome of interest in an experiment is not one, but two or more variables simultaneously, additional issues arise that are not fully addressed by the probability machinery as it stands at the end of the last chapter. The concept of the random variable, restricted as it currently is to the single, one-dimensional random variable X, needs to be extended to higher dimensions; and doing so is the sole objective of this chapter. With the introduction of a few new concepts, new varieties of the probability distribution function (pdf) emerge along with new variations on familiar results; together, they expand and supplement what we already know about random variables and bring to a conclusion the discussion we started in Chapter 4.
5.1 Introduction and Definitions

5.1.1 Perspectives

Consider a clinical study of the additional effects of the Type 2 diabetes drug, Avandia, in which a group of 193 patients with type 2 diabetes who had undergone cardiac bypass surgery were randomly assigned to receive the drug or a placebo. After one year, the researchers reported that patients taking Avandia not only had better blood sugar control, they also showed improved cholesterol levels, fewer signs of inflammation of blood vessels, and lower blood pressure, compared with those on a placebo.¹
Extracting the desired scientific information accurately and efficiently from the clinical study data, of course, relies on many principles of probability, statistical analysis and experimental design, issues that are not of concern at this moment. For purposes of this chapter's discussion, we restrict our attention to the basic (but central) fact that for each patient in the study, the result of interest involves not one but several variables simultaneously, including: (i) blood sugar level, (ii) cholesterol levels (more specifically, the low-density lipoprotein, or LDL, version, and the high-density lipoprotein, or HDL, version), and (iii) blood pressure (more specifically, the systolic and the diastolic pressures).
This is a real-life example of an experiment whose outcome is intrinsically
multivariate, consisting of several distinct variables, each subject to random
variability. As it currently stands, the probability machinery of Chapter 4 is
only capable of dealing with one single random variable at a time. As such,
we are only able to use it to characterize the variability inherent in each of the
variables of interest one at a time. This raises some important questions that
we did not have to contend with when dealing with single random variables:
1. Do these physiological variables vary jointly or separately? For example,
do patients with high LDL cholesterol levels tend to have high systolic
blood pressures also, or do the levels of one have nothing in common
with the levels of the other?
2. If there is even the remotest possibility that one variable "interacts" with another, can we deal with each variable by itself, as if the others do not exist, without incurring serious errors?
3. If we accept that until proven otherwise these variables should be considered jointly, how should such joint variabilities be represented?
4. What other aspects of the joint behavior of inter-related random variables provide useful means of characterizing jointly varying random variables?

¹ "Avandia May Slow Atherosclerosis After Bypass Surgery," by Steven Reinberg, US News and World Report, April 1, 2008.


These questions indicate that what we know about random variables from Chapter 4 must be extended appropriately to enable us to deal with the new class of issues that arise when multiple random variables must be considered simultaneously.
The logical place to begin, of course, is with a 2-dimensional (or bivariate) random variable, before extending the discussion to the general case with n > 2 variables.

5.1.2 2-Dimensional (Bivariate) Random Variables

The following is a direct extension of the formal definition of the single random variable given in Chapter 4.

Definition: Bivariate Random Variable.
Given a random experiment with a sample space Ω, and a probability set function P(·) defined on its subsets; let there be a function X, defined on Ω, which assigns to each element ω ∈ Ω one and only one ordered number pair (X1(ω), X2(ω)). This function, X, is called a two-dimensional, or bivariate, random variable.

As with the single random variable case, associated with this twodimensional random variable is a space, V , and a probability set function
PX induced by X = (X1 , X2 ), where V is dened as:
V = {(x1 , x2 ) : X1 () = x1 , X2 () = x2 ; }

(5.1)

The most important point to note at this point is that the random variable
space V involves X1 and X2 simultaneously; it is not merely a union of separate spaces V1 for X1 and V2 for X2 .
An example of a bivariate random variable was presented in Example 4.4
in Chapter 4; here is another.
Example 5.1 BIVARIATE RANDOM VARIABLE AND INDUCED PROBABILITY FUNCTION FOR COIN TOSS EXPERIMENT
Consider an experiment involving tossing a coin 2 times and recording the number of observed heads and tails: (1) Obtain the sample space Ω; and (2) define X as a two-dimensional random variable (X1, X2) where X1 is the number of heads obtained in the first toss, and X2 is the number of heads obtained in the second toss. Obtain the new space V. (3) Assuming equiprobable outcomes, obtain the induced probability PX.
Solution:
(1) From the nature of the experiment, the required sample space, Ω, is given by

    Ω = {HH, HT, TH, TT}                                          (5.2)

consisting of all 4 possible outcomes, which may be represented respectively as ωi; i = 1, 2, 3, 4, so that

    Ω = {ω1, ω2, ω3, ω4}.                                         (5.3)

(2) By definition of X, we see that X(ω1) = (1, 1); X(ω2) = (1, 0); X(ω3) = (0, 1); X(ω4) = (0, 0); so that the space V is given by:

    V = {(1, 1); (1, 0); (0, 1); (0, 0)}                          (5.4)

since these are all the possible values that the two-dimensional X can take.
(3) This is a case where there is a direct one-to-one mapping between the 4 elements of the original sample space Ω and the induced random variable space V; as such, for equiprobable outcomes, we obtain:

    PX(1, 1) = 1/4; PX(1, 0) = 1/4; PX(0, 1) = 1/4; PX(0, 0) = 1/4    (5.5)
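To make the mapping from Ω to V concrete, here is a minimal Python sketch (an illustrative aid, not part of the original example) that enumerates the two-toss sample space and accumulates the induced probability of Eq (5.5):

    from itertools import product
    from fractions import Fraction
    from collections import defaultdict

    # Sample space for two coin tosses: each outcome is equiprobable.
    omega = list(product("HT", repeat=2))        # [('H','H'), ('H','T'), ...]

    # X maps each outcome to (X1, X2): heads on first toss, heads on second toss.
    X = lambda w: (int(w[0] == "H"), int(w[1] == "H"))

    # Induced probability P_X on the new space V.
    PX = defaultdict(Fraction)
    for w in omega:
        PX[X(w)] += Fraction(1, len(omega))

    print(dict(PX))   # {(1, 1): 1/4, (1, 0): 1/4, (0, 1): 1/4, (0, 0): 1/4}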

In making sense of the formal definition given here for the bivariate (2-dimensional) random variable, the reader should keep in mind the practical considerations presented in Chapter 4 for the single random variable. The same issues there apply here. In a practical sense, the bivariate random variable may be considered simply, if informally, as an experimental outcome with two components, each with numerical values that are subject to random variations with exact replicate performance of the experiment.
For example, consider a polymer used for packaging applications, for which the quality measurements of interest are melt index (indicative of the molecular weight distribution) and density (indicative of co-polymer composition). With each performance of lab analysis on samples taken from the manufacturing process, the values obtained for each of these quantities are subject to random variations. Without worrying so much about the original sample space or the induced one, we may consider the packaging polymer quality characteristics directly as the two-dimensional random variable whose components are melt index (as X1) and density (as X2).
We now note that it is fairly common for many textbooks to use X and Y to represent bivariate random variables. We choose to use X1 and X2 because it offers a notational convenience that facilitates generalization to n > 2.

5.1.3 Higher-Dimensional (Multivariate) Random Variables

The foregoing discussion is generalized to n > 2 as follows.

Definition: Multivariate Random Variable.
Given a random experiment with a sample space Ω, and a probability set function P(·) defined on its subsets; let there be a function X, defined on Ω, which assigns one and only one n-tuple (X1(ω), X2(ω), ..., Xn(ω)) to each element ω ∈ Ω. This function, X, is called an n-dimensional random variable.

Similarly, associated with this n-dimensional random variable is a space, V:

    V = {(x1, x2, ..., xn) : X1(ω) = x1, X2(ω) = x2, ..., Xn(ω) = xn; ω ∈ Ω}    (5.6)

and a probability set function PX induced by X.
As a practical matter, we may observe, for example, that in the Avandia study mentioned at the beginning of this chapter, the outcome of interest for each patient is a continuous, 5-dimensional random variable, X, whose components are: X1 = blood sugar level; X2 = LDL cholesterol level; X3 = HDL cholesterol level; X4 = systolic blood pressure; and X5 = diastolic blood pressure. The specific observed values for each patient will be the quintuple measurement (x1, x2, x3, x4, x5).
Everything we have discussed above for the bivariate random variable (n = 2) extends directly for the general n.

5.2 Distributions of Several Random Variables

5.2.1 Joint Distributions

The results of Example 5.1 can be written as:

    f(x1, x2) = 1/4;  x1 = 1, x2 = 1
                1/4;  x1 = 1, x2 = 0
                1/4;  x1 = 0, x2 = 1
                1/4;  x1 = 0, x2 = 0
                0;    otherwise                                   (5.7)

showing how the probabilities are distributed over the 2-dimensional random variable space, V. Once again, we note the following about the function f(x1, x2):

1. f(x1, x2) > 0; ∀ x1, x2
2. ∑_{x1} ∑_{x2} f(x1, x2) = 1

We may now generalize beyond this specific example as follows:

Definition: Joint pdf
Let there exist a sample space Ω (along with a probability set function, P, defined on its subsets), and a random variable X = (X1, X2), with an attendant random variable space V; a function f defined on V such that:
1. f(x1, x2) ≥ 0; ∀ (x1, x2) ∈ V;
2. ∑_{x1} ∑_{x2} f(x1, x2) = 1; (x1, x2) ∈ V;
3. PX(X1 = x1, X2 = x2) = f(x1, x2)
is called the joint probability distribution function of the discrete two-dimensional random variable X = (X1, X2).

These results are direct extensions of the axiomatic statements given earlier for the discrete single random variable pdf.
The probability that both X1 < x1 and X2 < x2 is given by the cumulative distribution function,

    F(x1, x2) = P(X1 < x1, X2 < x2)                               (5.8)

valid for discrete and continuous random variables. When F is a continuous function of both x1 and x2 and possesses first partial derivatives, the two-dimensional function

    f(x1, x2) = ∂²F(x1, x2)/∂x1∂x2                                (5.9)

is called the joint probability density function for the continuous two-dimensional random variables X1 and X2. As with the discrete case, the formal properties of the continuous joint pdf are:

1. f(x1, x2) ≥ 0; ∀ (x1, x2) ∈ V;
2. f has at most a finite number of discontinuities in every finite interval in V;
3. The double integral over the entire space, ∫∫_V f(x1, x2) dx1 dx2 = 1;
4. PX(A) = ∫∫_A f(x1, x2) dx1 dx2; for A ⊂ V

Thus,

    P(a1 ≤ X1 ≤ a2; b1 ≤ X2 ≤ b2) = ∫_{b1}^{b2} ∫_{a1}^{a2} f(x1, x2) dx1 dx2    (5.10)

These results generalize directly to the multidimensional random variable X = (X1, X2, ..., Xn) with a joint pdf f(x1, x2, ..., xn).
Example 5.2 JOINT PROBABILITY DISTRIBUTION OF CONTINUOUS BIVARIATE RANDOM VARIABLE
The reliability of the temperature control system for a commercial, highly exothermic polymer reactor is known to depend on the lifetimes (in years) of the control hardware electronics, X1, and of the control valve on the cooling water line, X2. If one component fails, the entire control system fails. The random phenomenon in question is characterized by the two-dimensional random variable (X1, X2) whose joint probability distribution is given as:

    f(x1, x2) = (1/50) e^(−(0.2x1 + 0.1x2));  0 < x1 < ∞, 0 < x2 < ∞
              = 0;                             elsewhere           (5.11)

(1) Establish that this is a legitimate pdf; (2) obtain the probability that the system lasts more than two years; (3) obtain the probability that the electronic component functions for more than 5 years and the control valve for more than 10 years.
Solution:
(1) If this is a legitimate joint pdf, then the following should hold:

    ∫_0^∞ ∫_0^∞ f(x1, x2) dx1 dx2 = 1                             (5.12)

In this case, we have:

    ∫_0^∞ ∫_0^∞ (1/50) e^(−(0.2x1 + 0.1x2)) dx1 dx2 = (1/50) [−5e^(−0.2x1)]_0^∞ [−10e^(−0.1x2)]_0^∞
                                                    = (1/50)(5)(10) = 1    (5.13)

We therefore conclude that the given joint pdf is legitimate.
(2) For the system to last more than 2 years, both components must simultaneously last more than 2 years. The required probability is therefore given by:

    P(X1 > 2, X2 > 2) = ∫_2^∞ ∫_2^∞ (1/50) e^(−(0.2x1 + 0.1x2)) dx1 dx2    (5.14)

which, upon carrying out the indicated integration and simplifying, reduces to:

    P(X1 > 2, X2 > 2) = e^(−0.4) e^(−0.2) = 0.67 × 0.82 = 0.549   (5.15)

Thus, the probability that the system lasts beyond the first two years is 0.549.
(3) The required probability, P(X1 > 5; X2 > 10), is obtained as:

    P(X1 > 5; X2 > 10) = ∫_10^∞ ∫_5^∞ (1/50) e^(−(0.2x1 + 0.1x2)) dx1 dx2
                       = [∫_5^∞ (1/5) e^(−0.2x1) dx1] [∫_10^∞ (1/10) e^(−0.1x2) dx2]
                       = e^(−1) × e^(−1) = (0.368)² = 0.135       (5.16)
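The integrals in Eqs (5.12)-(5.16) are easy to check numerically. The following is a small sketch using scipy's dblquad (an illustrative verification, not part of the original example):

    from scipy.integrate import dblquad
    import numpy as np

    f = lambda x2, x1: (1/50) * np.exp(-(0.2*x1 + 0.1*x2))   # joint pdf, Eq (5.11)

    # dblquad integrates f(y, x) over y then x; np.inf gives the infinite upper limits.
    total, _ = dblquad(f, 0, np.inf, 0, np.inf)      # should be 1 (legitimacy)
    p_both_2, _ = dblquad(f, 2, np.inf, 2, np.inf)   # P(X1 > 2, X2 > 2)
    p_5_10, _ = dblquad(f, 5, np.inf, 10, np.inf)    # P(X1 > 5, X2 > 10)

    print(round(total, 4), round(p_both_2, 4), round(p_5_10, 4))  # 1.0, 0.5488, 0.1353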

The preceding discussions have established the joint pdf f(x1, x2, ..., xn) as the most direct extension of the single variable pdf f(x) of Chapter 4 to higher-dimensional random variables. However, additional distributions needed to characterize other aspects of multidimensional random variables can be derived from these joint pdfs: distributions that we had no need for in dealing with single random variables. We will discuss these new varieties of distributions first for the 2-dimensional (bivariate) random variable, and then extend the discussion to the general n > 2.

5.2.2 Marginal Distributions

Consider the joint pdf f(x1, x2) for the 2-dimensional random variable (X1, X2); it represents how probabilities are jointly distributed over the entire (X1, X2) plane in the random variable space. Were we to integrate over the entire range of X2 (or sum over the entire range in the discrete case), what is left is the following function of x1 in the continuous case:

    f1(x1) = ∫_{−∞}^{∞} f(x1, x2) dx2                             (5.17)

or, in the discrete case,

    f1(x1) = ∑_{x2} f(x1, x2)                                     (5.18)

This function, f1(x1), characterizes the behavior of X1 alone, by itself, regardless of what is going on with X2.
Observe that, if one wishes to determine P(a1 < X1 < a2) with X2 taking any value, by definition, this probability is determined as:

    P(a1 < X1 < a2) = ∫_{a1}^{a2} [∫_{−∞}^{∞} f(x1, x2) dx2] dx1  (5.19)

But according to Eq (5.17), the term in the brackets is f1(x1); hence:

    P(a1 < X1 < a2) = ∫_{a1}^{a2} f1(x1) dx1                      (5.20)

an expression that is reminiscent of probability computations for single random variable pdfs.


The function in Eq (5.17) is known as the marginal distribution of X1; and by the same token, the marginal distribution of X2, in the continuous case, is given by:

    f2(x2) = ∫_{−∞}^{∞} f(x1, x2) dx1                             (5.21)

obtained by integrating out X1 from the joint pdf of X1 and X2; or, in the discrete case, it is:

    f2(x2) = ∑_{x1} f(x1, x2)                                     (5.22)

These pdfs, f1(x1) and f2(x2), respectively represent the probabilistic characteristics of each random variable X1 and X2 considered in isolation, as opposed to f(x1, x2), which represents the joint probabilistic characteristics when the two are considered together. The formal definitions are given as follows:

Definition: Marginal pdfs
Let X = (X1, X2) be a 2-dimensional random variable with a joint pdf f(x1, x2); the marginal probability distribution functions of X1 alone, and of X2 alone, are defined as the following functions:

    f1(x1) = ∫_{−∞}^{∞} f(x1, x2) dx2                             (5.23)

and

    f2(x2) = ∫_{−∞}^{∞} f(x1, x2) dx1                             (5.24)

for continuous random variables, and, for discrete random variables, as the functions:

    f1(x1) = ∑_{x2} f(x1, x2)                                     (5.25)

and

    f2(x2) = ∑_{x1} f(x1, x2)                                     (5.26)

Each marginal pdf possesses all the usual properties of pdfs, i.e., for continuous random variables:

1. f1(x1) ≥ 0; and f2(x2) ≥ 0
2. ∫_{−∞}^{∞} f1(x1) dx1 = 1; and ∫_{−∞}^{∞} f2(x2) dx2 = 1
3. P(X1 ∈ A) = ∫_A f1(x1) dx1; and P(X2 ∈ A) = ∫_A f2(x2) dx2

with the integrals replaced with sums for the discrete case. An illustrative example follows.
Example 5.3 MARGINAL DISTRIBUTIONS OF CONTINUOUS BIVARIATE RANDOM VARIABLE
Find the marginal distributions of the joint pdf given in Example 5.2 for characterizing the reliability of the commercial polymer reactor's temperature control system. Recall that the component random variables are X1, the lifetime (in years) of the control hardware electronics, and X2, the lifetime of the control valve on the cooling water line; the joint pdf is as given in Eq (5.11):

    f(x1, x2) = (1/50) e^(−(0.2x1 + 0.1x2));  0 < x1 < ∞, 0 < x2 < ∞
              = 0;                             elsewhere

Solution:
For this continuous bivariate random variable, we have from Eq (5.17) that:

    f1(x1) = ∫_0^∞ (1/50) e^(−(0.2x1 + 0.1x2)) dx2 = (1/50) e^(−0.2x1) ∫_0^∞ e^(−0.1x2) dx2 = (1/5) e^(−0.2x1)    (5.27)

Similarly, from Eq (5.21), we have:

    f2(x2) = ∫_0^∞ (1/50) e^(−(0.2x1 + 0.1x2)) dx1 = (1/50) e^(−0.1x2) ∫_0^∞ e^(−0.2x1) dx1 = (1/10) e^(−0.1x2)    (5.28)

As an exercise, the reader should confirm that each of these marginal distributions is a legitimate pdf in its own right.
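One way to carry out this confirmation is symbolically. A minimal sympy sketch, assuming the joint pdf of Eq (5.11), reproduces Eqs (5.27) and (5.28) and checks that each marginal integrates to 1:

    import sympy as sp

    x1, x2 = sp.symbols('x1 x2', positive=True)
    f = sp.Rational(1, 50) * sp.exp(-(sp.Rational(1, 5)*x1 + sp.Rational(1, 10)*x2))

    f1 = sp.integrate(f, (x2, 0, sp.oo))     # marginal of X1
    f2 = sp.integrate(f, (x1, 0, sp.oo))     # marginal of X2

    print(f1)                                  # exp(-x1/5)/5, i.e., Eq (5.27)
    print(f2)                                  # exp(-x2/10)/10, i.e., Eq (5.28)
    print(sp.integrate(f1, (x1, 0, sp.oo)))    # 1: a legitimate pdf
    print(sp.integrate(f2, (x2, 0, sp.oo)))    # 1: a legitimate pdf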
These ideas extend directly to n > 2 random variables whose joint pdf is given by f(x1, x2, ..., xn). There will be n separate marginal distributions fi(xi); i = 1, 2, ..., n, each obtained by integrating (or summing) out every other random variable except the one in question, i.e.,

    f1(x1) = ∫∫...∫ f(x1, x2, ..., xn) dx2 dx3 ... dxn            (5.29)

or, in general,

    fi(xi) = ∫∫...∫ f(x1, x2, ..., xn) dx1 dx2 ... dx_{i−1} dx_{i+1} ... dxn    (5.30)

It is important to note that when n > 2, marginal distributions themselves can be multivariate. For example, f12(x1, x2) is what is left of the joint pdf f(x1, x2, ..., xn) after integrating (or summing) over the remaining (n − 2) variables; it is a bivariate pdf of the two surviving random variables of interest.
The concepts are simple and carry over directly; however, the notation can become quite confusing if one is not careful. We shall return to this point a bit later in this chapter.

5.2.3 Conditional Distributions

If the joint pdf f(x1, x2) of a bivariate random variable provides a description of how the two component random variables vary jointly, and if the marginal distributions f1(x1) and f2(x2) describe how each random variable behaves by itself, in isolation, without regard to the other, there remains yet one more characteristic of importance: a description of how X1 behaves for given specific values of X2, and vice versa (i.e., the probability distribution of X1 conditioned upon X2 taking on specific values, and vice versa). Such conditional distributions are defined as follows:

Definition: Conditional pdfs
Let X = (X1, X2) be a 2-dimensional random variable, discrete or continuous, with a joint pdf, f(x1, x2), along with marginal distributions f1(x1) and f2(x2); the conditional distribution of X1 given that X2 = x2 is defined as:

    f(x1|x2) = f(x1, x2)/f2(x2); f2(x2) > 0                       (5.31)

Similarly, the conditional distribution of X2 given that X1 = x1 is defined as:

    f(x2|x1) = f(x1, x2)/f1(x1); f1(x1) > 0                       (5.32)

The similarity between these equations and the expression for conditional probabilities of events defined as sets, as given in Eq (3.40) of Chapter 3,

    P(A|B) = P(A ∩ B)/P(B)                                        (5.33)

should not be lost on the reader.


In Eq (5.31), the indicated pdf is a function of x1 , with x2 xed; it is a
straightforward exercise to show that this is a legitimate pdf. Observe that in
the continuous case,


f (x1 , x2 )dx1
(5.34)
f (x1 |x2 )dx1 =
f2 (x2 )

the numerator of which is recognized from Eq (5.21) as the marginal distribution of X2 so that:

f2 (x2 )
=1
(5.35)
f (x1 |x2 )dx1 =
f2 (x2 )

The same result holds for f (x2 |x1 ) in Eq (5.32) when integrated with respect
of x2 ; and, by replacing the integrals with sums, we obtain identical results
for the discrete case.
Example 5.4 CONDITIONAL DISTRIBUTIONS OF CONTINUOUS BIVARIATE RANDOM VARIABLE
Find the conditional distributions of the 2-dimensional random variable given in Example 5.2 for the reliability of a temperature control system.
Solution:
Recall from the previous examples that the joint pdf is:

    f(x1, x2) = (1/50) e^(−(0.2x1 + 0.1x2));  0 < x1 < ∞, 0 < x2 < ∞
              = 0;                             elsewhere

Recalling the results obtained in Example 5.3 for the marginal pdfs f1(x1) and f2(x2), the desired conditional pdfs are given as follows:

    f(x1|x2) = [(1/50) e^(−(0.2x1 + 0.1x2))] / [(1/10) e^(−0.1x2)] = (1/5) e^(−0.2x1)    (5.36)

and for the complementary conditional pdf f(x2|x1):

    f(x2|x1) = [(1/50) e^(−(0.2x1 + 0.1x2))] / [(1/5) e^(−0.2x1)] = (1/10) e^(−0.1x2)    (5.37)

The reader may have noticed two things about this specific example: (i) f(x1|x2) is entirely a function of x1 alone, containing no x2 whose value is to be fixed; the same is true for f(x2|x1), which is entirely a function of x2, with no dependence on x1. (ii) In fact, not only is f(x1|x2) a function of x1 alone; it is precisely the same function as the unconditional marginal pdf f1(x1) obtained earlier. The same is obtained for f(x2|x1), which also turns out to be the same as the unconditional marginal pdf f2(x2) obtained earlier. Such circumstances do not always occur for all 2-dimensional random variables, as the next example shows; but the special cases where f(x1|x2) = f1(x1) and f(x2|x1) = f2(x2) are indicative of a special relationship between the two random variables X1 and X2, as discussed later in this chapter.

FIGURE 5.1: Graph of the joint pdf for the 2-dimensional random variable of Example 5.5 (surface plot of f(x1, x2) over the (x1, x2) plane)
Example 5.5 CONDITIONAL DISTRIBUTIONS OF ANOTHER CONTINUOUS BIVARIATE RANDOM VARIABLE
Find the conditional distributions of the 2-dimensional random variable whose joint pdf is given as follows:

    f(x1, x2) = x1 − x2;  1 < x1 < 2, 0 < x2 < 1
              = 0;        elsewhere                               (5.38)

shown graphically in Fig 5.1.
Solution:
To find the conditional distributions, we must first find the marginal distributions. (As an exercise, the reader may want to confirm that this joint pdf is a legitimate pdf.) These marginal distributions are obtained as follows:

    f1(x1) = ∫_0^1 (x1 − x2) dx2 = [x1 x2 − x2²/2]_0^1            (5.39)

which simplifies to give:

    f1(x1) = x1 − 0.5;  1 < x1 < 2
           = 0;         elsewhere                                 (5.40)

Similarly,

    f2(x2) = ∫_1^2 (x1 − x2) dx1 = [x1²/2 − x1 x2]_1^2            (5.41)

which simplifies to give:

    f2(x2) = 1.5 − x2;  0 < x2 < 1
           = 0;         elsewhere                                 (5.42)

Again the reader may want to confirm that these marginal pdfs are legitimate pdfs.
With these marginal pdfs in hand, we can now determine the required conditional distributions as follows:

    f(x1|x2) = (x1 − x2)/(1.5 − x2); 1 < x1 < 2                   (5.43)

and

    f(x2|x1) = (x1 − x2)/(x1 − 0.5); 0 < x2 < 1                   (5.44)

(The reader should be careful to note that we did not explicitly impose the restrictive conditions x2 ≠ 1.5 and x1 ≠ 0.5 in the expressions given above so as to exclude the respective singularity points for f(x1|x2) and for f(x2|x1). This is because the original space over which the joint distribution f(x1, x2) was defined, V = {(x1, x2) : 1 < x1 < 2; 0 < x2 < 1}, already excludes these otherwise troublesome points.)
Observe now that these conditional distributions show mutual dependence of x1 and x2, unlike in Example 5.4. In particular, for x2 = 1 (the rightmost edge of the x2-axis of the plane in Fig 5.1), the conditional pdf f(x1|x2) becomes:

    f(x1|x2 = 1) = 2(x1 − 1); 1 < x1 < 2                          (5.45)

whereas, for x2 = 0 (the leftmost edge of the x2-axis of the plane in Fig 5.1), this conditional pdf becomes:

    f(x1|x2 = 0) = 2x1/3; 1 < x1 < 2                              (5.46)

Similar arguments can be made for f(x2|x1) and are left as an exercise for the reader.
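The computations in this example can also be verified symbolically. A short sympy sketch, assuming the joint pdf of Eq (5.38):

    import sympy as sp

    x1, x2 = sp.symbols('x1 x2')
    f = x1 - x2                              # joint pdf on 1 < x1 < 2, 0 < x2 < 1

    f1 = sp.integrate(f, (x2, 0, 1))         # marginal of X1: x1 - 1/2, Eq (5.40)
    f2 = sp.integrate(f, (x1, 1, 2))         # marginal of X2: 3/2 - x2, Eq (5.42)

    cond_x1_given_x2 = f / f2                # Eq (5.43)
    cond_x2_given_x1 = f / f1                # Eq (5.44)

    print(f1, f2)
    print(sp.simplify(cond_x1_given_x2.subs(x2, 1)))   # 2*x1 - 2, i.e., Eq (5.45)
    print(sp.integrate(f1, (x1, 1, 2)))                # 1: legitimacy check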

The following example provides a comprehensive illustration of these distributions specifically for a discrete bivariate random variable.

Example 5.6 DISTRIBUTIONS OF DISCRETE BIVARIATE RANDOM VARIABLE
An Apple computer store in a small town stocks only three types of hardware components: "low-end," "mid-level" and "high-end," selling respectively for $1600, $2000 and $2400; it also only stocks two types of monitors: the 20-inch type, selling for $600, and the 23-inch type, selling for $900. An analysis of sales records over a 1-year period (the prices remained stable over the entire period) is shown in Table 5.1, indicating what fraction of the total sales is due to a particular hardware component and monitor type. Each recorded sale involves one hardware component and one monitor; X1 is the selling price of the hardware component, X2 the selling price of the accompanying monitor. The indicated frequencies of occurrence of each sale combination can be considered to be representative of the respective probabilities, so that Table 5.1 represents the joint distribution, f(x1, x2).

TABLE 5.1: Joint pdf for computer store sales

    X1 \ X2    $600    $900
    $1600      0.30    0.25
    $2000      0.20    0.10
    $2400      0.10    0.05

(1) Show that f(x1, x2) is a legitimate pdf and find the sales combination (x1, x2) with the highest probability, and the one with the lowest probability.
(2) Obtain the marginal pdfs f1(x1) and f2(x2), and from these compute P(X1 = $2000) regardless of X2 (i.e., the probability of selling a mid-level hardware component regardless of the monitor paired with it). Also obtain P(X2 = $900) regardless of X1 (i.e., the probability of selling a 23-inch monitor, regardless of the hardware component with which it is paired).
(3) Obtain the conditional pdfs f(x1|x2) and f(x2|x1) and determine the highest value for each conditional probability; describe in words what each means.
Solution:
(1) If f(x1, x2) is a legitimate pdf, then it must hold that

    ∑_{x1} ∑_{x2} f(x1, x2) = 1                                   (5.47)

From the joint pdf shown in the table, this amounts to adding up all 6 entries, a simple arithmetic exercise that yields the desired result.
The combination with the highest probability is seen to be X1 = $1600; X2 = $600, since P(X1 = $1600; X2 = $600) = 0.3; i.e., the probability is highest (at 0.3) that any customer chosen at random would have purchased the low-end hardware (for $1600) and the 20-inch monitor (for $600). The lowest probability of 0.05 is associated with X1 = $2400 and X2 = $900, i.e., the combination of a high-end hardware component and a 23-inch monitor.
(2) By definition, the marginal pdf f1(x1) is given by:

    f1(x1) = ∑_{x2} f(x1, x2)                                     (5.48)

so that, from the table, f1(1600) = 0.30 + 0.25 = 0.55; similarly, f1(2000) = 0.30 and f1(2400) = 0.15. In the same manner, the values for f2(x2) are obtained as f2(600) = 0.30 + 0.20 + 0.10 = 0.60, and f2(900) = 0.40. These values are combined with the original joint pdf into a new Table 5.2 to provide a visual representation of the relationship between these distributions.

TABLE 5.2: Joint and marginal pdfs for computer store sales

    X1 \ X2    $600    $900    f1(x1)
    $1600      0.30    0.25    0.55
    $2000      0.20    0.10    0.30
    $2400      0.10    0.05    0.15
    f2(x2)     0.60    0.40    (1.00)

The required probabilities are obtained directly from this table as follows:

    P(X1 = $2000) = f1(2000) = 0.30                               (5.49)
    P(X2 = $900) = f2(900) = 0.40                                 (5.50)

(3) By definition, the desired conditional pdfs are given as follows:

    f(x1|x2) = f(x1, x2)/f2(x2); and f(x2|x1) = f(x1, x2)/f1(x1)  (5.51)

and upon carrying out the indicated divisions using the numbers contained in Table 5.2, we obtain the results shown in Table 5.3 for f(x1|x2), and in Table 5.4 for f(x2|x1).

TABLE 5.3: Conditional pdf f(x1|x2) for computer store sales

    X1           f(x1|x2 = 600)    f(x1|x2 = 900)
    $1600        0.500             0.625
    $2000        0.333             0.250
    $2400        0.167             0.125
    Sum Total    1.000             1.000

TABLE 5.4: Conditional pdf f(x2|x1) for computer store sales

                         $600     $900     Sum Total
    f(x2|x1 = 1600)      0.545    0.455    1.000
    f(x2|x1 = 2000)      0.667    0.333    1.000
    f(x2|x1 = 2400)      0.667    0.333    1.000

From these tables, we obtain the highest conditional probability for f(x1|x2) as 0.625, corresponding to the probability of a customer buying the low-end hardware component (X1 = $1600) conditioned upon having bought the 23-inch monitor (X2 = $900); i.e., in the entire population of those who bought the 23-inch monitor, the probability is highest, at 0.625, that a low-end hardware component was purchased to go along with the monitor. When the conditioning variable is the hardware component, the highest conditional probability f(x2|x1) is a tie at 0.667 for customers buying the 20-inch monitor (X2 = $600) conditioned upon buying the mid-level hardware (X1 = $2000), and those buying the high-end hardware (X1 = $2400).
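Because the joint pdf is a small table, all of the computations in this example reduce to row and column operations on a 3 × 2 array. A minimal numpy sketch (illustrative only):

    import numpy as np

    # Joint pdf from Table 5.1: rows are X1 = (1600, 2000, 2400), columns are X2 = (600, 900).
    joint = np.array([[0.30, 0.25],
                      [0.20, 0.10],
                      [0.10, 0.05]])

    f1 = joint.sum(axis=1)                   # marginal of X1: [0.55, 0.30, 0.15]
    f2 = joint.sum(axis=0)                   # marginal of X2: [0.60, 0.40]

    cond_x1_given_x2 = joint / f2            # columns of Table 5.3
    cond_x2_given_x1 = joint / f1[:, None]   # rows of Table 5.4

    print(joint.sum())                       # 1.0: legitimate pdf
    print(np.round(cond_x1_given_x2, 3))
    print(np.round(cond_x2_given_x1, 3))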

5.2.4 General Extensions

As noted in the section on marginal distributions, it is conceptually straightforward to extend the foregoing ideas and results to the general case with n > 2. Such a general discussion, however, is susceptible to confusion, primarily because the notation can become muddled very quickly. Observe that not only can the variables whose conditional distributions we seek be multivariate, the conditioning variables themselves can also be multivariate (so that the required marginal distributions are multivariate pdfs); and there is always the possibility that there will be some variables left over that are neither of interest, nor in the conditioning set.
To illustrate, consider the 5-dimensional random variable associated with the Avandia clinical test: the primary point of concern that precipitated this study was not so much the effectiveness of the drug in controlling blood sugar; it is the potential adverse side-effect on cardiovascular function. Thus, the researchers may well be concerned with characterizing the pdf for blood pressure, (X4, X5), conditioned upon cholesterol level, (X2, X3), leaving out X1, the blood sugar level. Note how the variable of interest is bivariate, as is the conditioning variable. In this case, the desired conditional pdf is obtained as:

    f(x4, x5|x2, x3) = f2345(x2, x3, x4, x5) / f23(x2, x3)        (5.52)

where f2345(x2, x3, x4, x5) is the joint marginal pdf of (X2, X3, X4, X5) (the full joint pdf with x1 integrated out), and f23(x2, x3) is the bivariate joint marginal pdf for cholesterol level. We see therefore that the principles transfer quite directly, and, when dealing with specific cases in practice (as we have just done), there is usually no confusion. The challenge is how to generalize without confusion.
To present the results in a general fashion and avoid confusion requires adopting a different notation: using the vector X to represent the entire collection of random variables, i.e., X = (X1, X2, ..., Xn), and then partitioning this into three distinct vectors: X*, the variables of interest ((X4, X5) in the Avandia example given above); Y, the conditioning variables ((X2, X3) in the Avandia example); and Z, the remaining variables, if any. With this notation, we now have

    f(x*|y) = f_{x*y}(x*, y) / f_y(y)                             (5.53)

as the most general multivariate conditional distribution, where f_{x*y}(x*, y) is the joint marginal pdf of X* and Y, obtained by integrating (or summing) the full joint pdf over the remaining variables, z.

5.3 Distributional Characteristics of Jointly Distributed Random Variables

The concepts of mathematical expectation and moments used to characterize the distribution of single random variables in Chapter 4 can be extended to multivariate, jointly distributed random variables. Even though we now have many more versions of pdfs to consider (joint, marginal and conditional), the primary notions remain the same.

5.3.1 Expectations

The mathematical expectation of the function U(X) = U(X1, X2, ..., Xn) of an n-dimensional continuous random variable with joint pdf f(x1, x2, ..., xn) is given by:

    E[U(X)] = ∫∫...∫ U(x1, x2, ..., xn) f(x1, x2, ..., xn) dx1 dx2 ... dxn    (5.54)

a direct extension of the single variable definition. The discrete counterpart is:

    E[U(X)] = ∑_{x1} ∑_{x2} ... ∑_{xn} U(x1, x2, ..., xn) f(x1, x2, ..., xn)    (5.55)

Example 5.7 EXPECTATIONS OF CONTINUOUS BIVARIATE RANDOM VARIABLE
From the joint pdf given in Example 5.2 for the reliability of the reactor temperature control system, deduce which component is expected to fail first and by how long it is expected to be outlasted by the more durable component.
Solution:
Recalling that the random variables for this system are X1, the lifetime (in years) of the control hardware electronics, and X2, the lifetime of the control valve on the cooling water line, observe that the function U(X1, X2), defined as

    U(X1, X2) = X1 − X2                                           (5.56)

represents the differential lifetime of the two components; its expected value provides the answer to both aspects of this question, as follows. By the definition of expectations,

    E[U(X1, X2)] = (1/50) ∫_0^∞ ∫_0^∞ (x1 − x2) e^(−(0.2x1 + 0.1x2)) dx1 dx2    (5.57)

The indicated integrals can be evaluated several different ways. Expanding this expression into the difference of two double integrals, as suggested by the multiplying term (x1 − x2), and integrating out x2 in the first and x1 in the second, leads to:

    E[U(X1, X2)] = (1/5) ∫_0^∞ x1 e^(−0.2x1) dx1 − (1/10) ∫_0^∞ x2 e^(−0.1x2) dx2    (5.58)

and upon carrying out the indicated integration by parts, we obtain:

    E(X1 − X2) = 5 − 10 = −5                                      (5.59)

The immediate implication is that the expected lifetime differential favors the control valve (lifetime X2), so that the control hardware electronic component is expected to fail first, with the control valve expected to outlast it by 5 years.
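A quick symbolic check of Eq (5.59), assuming the joint pdf of Eq (5.11):

    import sympy as sp

    x1, x2 = sp.symbols('x1 x2', positive=True)
    f = sp.Rational(1, 50) * sp.exp(-(sp.Rational(1, 5)*x1 + sp.Rational(1, 10)*x2))

    E_diff = sp.integrate((x1 - x2) * f, (x1, 0, sp.oo), (x2, 0, sp.oo))
    print(E_diff)   # -5: the valve (X2) outlasts the electronics (X1) by 5 years on average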
Example 5.8 EXPECTATIONS OF DISCRETE BIVARIATE RANDOM VARIABLE
From the joint pdf given in Example 5.6 for the Apple computer store sales, obtain the expected revenue from each recorded sale.
Solution:
Recall that for this problem, the random variables of interest are X1, the cost of the computer hardware component, and X2, the cost of the monitor in each recorded sale. The appropriate function U(X1, X2) in this case is

    U(X1, X2) = X1 + X2                                           (5.60)

the total amount of money realized on each sale. By the definition of expectations for the discrete bivariate random variable, we have

    E[U(X1, X2)] = ∑_{x1} ∑_{x2} (x1 + x2) f(x1, x2)              (5.61)

Splitting this sum in two and summing out the complementary variable in each term (which produces the marginal pdfs), this is obtained from Table 5.1 as:

    E(X1 + X2) = ∑_{x1} x1 f1(x1) + ∑_{x2} x2 f2(x2)
               = (0.55 × 1600 + 0.30 × 2000 + 0.15 × 2400) + (0.60 × 600 + 0.40 × 900) = 2560    (5.62)

so that the required expected revenue from each sale is $2560.

In the special case where U(X) = e^(t1X1 + t2X2), the expectation E[U(X)] is the joint moment generating function, M(t1, t2), for the bivariate random variable X = (X1, X2), defined by

    M(t1, t2) = E[e^(t1X1 + t2X2)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^(t1x1 + t2x2) f(x1, x2) dx1 dx2    (continuous)
                                   = ∑_{x1} ∑_{x2} e^(t1x1 + t2x2) f(x1, x2)                     (discrete)    (5.63)

for the continuous and the discrete cases, respectively, an expression that generalizes directly for the n-dimensional random variable.

Marginal Expectations

Recall that for the general n-dimensional random variable X = (X1, X2, ..., Xn), the single variable marginal distribution fi(xi) is the distribution of the component random variable Xi alone, as if the others did not exist. It is therefore similar to the single random variable pdf dealt with extensively in Chapter 4. As such, the marginal expectation of U(Xi) is precisely as defined in Chapter 4, i.e.,

    E[U(Xi)] = ∫_{−∞}^{∞} U(xi) fi(xi) dxi                        (5.64)

for the continuous case, and, for the discrete case,

    E[U(Xi)] = ∑_{xi} U(xi) fi(xi)                                (5.65)

In particular, when U(Xi) = Xi, we obtain the marginal mean, μ_Xi, i.e.,

    E(Xi) = μ_Xi = ∫_{−∞}^{∞} xi fi(xi) dxi    (continuous Xi)
                 = ∑_{xi} xi fi(xi)             (discrete Xi)     (5.66)

All the moments (central and ordinary) defined for the single random variable are precisely the same as the corresponding marginal moments for the multidimensional random variable. In particular, the marginal variance is defined as

    σ²_Xi = E[(Xi − μ_Xi)²] = ∫_{−∞}^{∞} (xi − μ_Xi)² fi(xi) dxi    (continuous Xi)
                            = ∑_{xi} (xi − μ_Xi)² fi(xi)             (discrete Xi)    (5.67)

From the expression given for the joint MGF above in Eq (5.63), observe that:

    M(t1, 0) = E[e^(t1X1)]                                        (5.68)
    M(0, t2) = E[e^(t2X2)]                                        (5.69)

are, respectively, the marginal MGFs for f1(x1) and for f2(x2).
Keep in mind that in the general case, marginal distributions can be multivariate; in this case, the context of the problem at hand will make clear what such a joint-marginal distribution will look like after the remaining variables have been integrated out.
Conditional Expectations

As in the discussion about conditional distributions, it is best to deal with bivariate conditional expectations first. For the bivariate random variable X = (X1, X2), the conditional expectation E[U(X1)|X2] (i.e., the expectation of the function U(X1) conditioned upon X2 = x2) is obtained from the conditional distribution as follows:

    E[U(X1)|X2] = ∫_{−∞}^{∞} U(x1) f(x1|x2) dx1    (continuous X)
                = ∑_{x1} U(x1) f(x1|x2)             (discrete X)  (5.70)

with a corresponding expression for E[U(X2)|X1] based on the conditional distribution f(x2|x1). In particular, when U(X1) = X1 (or U(X2) = X2), the result is the conditional mean, defined by:

    E(X1|X2) = μ_{X1|x2} = ∫_{−∞}^{∞} x1 f(x1|x2) dx1    (continuous X)
                         = ∑_{x1} x1 f(x1|x2)             (discrete X)    (5.71)

with a matching corresponding expression for μ_{X2|x1}. Similarly, if

    U(X1) = (X1 − μ_{X1|x2})²                                     (5.72)

we obtain the conditional variance, σ²_{X1|x2}, as:

    σ²_{X1|x2} = E[(X1 − μ_{X1|x2})²] = ∫_{−∞}^{∞} (x1 − μ_{X1|x2})² f(x1|x2) dx1    (continuous X)
                                      = ∑_{x1} (x1 − μ_{X1|x2})² f(x1|x2)             (discrete X)    (5.73)

respectively for the continuous and discrete cases.
These concepts can be extended quite directly to general n-dimensional random variables; but, as noted earlier, one must be careful to avoid confusing notation.
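As an illustration of Eq (5.71), the following sketch computes the conditional mean E(X1|X2 = x2) for the conditional pdf obtained in Example 5.5, Eq (5.43) (an illustrative computation, not from the text):

    import sympy as sp

    x1, x2 = sp.symbols('x1 x2')
    f_cond = (x1 - x2) / (sp.Rational(3, 2) - x2)    # f(x1|x2), Eq (5.43)

    mu = sp.simplify(sp.integrate(x1 * f_cond, (x1, 1, 2)))   # E(X1 | X2 = x2)
    print(mu)               # a function of x2, per Eq (5.71)
    print(mu.subs(x2, 1))   # 5/3: the conditional mean of X1 at the edge x2 = 1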

5.3.2 Covariance and Correlation

Consider the 2-dimensional random variable X = (X1, X2) whose marginal means are given by μ_X1 and μ_X2, and whose respective marginal variances are σ1² and σ2²; the quantity

    σ12 = E[(X1 − μ_X1)(X2 − μ_X2)]                               (5.74)

is known as the covariance of X1 with respect to X2; it is a measure of the mutual dependence of variations in X1 and in X2. It is straightforward to show from Eq (5.74) that

    σ12 = E(X1X2) − μ_X1 μ_X2                                     (5.75)

A popular and more frequently used measure of this mutual dependence is the scaled quantity:

    ρ = σ12/(σ1σ2)                                                (5.76)

where σ1 and σ2 are the positive square roots of the respective marginal variances of X1 and X2. ρ is known as the correlation coefficient, with the attractive property that

    −1 ≤ ρ ≤ 1                                                    (5.77)

The most important points to note about the covariance, σ12, or the correlation coefficient, ρ, are as follows:

1. σ12 will be positive if values of X1 > μ_X1 are generally associated with values of X2 > μ_X2, or when values of X1 < μ_X1 tend to be associated with values of X2 < μ_X2. Such variables are said to be positively correlated, and ρ will be positive (ρ > 0), with the strength of the correlation indicated by the absolute value of ρ: weakly correlated variables will have low ρ values close to zero, while strongly correlated variables will have ρ values close to 1. (See Fig 5.2.) For perfectly positively correlated variables, ρ = 1.

2. The reverse is the case when σ12 is negative: for such variables, values of X1 > μ_X1 appear preferentially together with values of X2 < μ_X2, or else values of X1 < μ_X1 tend to be associated more with values of X2 > μ_X2. In this case, the variables are said to be negatively correlated, and ρ will be negative (ρ < 0); once again, the strength of correlation is indicated by the absolute value of ρ. (See Fig 5.3.) For perfectly negatively correlated variables, ρ = −1.

3. If the behavior of X1 has little or no bearing on that of X2, as one might expect, σ12 and ρ will tend to be close to zero (see Fig 5.4); and when the two random variables are completely independent of each other, then both σ12 and ρ will be exactly zero.

This last point brings up the concept of stochastic independence.
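These three behaviors are easy to reproduce by simulation. The following sketch generates synthetic positively correlated, negatively correlated, and unrelated pairs (the specific constants are arbitrary illustrative choices) and computes the sample analogs of Eqs (5.74) and (5.76):

    import numpy as np

    rng = np.random.default_rng(seed=1)
    x1 = rng.normal(3, 1, size=500)

    x2_pos = 5 * x1 + rng.normal(0, 2, size=500)    # positively correlated with x1
    x2_neg = -5 * x1 + rng.normal(0, 2, size=500)   # negatively correlated with x1
    x2_ind = rng.normal(20, 5, size=500)            # unrelated to x1

    for x2 in (x2_pos, x2_neg, x2_ind):
        cov = np.mean((x1 - x1.mean()) * (x2 - x2.mean()))   # sample analog of Eq (5.74)
        rho = cov / (x1.std() * x2.std())                    # sample analog of Eq (5.76)
        print(round(rho, 3))   # close to +1, close to -1, close to 0, respectively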

5.3.3 Independence

Consider a situation where electronic component parts manufactured at two different plant sites are labeled "1" for plant site 1, and "2" for the other. After combining these parts into one lot, each part is drawn at random and tested: if found defective, it is labeled "0"; otherwise it is labeled "1". Now consider the 2-dimensional random variable X = (X1, X2) where X1 is the location of the manufacturing site (1 or 2), and X2 is the after-test status of the electronic component part (0 or 1). If after many such draws and tests, we discover that whether or not the part is defective has absolutely nothing to do with where it was manufactured (i.e., a defective part is just as likely to come from one plant site as the other), we say that X1 is independent of X2. A formal definition now follows:

FIGURE 5.2: Positively correlated variables: ρ = 0.923 (scatter plot of X2 versus X1)

FIGURE 5.3: Negatively correlated variables: ρ = −0.689 (scatter plot of X2 versus X1)

FIGURE 5.4: Essentially uncorrelated variables: ρ = 0.085 (scatter plot of X2 versus X1)

Definition: Stochastic Independence
Let X = (X1, X2) be a 2-dimensional random variable, discrete or continuous; X1 and X2 are independent if the following conditions hold:
1. f(x2|x1) = f2(x2);
2. f(x1|x2) = f1(x1); and
3. f(x1, x2) = f1(x1) f2(x2)

The first point indicates that the distribution of X2 conditional on X1 is identical to the unconditional (or marginal) distribution of X2. In other words, conditioning on X1 has no effect on the distribution of X2, indicating that X2 is independent of X1. However, this very fact (that X2 is independent of X1) also immediately implies the converse: that X1 is independent of X2 (i.e., that the independence in this case is mutual). To establish this, we note that, by definition, in Eq (5.32),

    f(x2|x1) = f(x1, x2)/f1(x1)                                   (5.78)

However, when X2 is independent of X1,

    f(x2|x1) = f2(x2)                                             (5.79)

i.e., point 1 above holds; as a consequence, by replacing f(x2|x1) in Eq (5.78) with f2(x2), we obtain:

    f(x1, x2) = f1(x1) f2(x2)                                     (5.80)

which, first of all, is item 3 in the definition above, but just as importantly, when substituted into the numerator of the expression in Eq (5.31), i.e.,

    f(x1|x2) = f(x1, x2)/f2(x2)

when the conditioning is now on X2, reduces this equation to

    f(x1|x2) = f1(x1)                                             (5.81)

which is item number 2 above, indicating that X1 is also independent of X2. The two variables are therefore said to be mutually stochastically independent.
Let us now return to a point made earlier after Example 5.4. There we noted that the distributional characteristics of the random variables X1, the lifetime (in years) of the control hardware electronics, and X2, the lifetime of the control valve on the cooling water line, were such that they satisfied conditions now recognizable as the ones given in points 1 and 2 above. It is therefore now clear that the special relationship between these two random variables alluded to back then is that they are stochastically independent. Note that the joint pdf, f(x1, x2), for this system is a product of the two marginal pdfs, as in condition 3 above. This is not the case for the random variables in Example 5.5.
The following example takes us back to yet another example encountered earlier.
Example 5.9 INDEPENDENCE OF TWO DISCRETE RANDOM VARIABLES
Return to the two-coin toss experiment discussed in Example 5.1. From the joint pdf obtained for this bivariate random variable (given in Eq (5.7)), show that the two random variables, X1 (the number of heads obtained in the first toss) and X2 (the number of heads obtained in the second toss), are independent.
Solution:
By definition, and from the results in that example, the marginal distributions are obtained as follows:

    f1(x1) = ∑_{x2} f(x1, x2) = f(x1, 0) + f(x1, 1)               (5.82)

so that f1(0) = 1/2; f1(1) = 1/2. Similarly,

    f2(x2) = ∑_{x1} f(x1, x2) = f(0, x2) + f(1, x2) = 1/2         (5.83)

so that f2(0) = 1/2; f2(1) = 1/2; i.e.,

    f1(x1) = 1/2;  x1 = 0
             1/2;  x1 = 1
             0;    otherwise                                      (5.84)

    f2(x2) = 1/2;  x2 = 0
             1/2;  x2 = 1
             0;    otherwise                                      (5.85)

If we now tabulate the joint pdf and the marginal pdfs, we obtain the result in Table 5.5.

TABLE 5.5: Joint and marginal pdfs for two-coin toss problem of Example 5.1

    X1 \ X2    0      1      f1(x1)
    0          1/4    1/4    1/2
    1          1/4    1/4    1/2
    f2(x2)     1/2    1/2    1

It is now clear that for all x1 and x2,

    f(x1, x2) = f1(x1) f2(x2)                                     (5.86)

so that these two random variables are independent.
Of course, we know intuitively that the number of heads obtained in the first toss should have no effect on the number of heads obtained in the second toss, but this fact has now been established theoretically.

The concept of independence is central to a great deal of the strategies for solving problems involving random phenomena. The ideas presented in this section are therefore used repeatedly in upcoming chapters in developing models, and in solving many practical problems.
The following is one additional consequence of stochastic independence. If X1 and X2 are independent, then

    E[U(X1)G(X2)] = E[U(X1)]E[G(X2)]                              (5.87)

An immediate consequence of this fact is that, in this case,

    σ12 = 0                                                       (5.88)
    ρ = 0                                                         (5.89)

since, by definition,

    σ12 = E[(X1 − μ_X1)(X2 − μ_X2)]                               (5.90)

and, by virtue of Eq (5.87), independence implies:

    σ12 = E[(X1 − μ_X1)] · E[(X2 − μ_X2)] = 0                     (5.91)

It also follows that ρ = 0, since it is σ12/(σ1σ2).
A note of caution: it is possible for E[U(X1)G(X2)] to equal the product of expectations, E[U(X1)]E[G(X2)], by chance, without X1 and X2 being independent; however, if X1 and X2 are independent, then Eq (5.87) will hold. This expression is therefore a necessary but not sufficient condition for independence.
We must exercise care in extending the definition of stochastic independence to the n-dimensional random variable where n > 2. The random variables X1, X2, ..., Xn are said to be mutually stochastically independent if, and only if,

    f(x1, x2, ..., xn) = ∏_{i=1}^{n} fi(xi)                       (5.92)

where f(x1, x2, ..., xn) is the joint pdf, and fi(xi); i = 1, 2, ..., n, are the n individual marginal pdfs. On the other hand, these random variables are pairwise stochastically independent if every pair Xi, Xj; i ≠ j, is stochastically independent.
Obviously, mutual stochastic independence implies pairwise stochastic independence, but not vice versa.
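The "necessary but not sufficient" nature of Eq (5.87) can be illustrated by simulation: with X1 standard normal and X2 = X1², X2 is completely determined by X1, yet the covariance is approximately zero. A minimal sketch (an illustrative choice of distributions):

    import numpy as np

    rng = np.random.default_rng(seed=7)
    x1 = rng.normal(0, 1, size=100_000)
    x2 = x1 ** 2                     # completely determined by x1, hence dependent

    cov = np.mean(x1 * x2) - x1.mean() * x2.mean()
    print(round(cov, 3))             # approximately 0: uncorrelated, yet not independent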

5.4 Summary and Conclusions

The primary objective of this chapter was to extend the ideas presented in Chapter 4 for the single random variable to the multidimensional case, where the outcome of interest involves two or more random variables simultaneously. With such higher-dimensional random variables, it became necessary to introduce a new variety of pdfs different from, but still related to, the familiar one encountered in Chapter 4: the joint pdf, to characterize joint variation among the variables; the marginal pdfs, to characterize the individual behavior of each variable in isolation from the others; and the conditional pdfs, to characterize the behavior of one random variable conditioned upon fixing the others at prespecified values. This new array of pdfs provides the full set of mathematical tools for characterizing various aspects of multivariate random variables, much as the f(x) of Chapter 4 did for single random variables.
The possibility of two or more random variables co-varying simultaneously, which was not of concern with single random variables, led to the introduction of two additional and related quantities, covariance and correlation, with which one quantifies the mutual dependence of two random variables. This in turn led to the important concept of stochastic independence, that one random variable is entirely unaffected by another. As we shall see in subsequent chapters, when dealing with multiple random variables, the analysis of joint behavior is considerably simplified if the random variables in question are independent. We shall therefore have cause to recall some of the results of this chapter at that time.
Here are some of the main points of the chapter again.

• A multivariate random variable is defined in the same manner as a single random variable, but the associated space, V, is higher-dimensional.

• The joint pdf of a bivariate random variable, f(x1, x2), shows how the probabilities are distributed over the two-dimensional random variable space; the joint cdf, F(x1, x2), represents the probability, P(X1 < x1; X2 < x2); they both extend directly to higher-dimensional random variables.

• In addition to the joint pdf, two other pdfs are needed to characterize multidimensional random variables fully: the marginal pdf, fi(xi), which characterizes the individual behavior of each random variable, Xi, by itself, regardless of the others; and the conditional pdf, f(xi|xj), which characterizes the behavior of Xi conditioned upon Xj taking on specific values. These pdfs can be used to obtain such random variable characteristics as joint, marginal and conditional expectations.

• The covariance of two random variables, X1 and X2, defined as

      σ12 = E[(X1 − μ_X1)(X2 − μ_X2)]

  (where μ_X1 and μ_X2 are the respective marginal expectations), provides a measure of the mutual dependence of variations in X1 and X2. The related correlation coefficient, the scaled quantity

      ρ = σ12/(σ1σ2)

  (where σ1 and σ2 are the positive square roots of the respective marginal variances of X1 and X2), has the property that −1 ≤ ρ ≤ 1, with |ρ| indicating the strength of the mutual dependence, and the sign indicating the direction (negative or positive).

• Two random variables, X1 and X2, are independent if the behavior of one has no bearing on the behavior of the other; more formally, f(x1|x2) = f1(x1) and f(x2|x1) = f2(x2), so that f(x1, x2) = f1(x1) f2(x2).

REVIEW QUESTIONS

1. What characteristic of the Avandia clinical test makes it relevant to the discussion of this chapter?
2. How many random variables at a time can the probability machinery of Chapter 4 deal with?
3. In dealing with several random variables simultaneously, what are some of the questions to be considered that were not of concern when dealing with single random variables in Chapter 4?
4. Define a bivariate random variable formally.
5. Informally, what is a bivariate random variable?
6. Define a multivariate random variable formally.
7. State the axiomatic definition of the joint pdf of a discrete bivariate random variable and of its continuous counterpart.
8. What is the general relationship between the cdf, F(x1, x2), of a continuous bivariate random variable and its pdf, f(x1, x2)? What conditions must be satisfied for this relationship to exist?
9. Define the marginal distributions, f1(x1) and f2(x2), for a two-dimensional random variable with a joint pdf f(x1, x2).
10. Do marginal pdfs possess the usual properties of pdfs or are they different?
11. Given a bivariate joint pdf, f(x1, x2), define the conditional pdfs, f(x1|x2) and f(x2|x1).
12. In what way is the definition of a conditional pdf similar to the conditional probability of events A and B defined on a sample space, Ω?
13. Define the expectation, E[U(X1, X2)], for a bivariate random variable. Extend this to an n-dimensional (multivariate) random variable.
14. Define the marginal expectation, E[U(Xi)], for a bivariate random variable. Extend this to an n-dimensional (multivariate) random variable.
15. Define the conditional expectations, E[U(X1)|X2] and E[U(X2)|X1], for a bivariate random variable.
16. Given two random variables, X1 and X2, define their covariance.
17. What is the relationship between the covariance and the correlation coefficient?
18. What does a negative correlation coefficient indicate about the relationship between two random variables, X1 and X2? What does a positive correlation coefficient indicate?
19. If the behavior of the random variable, X1, has little bearing on that of X2, how will this manifest in the value of the correlation coefficient, ρ?
20. When the correlation coefficient of two random variables, X1 and X2, is such that |ρ| ≈ 1, what does this indicate about the random variables?
21. What does it mean that two random variables, X1 and X2, are stochastically independent?
22. If two random variables are independent, what is the value of their covariance, and of their correlation coefficient?
23. When dealing with n > 2 random variables, what is the difference between pairwise stochastic independence and mutual stochastic independence? Does one always imply the other?

EXERCISES
Sections 5.1 and 5.2
5.1 Revisit Example 5.1 in the text and define the two-dimensional random variable
(X1, X2) as follows: X1 is the total number of heads, and X2 is the total number of tails. Obtain the space, V, and determine the complete pdf, f(x1, x2), for
x1 = 0, 1, 2; x2 = 0, 1, 2, assuming equiprobable outcomes in the original sample
space.
5.2 The two-dimensional random variable (X1, X2) has the following joint pdf:

f(1, 1) = 1/4;    f(2, 1) = 3/8;
f(1, 2) = 1/8;    f(2, 2) = 1/8;
f(1, 3) = 1/16;   f(2, 3) = 1/16

(i) Determine the following probabilities: (a) P(X1 ≥ X2); (b) P(X1 + X2 = 4); (c)
P(|X2 − X1| = 1); (d) P(X1 + X2 is even).
(ii) Obtain the joint cumulative distribution function, F(x1, x2).
5.3 In a game of chess, one player either wins, W, loses, L, or draws, D (either by
mutual agreement with the opponent, or as a result of a stalemate). Consider a
player participating in a two-game, pre-tournament qualification series:
(i) Obtain the sample space, Ω.
(ii) Define the two-dimensional random variable (X1, X2) where X1 is the total
number of wins, and X2 is the total number of draws. Obtain V and, assuming
equiprobable outcomes in the original sample space, determine the complete joint
pdf, f(x1, x2).
(iii) If the player is awarded 3 points for a win, 1 point for a draw and no point for a
loss, define the random variable Y as the total number of points assigned to a player


at the end of the two-game preliminary round. If a player needs at least 4 points to
qualify, determine the probability of qualifying.
5.4 Revisit Exercise 5.3 above but this time consider three players: Suzie, the superior player for whom the probability of winning a game is pW = 0.75, the probability of
drawing, pD = 0.2, and the probability of losing, pL = 0.05; Meredith, the mediocre
player for whom pW = 0.5; pD = 0.3; pL = 0.2; and Paula, the poor player, for
whom pW = 0.2; pD = 0.3; pL = 0.5. Determine the complete joint pdf for each
player, fS(x1, x2) for Suzie, fM(x1, x2) for Meredith, and fP(x1, x2) for Paula;
and from these, determine for each player, the probability that she qualifies for the
tournament.
5.5 The continuous random variables X1 and X2 have the joint pdf:

f(x1, x2) = { c x1 x2 (1 − x2);  0 < x1 < 2; 0 < x2 < 1
            { 0;                 elsewhere        (5.93)

(i) Find the value of c if this is to be a valid pdf.
(ii) Determine P(1 < x1 < 2; 0.5 < x2 < 1) and P(x1 > 1; x2 < 0.5).
(iii) Determine F(x1, x2).
5.6 Revisit Exercise 5.5.
(i) Obtain the marginal pdfs f1(x1), f2(x2), and the marginal means, μX1, μX2. Are
X1 and X2 independent?
(ii) Obtain the conditional pdfs f(x1|x2) and f(x2|x1).
5.7 The joint pdf f(x1, x2) for a two-dimensional random variable is given by the
following table:

                X1
   X2      0      1      2
    0      0      0      1/4
    1      0      1/2    0
    2      1/4    0      0

(i) Obtain the marginal pdfs, f1 (x1 ) and f2 (x2 ), and determine whether or not X1
and X2 are independent.
(ii) Obtain the conditional pdfs f (x1 |x2 ) and f (x2 |x1 ). Describe in words what these
results imply in terms of the original experiments and these random variables.
(iii) It is conjectured that this joint pdf is for an experiment involving tossing a fair
coin twice, with X1 as the total number of heads, and X2 as the total number of
tails. Are the foregoing results consistent with this conjecture? Explain.
5.8 Given the joint pdf:

f(x1, x2) = { c e^{−(x1 + x2)};  0 < x1 < 1; 0 < x2 < 2
            { 0;                 elsewhere        (5.94)

First obtain c, then obtain the marginal pdfs f1 (x1 ) and f2 (x2 ), and hence determine
whether or not X1 and X2 are independent.

5.9 If the ranges of validity of the joint pdf in Exercise 5.8 and Eq (5.94) are modified
to 0 < x1 < ∞ and 0 < x2 < ∞, obtain c and the marginal pdfs, and then determine
whether or not these random variables are now independent.
Section 5.3
5.10 Revisit Exercise 5.3. From the joint pdf determine
(i) E[U (X1 , X2 ) = X1 + X2 ].
(ii) E[U (X1 , X2 ) = 3X1 + X2 ]. Use this result to determine if the player will be
expected to qualify or not.
5.11 For each of the three players in Exercise 5.4,
(i) Determine the marginal pdfs, f1(x1) and f2(x2), and the marginal means,
μX1, μX2.
(ii) Determine E[U (X1 , X2 ) = 3X1 + X2 ] and use the result to determine which of
the three players, if any, will be expected to qualify for the tournament.
5.12 Determine the covariance and correlation coefficient for the two random variables whose joint pdf, f(x1, x2), is given in the table in Exercise 5.7.
5.13 For each of the three chess players in Exercise 5.4, Suzie, Meredith, and Paula,
and from the joint pdf of each player's performance at the pre-tournament qualifying
games, determine the covariance and correlation coefficients for each player. Discuss
what these results imply in terms of the relationship between wins and draws for
each player.
5.14 The joint pdf for two random variables X and Y is given as:

f(x, y) = { x + y;  0 < x < 1; 0 < y < 1
          { 0;      elsewhere        (5.95)

(i) Obtain f(x|y) and f(y|x) and show that these two random variables are not
independent.
(ii) Obtain the covariance, σXY, and the correlation coefficient, ρ. Comment on the
strength of the correlation between these two random variables.

APPLICATION PROBLEMS
5.15 Refer to Application Problem 3.23 in Chapter 3, where the relationship between
a blood assay used to determine lithium concentration in blood samples and lithium
toxicity in 150 patients was presented in a table reproduced here for ease of reference.

             Lithium Toxicity
Assay      L+      L−      Total
A+         30      17      47
A−         21      82      103
Total      51      92      150

A+ indicates high lithium concentrations in the blood assay and A− indicates
low lithium concentration; L+ indicates confirmed lithium toxicity and L− indicates
no lithium toxicity.


(i) In general, consider the assay result as the random variable Y having two possible
outcomes y1 = A+, and y2 = A−; and consider the true lithium toxicity status as
the random variable X also having two possible outcomes x1 = L+, and
x2 = L−. Now consider that the relative frequencies (or proportions) indicated in
the data table can be approximately considered as close enough to true probabilities;
convert the data table to a table of joint probability distribution f(x, y). What is
the probability that the test method will produce the right result?
(ii) From the table of the joint pdf, compute the following probabilities and explain what they mean in words in terms of the problem at hand: f(y2|x2); f(y1|x2);
f(y2|x1).
5.16 The reliability of the temperature control system for a commercial, highly
exothermic polymer reactor presented in Example 5.2 in the text is known to depend
on the lifetimes (in years) of the control hardware electronics, X1, and of the control
valve on the cooling water line, X2; the joint pdf is:

f(x1, x2) = { (1/50) e^{−(0.2x1 + 0.1x2)};  0 < x1 < ∞; 0 < x2 < ∞
            { 0;                            elsewhere

(i) Determine the probability that the control valve outlasts the control hardware
electronics.
(ii) Determine the converse probability that the controller hardware electronics outlast the control valve.
(iii) If a component is replaced every time it fails, how frequently can one expect to
replace the control valve, and how frequently can one expect to replace the controller
hardware electronics?
(iv) If it costs $20,000 to replace the control hardware electronics and $10,000 to
replace the control valve, how much should be budgeted over the next 20 years for
keeping the control system functioning, assuming all other characteristics remain
essentially the same over this period?
5.17 In a major bio-vaccine research company, it is inevitable that workers are exposed to some hazardous, but highly treatable, disease-causing agents. According
to papers filed with the Safety and Hazards Authorities of the state in which the
facility is located, the treatment provided is tailored to the worker's age (the variable, X: 0 if younger than 30 years; 1 if 31 years or older), and location in the
facility (a surrogate for virulence of the proprietary strains used in various parts of
the facility, represented by the variable Y = 1, 2, 3 or 4). The composition of the
2,500 employees at the company's research headquarters is shown in the table below:
                   Location
Age         1      2      3      4
< 30        6%     20%    13%    10%
≥ 31        17%    14%    12%    8%

(i) If a worker is infected at random so that the outcome is the bivariate random
variable (X, Y ) where X has two outcomes, and Y has four, obtain the pdf f (x, y)
from the given data (assuming each worker in each location has an equal chance of
infection); and determine the marginal pdfs f1 (x) and f2 (y).


(ii) What is the probability that a worker in need of treatment was infected in
location 3 or 4 given that he/she is < 30 years old?
(iii) If the cost of treating each infected worker (in dollars per year) is given by
the expression

C = 1500 − 100Y + 500X        (5.96)

how much should the company expect to spend per worker every year, assuming the
worker composition remains the same year after year?
5.18 A non-destructive quality control test on a military weapon system correctly
detects a flaw in the central electronic guidance subunit if one exists, or correctly
accepts the system as fully functional if no flaw exists, 85% of the time; it incorrectly
identifies a flaw when one does not exist (a "false positive"), 5% of the time, and
incorrectly fails to detect a flaw when one exists (a "false negative"), 10% of the time.
When the test is repeated 5 times under mostly identical conditions, if X1 is the
number of times the test is correct, and X2 is the number of times it registers a false
positive, the joint pdf of these two random variables is given as:

f(x1, x2) = [120/(x1! x2! (5 − x1 − x2)!)] 0.85^{x1} 0.05^{x2} 0.10^{(5 − x1 − x2)}        (5.97)

(i) Why is no consideration given in the expression in Eq (5.97) to the third random
variable, X3 , the number of times the test registers a false negative?
(ii) From Eq (5.97), generate a 5 × 5 table of f(x1, x2) for all the possible outcomes
and from this obtain the marginal pdfs, f1 (x1 ) and f2 (x2 ). Are these two random
variables independent?
(iii) Determine the expected number of correct test results regardless of the other
results; also determine the expected value of false positives regardless of other results.
(iv) What is the expected number of the total number of correct results and false
positives? Is this value the same as the sum of the expected values obtained in (iii)?
Explain.

Chapter 6
Random Variable Transformations

6.1 Introduction and Problem Definition ................................ 171
6.2 Single Variable Transformations .................................... 172
    6.2.1 Discrete Case ................................................ 173
          A Practical Application ..................................... 173
    6.2.2 Continuous Case .............................................. 175
    6.2.3 General Continuous Case ...................................... 176
    6.2.4 Random Variable Sums ......................................... 177
          The Cumulative Distribution Function Approach ............... 177
          The Characteristic Function Approach ........................ 179
6.3 Bivariate Transformations .......................................... 181
6.4 General Multivariate Transformations ............................... 184
    6.4.1 Square Transformations ....................................... 184
    6.4.2 Non-Square Transformations ................................... 185
    6.4.3 Non-Monotone Transformations ................................. 188
6.5 Summary and Conclusions ............................................ 188
    REVIEW QUESTIONS ................................................... 189
    EXERCISES .......................................................... 190
    APPLICATION PROBLEMS ............................................... 192

From a god to a bull! a heavy descension! it was Jove's case.
From a prince to a prentice! a low transformation! that shall be mine;
for in every thing the purpose must weigh with the folly. Follow me, Ned.
King Henry the Fourth,
William Shakespeare (1564-1616)

Many problems of practical interest involve a random variable Y that is defined as a function of another random variable X, say according to Y = φ(X),
so that the characteristics of the one arise directly from those of the other via
the indicated transformation. In particular, if we already know the probability
distribution function for X as fX(x), it will be helpful to know how to determine the corresponding distribution function for Y. This chapter presents
techniques for characterizing functions of random variables, and the results,
important in their own right, become particularly useful in Part III where
probability models are derived for random phenomena of importance in engineering and science.

6.1 Introduction and Problem Definition

The problem of primary interest to us in this chapter may be stated as
follows:

Given a random variable X with pdf fX(x), we are interested in
deriving an expression for the corresponding pdf fY(y) for the
random variable Y related to X according to:

Y = φ(X)        (6.1)

More generally, given the n-dimensional random variable X =
(X1, X2, . . . , Xn) with the joint pdf fX(x), we want to find the corresponding
pdf fY(y), for the m-dimensional random variable Y = (Y1, Y2, . . . , Ym) when
the two are related according to:

Y1 = φ1(X1, X2, . . . , Xn);
Y2 = φ2(X1, X2, . . . , Xn);
...
Ym = φm(X1, X2, . . . , Xn)        (6.2)

As demonstrated in later chapters, these results are extremely useful in deriving probability models for more complicated random variables from the
probability models of simpler ones.

6.2 Single Variable Transformations

We begin with the simplest case when Y is a function of a single variable
X:

Y = φ(X); X ∈ VX        (6.3)

φ is a continuous function that transforms each point x in VX, the space over
which the random variable X is defined, to y, thereby mapping VX onto the
corresponding space VY for the resulting random variable Y. Furthermore, this
transformation is one-to-one in the sense that each point in VX corresponds
to one and only one point in VY. In this case, the inverse transformation,

X = ψ(Y); Y ∈ VY        (6.4)

exists and is also one-to-one. The procedure for obtaining fY(y) given fX(x)
is highly dependent on the nature of the random variable in question, being
more straightforward for the discrete case than for the continuous.

6.2.1 Discrete Case

When X is a discrete random variable, we have

fY(y) = P(Y = y) = P(X = ψ(y)) = fX[ψ(y)]; Y ∈ VY        (6.5)

We illustrate this straightforward result first with the following simple example.
Example 6.1 LINEAR TRANSFORMATION OF A POISSON
RANDOM VARIABLE
As discussed in more detail in Part III, the discrete random variable X
having the following pdf:

fX(x) = λ^x e^{−λ}/x!; x = 0, 1, 2, 3, . . .        (6.6)

is a Poisson random variable; it provides a useful model of random
phenomena involving the occurrence of rare events in a finite interval
of length, time or space. Find the pdf fY(y) for the random variable Y
related to X according to:

Y = 2X        (6.7)

Solution:
First we note that the transformation in Eq (6.7) is one-to-one, mapping
VX = {0, 1, 2, 3, . . .} onto VY = {0, 2, 4, 6, . . .}; the inverse transformation is:

X = Y/2        (6.8)

so that from Eq (6.5) we obtain:

fY(y) = P(Y = y) = P(X = y/2) = λ^{y/2} e^{−λ}/(y/2)!; y = 0, 2, 4, 6, . . .        (6.9)

Thus, under the transformation Y = 2X and fX(x) as given in Eq (6.6),
the desired pdf fY(y) is given by:

fY(y) = λ^{y/2} e^{−λ}/(y/2)!; y = 0, 2, 4, 6, . . .        (6.10)
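As a quick numerical aside (not part of the original development), Eq (6.10) is easy to check by simulation. The Python sketch below assumes only NumPy; the parameter value λ = 1.5 is an arbitrary choice for illustration:

```python
# A minimal simulation sketch checking Eq (6.10): if X ~ Poisson(lambda),
# then Y = 2X takes only even values, with
# P(Y = y) = lambda^(y/2) e^(-lambda) / (y/2)!.
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(1)
lam = 1.5
y = 2 * rng.poisson(lam, size=200_000)   # samples of Y = 2X

for yy in (0, 2, 4, 6):
    empirical = float(np.mean(y == yy))
    theoretical = lam ** (yy // 2) * exp(-lam) / factorial(yy // 2)
    print(yy, round(empirical, 4), round(theoretical, 4))
```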

A Practical Application
The number of times, X, that each cell in a cell culture divides in a time
interval of length, t, is a random variable whose specific value depends on
many factors both intrinsic (e.g. individual cell characteristics) and extrinsic
(e.g. media characteristics, temperature, oxygen). As discussed in Chapter 8,
the underlying random phenomenon matches well to that of the ideal Poisson
random variable, so that if η is the mean rate of division per unit time associated with a particular cell population, the probability distribution of X is
given by Eq (6.11):

fX(x) = (ηt)^x e^{−(ηt)}/x!; x = 0, 1, 2, 3, . . .        (6.11)

where, in terms of Eq (6.6),

λ = ηt        (6.12)

In many cases, however, the cell culture characteristic of interest is not so
much the number of times that each cell divides as it is the number of cells,
Y, in the culture after the passage of a specific amount of time. For each cell
in the culture, the relationship between these two random variables is given
by:

Y = 2^X        (6.13)

The problem of interest is now to find fY(y), the pdf of the number of cells
in the culture, given fX(x).
As with the simple example given above, note that the transformation in
(6.13), even though nonlinear, is one-to-one, mapping VX = {0, 1, 2, 3, . . .}
onto VY = {1, 2, 4, 8, . . .}; the inverse transformation is:

X = log₂ Y        (6.14)

From here, we easily obtain the required fY(y) as:

fY(y) = e^{−λ} λ^{log₂ y}/(log₂ y)!; y = 1, 2, 4, 8, . . .        (6.15)

a somewhat unwieldy-looking, but nonetheless valid, pdf that can be simplified
a bit by noting that:

λ^{log₂ y} = y^{log₂ λ} = y^β        (6.16)

if we define

β = log₂ λ        (6.17)

a logarithmic transformation of the Poisson parameter λ. Thus:

fY(y) = e^{−(2^β)} y^β/(log₂ y)!; y = 1, 2, 4, 8, . . .        (6.18)

It is possible to confirm that the pdf obtained in Eq (6.18) for Y, the number
of cells in the culture after a time interval t, is a valid pdf for which:

Σ_y fY(y) = 1        (6.19)

since, from Eq (6.18),

Σ_y fY(y) = e^{−(2^β)} [1 + 2^β/1 + 2^{2β}/2! + 2^{3β}/3! + ⋯] = e^{−(2^β)} e^{(2^β)} = 1        (6.20)

The mean number of cells in the culture after time t, E[Y], can be shown (see
end-of-chapter Exercise 6.2) to be:

E[Y] = e^λ        (6.21)

which should be compared with E[X] = λ.
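Again as a purely illustrative check (not from the text), Eq (6.21) can be verified by simulation; note that the naive guess 2^{E[X]} = 2^λ is wrong. The sketch below assumes NumPy, with λ = 2 chosen arbitrarily:

```python
# A simulation sketch of the cell-division result: with X ~ Poisson(lambda)
# divisions, Y = 2^X cells, E[Y] equals e^lambda (Eq (6.21)), not 2^lambda.
import numpy as np

rng = np.random.default_rng(2)
lam = 2.0
x = rng.poisson(lam, size=500_000)
y = 2.0 ** x

print(y.mean(), np.exp(lam))   # both approximately e^2 = 7.389
print(2.0 ** lam)              # 4.0: the naive guess, which is wrong
```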

6.2.2 Continuous Case

When X is a continuous random variable, things are slightly different. In
addition to the inverse transformation given in Eq (6.4), let us define the
function:

J = d/dy [ψ(y)] = ψ′(y)        (6.22)

known as the Jacobian of the inverse transformation. If the transformation is
such that it is strictly monotonic (increasing or decreasing), then it can be
shown that:

fY(y) = fX[ψ(y)]|J|; Y ∈ VY        (6.23)

The argument goes as follows: If FY(y) is the cdf for the new variable Y, then

FY(y) = P(Y ≤ y) = P(φ(X) ≤ y) = P(X ≤ ψ(y)) = FX[ψ(y)]        (6.24)

and by differentiation, we obtain:

fY(y) = dFY(y)/dy = d/dy {FX[ψ(y)]} = fX[ψ(y)] d/dy {ψ(y)}        (6.25)

with the derivative on the RHS positive for a strictly monotonic increasing
function. It can be shown that if φ were monotonically decreasing, the expression in (6.24) will yield:

fY(y) = −fX[ψ(y)] d/dy {ψ(y)}        (6.26)

with the derivative on the RHS as a negative quantity. Both results may be
combined into one as

fY(y) = fX[ψ(y)] |d/dy {ψ(y)}|        (6.27)

as presented in Eq (6.23). Let us illustrate this with another example.

Example 6.2 LOG TRANSFORMATION OF A UNIFORM
RANDOM VARIABLE
The random variable X with the following pdf:

fX(x) = { 1; 0 < x < 1
        { 0; otherwise        (6.28)

is identified in Part III as the uniform random variable. Determine the
pdf for the random variable Y obtained via the transformation:

Y = −β ln X        (6.29)

Solution:
The transformation is one-to-one, maps VX = {0 < x < 1} onto VY =
{0 < y < ∞}, and the inverse transformation is given by:

X = ψ(y) = e^{−y/β}; 0 < y < ∞        (6.30)

The Jacobian of the inverse transformation is:

J = ψ′(y) = −(1/β) e^{−y/β}        (6.31)

Thus, from Eq (6.23) or Eq (6.27), we obtain the required pdf as:

fY(y) = fX[ψ(y)]|J| = { (1/β) e^{−y/β};  0 < y < ∞
                      { 0;               otherwise        (6.32)

These two random variables and their corresponding models are discussed
more fully in Part III.
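It may help to see Example 6.2 run "forward" as a random number generator; this is the classic inverse-transform sampling idea. The Python sketch below is illustrative only (not from the text), assumes NumPy, and uses an arbitrary β = 2:

```python
# Pushing uniform X through Y = -beta * ln(X) should yield exponentially
# distributed Y with mean beta, per the pdf of Eq (6.32).
import numpy as np

rng = np.random.default_rng(3)
beta = 2.0
x = rng.uniform(0.0, 1.0, size=200_000)
y = -beta * np.log(x)

print(y.mean())          # approximately beta = 2.0
print(np.mean(y > 1.0))  # approximately exp(-1/beta) = 0.6065
```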

6.2.3 General Continuous Case

When the transformation Y = φ(X) is not strictly monotone, the result
given above is modified as follows: Let the function y = φ(x) possess a countable number of roots, xi, represented as a function of y as:

xi = φi^{−1}(y) = ψi(y); i = 1, 2, 3, . . . , k        (6.33)

with corresponding Jacobians:

Ji = d/dy {ψi(y)}        (6.34)

then it can be shown that Eq (6.23) (or equivalently Eq (6.27)) becomes:

fY(y) = Σ_{i=1}^{k} fX[ψi(y)]|Ji|        (6.35)

Let us illustrate with an example.


Example 6.3 THE SQUARE OF A STANDARD NORMAL
RANDOM VARIABLE
The random variable X has the following pdf:

fX(x) = (1/√(2π)) e^{−x²/2}; −∞ < x < ∞        (6.36)

Determine the pdf for the random variable Y obtained via the transformation:

y = x²        (6.37)

Solution:
Observe that this transformation, which maps the space VX =
{−∞ < x < ∞} onto VY = {0 < y < ∞}, is not one-to-one; for all y > 0
there are two x's corresponding to each y, since the inverse transformation is given by:

x = ±√y        (6.38)

The transformation thus has 2 roots for x:

x1 = ψ1(y) = +√y
x2 = ψ2(y) = −√y        (6.39)

and upon computing the corresponding derivatives, Eq (6.35) becomes

fY(y) = fX(√y) (y^{−1/2}/2) + fX(−√y) (y^{−1/2}/2)        (6.40)

which simplifies to:

fY(y) = (1/√(2π)) e^{−y/2} y^{−1/2}; 0 < y < ∞        (6.41)

This important result is used later.
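As a purely numerical aside (not from the text), Eq (6.41) can be compared against a histogram of simulated values of Y = X². The sketch below assumes NumPy; the bin edges are arbitrary choices:

```python
# Compare an empirical density of Y = X^2, X standard normal, against the
# density of Eq (6.41): f(y) = exp(-y/2) / sqrt(2*pi*y).
import numpy as np

rng = np.random.default_rng(4)
y = rng.standard_normal(500_000) ** 2

edges = np.linspace(0.05, 4.0, 41)
counts, _ = np.histogram(y, bins=edges)
dens = counts / (y.size * (edges[1] - edges[0]))   # empirical density
mid = 0.5 * (edges[:-1] + edges[1:])
fy = np.exp(-mid / 2) / np.sqrt(2 * np.pi * mid)   # Eq (6.41)

# small (a few percent), dominated by binning error near y = 0 where the
# density blows up like y^(-1/2)
print(float(np.max(np.abs(dens - fy))))
```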

6.2.4 Random Variable Sums

Let us consider first the case where the random variable transformation
involves the sum of two independent random variables, i.e.,

Y = φ(X1, X2) = X1 + X2        (6.42)

where f1(x1) and f2(x2) are, respectively, the known pdfs of the random
variables X1 and X2. Two approaches are typically employed in finding the
desired fY(y):
• The cumulative distribution function approach;
• The characteristic function approach.

FIGURE 6.1: Region of interest, VY = {(x1, x2) : x1 + x2 ≤ y}, for computing the cdf
of the random variable Y defined as a sum of 2 independent random variables X1 and
X2 (the region of the x1-x2 plane on or below the line x1 + x2 = y).

The Cumulative Distribution Function Approach

This approach requires first obtaining the cdf FY(y) (as argued in Eq
(6.24)), from where the desired pdf is obtained by differentiation when Y is
continuous. In this case, the cdf FY(y) is obtained as:

FY(y) = P(Y ≤ y) = ∫∫_{VY} f(x1, x2) dx1 dx2        (6.43)

where f(x1, x2) is the joint pdf of X1 and X2, and, most importantly, the
region over which the double integration is being carried out, VY, is given by:

VY = {(x1, x2) : x1 + x2 ≤ y}        (6.44)

as shown in Fig 6.1. Observe from this figure that the integration may be
carried out several different ways: if we integrate first with respect to x1, the
limits go from −∞ until we reach the line, at which point x1 = y − x2; we then
integrate with respect to x2 from −∞ to ∞. In this case, Eq (6.43) becomes:

FY(y) = ∫_{−∞}^{∞} ∫_{−∞}^{y−x2} f(x1, x2) dx1 dx2        (6.45)

from where we may differentiate with respect to y to obtain:

fY(y) = ∫_{−∞}^{∞} f(y − x2, x2) dx2        (6.46)

In particular, if X1 and X2 are independent so that the joint pdf is a product
of the individual marginal pdfs, we obtain:

fY(y) = ∫_{−∞}^{∞} f1(y − x2) f2(x2) dx2        (6.47)

If, instead, the integration in Eq (6.43) had been done first with respect to x2
and then with respect to x1, the resulting differentiation would have resulted
in the alternative, and entirely equivalent, expression:

fY(y) = ∫_{−∞}^{∞} f2(y − x1) f1(x1) dx1        (6.48)

Integrals of this nature are known as convolutions of the functions f1(x1) and
f2(x2) and this is as far as we can go with a general discussion.
Thus, we have the general result that the pdf of the random variable
Y obtained as a sum of two independent random variables X1 and X2 is a
convolution of the two contributing pdfs f1(x1) and f2(x2) as shown in Eqs
(6.47) and (6.48).
Let us illustrate this with a classic example.
Example 6.4 THE SUM OF TWO EXPONENTIAL RANDOM VARIABLES
Given two stochastically independent random variables X1 and X2 with
pdfs:

f1(x1) = (1/β) e^{−x1/β}; 0 < x1 < ∞        (6.49)

f2(x2) = (1/β) e^{−x2/β}; 0 < x2 < ∞        (6.50)

Determine the pdf of the random variable Y = X1 + X2.

Solution:
In this case, the required pdf is obtained from the convolution:

fY(y) = (1/β²) ∫ e^{−(y−x2)/β} e^{−x2/β} dx2; 0 < y < ∞        (6.51)

However, because x2 is non-negative, as x1 = y − x2 must also be, the
limits on the integral have to be restricted to go from x2 = 0 to x2 = y;
so that:

fY(y) = (1/β²) ∫₀^y e^{−(y−x2)/β} e^{−x2/β} dx2; 0 < y < ∞        (6.52)

Upon carrying out the indicated integral, we obtain the final result:

fY(y) = (1/β²) y e^{−y/β}; 0 < y < ∞        (6.53)

Observe that the result presented above for the sum of two random variables extends directly to the sum of more than two random variables by successive additions. However, this procedure becomes rapidly more tedious as we
must carry out repeated convolution integrals over increasingly more complex
regions.
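Before moving on, it is instructive to see the convolution integral evaluated numerically. The following is a purely illustrative sketch (not from the text) that assumes NumPy; the grid spacing dx and the choice β = 1 are arbitrary:

```python
# A discretized version of the convolution in Eq (6.52): gridding the common
# exponential pdf and convolving numerically reproduces
# fY(y) = y * exp(-y/beta) / beta^2 of Eq (6.53).
import numpy as np

beta, dx = 1.0, 0.01
x = np.arange(0.0, 20.0, dx)
f = np.exp(-x / beta) / beta              # pdf shared by X1 and X2

fy = np.convolve(f, f)[: x.size] * dx     # numerical convolution on the grid
exact = x * np.exp(-x / beta) / beta**2   # Eq (6.53)

print(float(np.max(np.abs(fy - exact))))  # small, on the order of dx
```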


The Characteristic Function Approach

It is far more convenient to employ the characteristic function to determine the pdf of random variable sums, continuous or discrete. The pertinent
result is from a property discussed earlier in Chapter 4: for independent random variables X1 and X2 with respective characteristic functions φX1(t) and
φX2(t), the characteristic function of their sum Y = X1 + X2 is given by:

φY(t) = φX1(t) φX2(t)        (6.54)

In general, for n independent random variables Xi; i = 1, 2, . . . , n, each with
respective characteristic functions, φXi(t), if

Y = X1 + X2 + . . . + Xn        (6.55)

then

φY(t) = φX1(t) φX2(t) ⋯ φXn(t)        (6.56)

The utility of this result lies in the fact that φY(t) is easily obtained from each
contributing φXi(t); the desired fY(y) is then recovered from φY(t) either by
inspection (when this is obvious), or else by the inversion formula presented
in Chapter 4.
Let us illustrate this with the same example used above.
Example 6.5 THE SUM OF TWO EXPONENTIAL RANDOM VARIABLES REVISITED
Using characteristic functions, determine the pdf of the random variable
Y = X1 + X2, where the pdfs of the two stochastically independent random variables X1 and X2 are as given in Example 6.4 above and their
characteristic functions are given as:

φX1(t) = φX2(t) = 1/(1 − jβt)        (6.57)

Solution:
From Eq (6.54), the required characteristic function for the sum is:

φY(t) = 1/(1 − jβt)²        (6.58)

At this point, anyone familiar with specific random variable pdfs and
their characteristic functions will recognize this particular form right
away: it corresponds to the pdf of a gamma random variable, specifically γ(2, β),
as Chapter 9 shows. However, since we have not yet introduced these
important random variables, their pdfs and characteristic functions (see
Chapter 9), we therefore do not expect the reader to be able to deduce
the pdf corresponding to φY(t) above by inspection. In this case we can
invoke the inversion formula of Chapter 4 to obtain:

fY(y) = (1/2π) ∫_{−∞}^{∞} e^{−jyt} φY(t) dt = (1/2π) ∫_{−∞}^{∞} e^{−jyt}/(1 − jβt)² dt        (6.59)

Upon carrying out the indicated integral, we obtain the final result:

fY(y) = (1/β²) y e^{−y/β}; 0 < y < ∞        (6.60)

In general, it is not necessary to carry out the inversion integration explicitly
once one becomes familiar with characteristic functions of various pdfs. (To
engineers familiar with the application of Laplace transforms to the solution of
ordinary differential equations, this is identical to how tables of inverse transforms have eliminated the need for explicitly carrying out Laplace inversions.)
This point is illustrated in the next "anticipatory" example (and in subsequent
chapters).
Example 6.6 REPRODUCTIVE PROPERTY OF GAMMA
RANDOM VARIABLE
A random variable, X, with the following pdf

f(x) = (1/(β^α Γ(α))) e^{−x/β} x^{α−1}; 0 < x < ∞        (6.61)

is identified in Chapter 9 as a gamma random variable with parameters
α and β. Its characteristic function is:

φX(t) = 1/(1 − jβt)^α        (6.62)

Find the pdf of the random variable Y defined as the sum of n
independent such random variables, Xi, each with different parameters
αi but with the same parameter β.

Solution:
The desired transformation is

Y = Σ_{i=1}^{n} Xi        (6.63)

and from the given individual characteristic functions for each Xi, we
obtain the required characteristic function for the sum Y as:

φY(t) = Π_{i=1}^{n} φXi(t) = 1/(1 − jβt)^{α*}        (6.64)

where α* = Σ_{i=1}^{n} αi. Now, by comparing Eq (6.62) with Eq (6.64), we
see immediately the important result that Y is also a gamma random
variable, with parameters α* and β. Thus, this sum of gamma random
variables begets another gamma random variable, a result generally
known as the reproductive property of the gamma random variable.
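The reproductive property is also easy to observe empirically. The following Python sketch is illustrative only (not from the text), assumes NumPy, and uses arbitrary shape parameters αi and a common β = 2:

```python
# Summing independent gamma(alpha_i, beta) variables sharing a common beta
# yields a gamma(alpha*, beta) variable, alpha* = sum of the alpha_i.
import numpy as np

rng = np.random.default_rng(5)
alphas, beta = [0.5, 1.5, 2.0], 2.0
y = sum(rng.gamma(shape=a, scale=beta, size=100_000) for a in alphas)

a_star = sum(alphas)                 # alpha* = 4.0
print(y.mean(), a_star * beta)       # gamma mean alpha* x beta = 8
print(y.var(), a_star * beta**2)     # gamma variance alpha* x beta^2 = 16
```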


6.3 Bivariate Transformations


Because of its many practical applications, it is instructive to consider first
the bivariate case before taking on the full multivariate problem. In this case
we are concerned with determining the joint pdf fY(y) for the 2-dimensional
random variable Y = (Y1, Y2) obtained from the 2-dimensional random variable X = (X1, X2) with the known joint pdf fX(x), via the following bivariate
transformation:

Y1 = φ1(X1, X2)
Y2 = φ2(X1, X2)        (6.65)

written more compactly as:

Y = Φ(X)        (6.66)

As in the single variable case, we consider first the case where these functions
are continuous and collectively define a one-to-one transformation that maps
the two-dimensional space VX in the x1-x2 plane to the two-dimensional
space VY in the y1-y2 plane. The inverse bivariate transformation is given
by:

X1 = ψ1(Y1, Y2)
X2 = ψ2(Y1, Y2)        (6.67)

or, more compactly,

X = Ψ(Y)        (6.68)

The 2 × 2 determinant given by:

J = | ∂x1/∂y1   ∂x1/∂y2 |
    | ∂x2/∂y1   ∂x2/∂y2 |        (6.69)

is the Jacobian of this bivariate inverse transformation, and so long as J does
not vanish identically in VY, it can be shown that the desired joint pdf for Y
is given by:

fY(y) = fX[Ψ(y)]|J|; y ∈ VY        (6.70)

where the similarity with Eq (6.27) should not be lost on the reader. The
following is a classic example typically used to illustrate this result.
Example 6.7 RELATING GAMMA AND BETA RANDOM
VARIABLES
Given two stochastically independent random variables X1 and X2 with
pdfs:

f1(x1) = (1/Γ(α)) x1^{α−1} e^{−x1}; 0 < x1 < ∞        (6.71)

f2(x2) = (1/Γ(β)) x2^{β−1} e^{−x2}; 0 < x2 < ∞        (6.72)

Determine both the joint and the marginal pdfs for the two random
variables Y1 and Y2 obtained via the transformation:

Y1 = X1 + X2
Y2 = X1/(X1 + X2)        (6.73)

Solution:
First, by independence, the joint pdf for X1 and X2 is:

fX(x1, x2) = (1/(Γ(α)Γ(β))) x1^{α−1} x2^{β−1} e^{−x1} e^{−x2}; 0 < x1 < ∞; 0 < x2 < ∞        (6.74)

Next, observe that the transformation in Eq (6.73) is a one-to-one
mapping of VX, the positive quadrant of the x1-x2 plane, onto
VY = {(y1, y2); 0 < y1 < ∞, 0 < y2 < 1}; the inverse transformation is
given by:

x1 = y1 y2
x2 = y1(1 − y2)        (6.75)

and the Jacobian is obtained as

J = | y2        y1  |
    | 1 − y2   −y1  |  = −y1        (6.76)

It vanishes at the point y1 = 0, but this is a point of probability measure
0 that can be safely excluded from the space VY. Thus, from Eq (6.70),
the joint pdf for Y1 and Y2 is:

fY(y1, y2) = { (1/(Γ(α)Γ(β))) (y1 y2)^{α−1} [y1(1 − y2)]^{β−1} e^{−y1} y1;  0 < y1 < ∞; 0 < y2 < 1
             { 0;                                                          otherwise        (6.77)

This may be rearranged to give:

fY(y1, y2) = { (1/(Γ(α)Γ(β))) y2^{α−1} (1 − y2)^{β−1} e^{−y1} y1^{α+β−1};  0 < y1 < ∞; 0 < y2 < 1
             { 0;                                                          otherwise        (6.78)

an equation which, apart from the constant, factors out into separate
and distinct functions of y1 and y2, indicating that the random variables
Y1 and Y2 are independent.
By definition, the marginal pdf for Y2 is obtained by integrating out
y1 in Eq (6.78) to obtain

f2(y2) = (1/(Γ(α)Γ(β))) y2^{α−1} (1 − y2)^{β−1} ∫₀^∞ e^{−y1} y1^{α+β−1} dy1        (6.79)

Recognizing the integral as the gamma function, i.e.,

Γ(a) = ∫₀^∞ e^{−y} y^{a−1} dy        (6.80)

we obtain:

f2(y2) = (Γ(α + β)/(Γ(α)Γ(β))) y2^{α−1} (1 − y2)^{β−1}; 0 < y2 < 1        (6.81)

Since, by independence,

fY(y1, y2) = f1(y1) f2(y2)        (6.82)

it follows from Eqs (6.78), (6.81), and (6.82) that the
marginal pdf for Y1 is given by:

f1(y1) = (1/Γ(α + β)) e^{−y1} y1^{α+β−1}; 0 < y1 < ∞        (6.83)

Again, we refer to these results later in Part III.
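As a purely illustrative aside (not from the text), all three conclusions of Example 6.7 show up clearly in simulation. The Python sketch below assumes NumPy, with arbitrary parameter choices α = 2, β = 3:

```python
# With X1 ~ gamma(alpha, 1) and X2 ~ gamma(beta, 1) independent:
# Y1 = X1 + X2 is gamma(alpha + beta, 1), Y2 = X1/(X1 + X2) is
# beta(alpha, beta), and, remarkably, Y1 and Y2 are independent.
import numpy as np

rng = np.random.default_rng(6)
a, b, n = 2.0, 3.0, 200_000
x1 = rng.gamma(a, 1.0, n)
x2 = rng.gamma(b, 1.0, n)
y1, y2 = x1 + x2, x1 / (x1 + x2)

print(y1.mean(), a + b)            # gamma(a+b, 1) mean: 5
print(y2.mean(), a / (a + b))      # beta(a, b) mean: 0.4
print(np.corrcoef(y1, y2)[0, 1])   # approximately 0, per independence
```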

6.4 General Multivariate Transformations

As introduced briefly earlier, the general multivariate case is concerned
with determining the joint pdf fY(y) for the m-dimensional random variable Y = (Y1, Y2, . . . , Ym) arising from a transformation of the n-dimensional
random variable X = (X1, X2, . . . , Xn) according to:

Y1 = φ1(X1, X2, . . . , Xn);
Y2 = φ2(X1, X2, . . . , Xn);
...
Ym = φm(X1, X2, . . . , Xn)

given the joint pdf fX(x).

6.4.1 Square Transformations

When n = m and the transformation is one-to-one, the inverse transformation:

x1 = ψ1(y1, y2, . . . , yn);
x2 = ψ2(y1, y2, . . . , yn);
...
xn = ψn(y1, y2, . . . , yn)        (6.84)

or, more compactly,

X = Ψ(Y)        (6.85)

yields the square n × n determinant:

J = | ∂x1/∂y1   ∂x1/∂y2   ⋯   ∂x1/∂yn |
    | ∂x2/∂y1   ∂x2/∂y2   ⋯   ∂x2/∂yn |
    |    ⋮          ⋮      ⋱      ⋮    |
    | ∂xn/∂y1   ∂xn/∂y2   ⋯   ∂xn/∂yn |        (6.86)

And now, as in the bivariate case, it can be shown that for a J that is non-zero
anywhere in VY, the desired joint pdf for Y is given by:

fY(y) = fX[Ψ(y)]|J|; y ∈ VY        (6.87)

an expression that is identical in every way to Eq (6.70) except for the dimensionality, and similar to the single variate result in Eq (6.23). Thus for
the square transformation in which n = m, the required result is a direct
generalization of the bivariate result, identical in structure, differing only in
dimensionality.

6.4.2 Non-Square Transformations

The case with n ≠ m presents two different problems:

1. n < m; the "overdefined" transformation in which there are more new
variables Y than the original X variables;
2. n > m; the "underdefined" transformation in which there are fewer new
variables Y than the original X variables.

In the overdefined problem, it should be easy to see that there can be
no exact inverse transformation except under some special, very restrictive
circumstances, in which the extra (m − n) Y variables are merely redundant
and can be expressed as functions of the other n. This problem is therefore of
no practical interest: the general case has no exact solution; the special case
reverts to the already solved square n × n problem.
With the underdefined problem, the more common of the two, the
strategy is to augment the m equations with an additional (n − m), usually
simple, variable transformations chosen such that an inverse transformation
exists. Having thus "squared" the problem, the result in Eq (6.87) may then
be applied to obtain a joint pdf for the augmented Y variables. The final step
involves integrating out the extraneous variables. This is best illustrated with
some examples.
Example 6.8 SUM OF TWO STANDARD NORMAL RANDOM VARIABLES
Given two stochastically independent random variables X1 and X2 with
pdfs:

f1(x1) = (1/√(2π)) e^{−x1²/2}; −∞ < x1 < ∞        (6.88)

f2(x2) = (1/√(2π)) e^{−x2²/2}; −∞ < x2 < ∞        (6.89)

determine the pdf of the random variable Y obtained from their sum,

Y = X1 + X2        (6.90)

Solution:
First, observe that even though this is a sum, so that we could invoke
earlier results to handle this problem, Eq (6.90) is also an underdetermined transformation from two dimensions in X1 and X2 to one in Y.
To "square" the transformation, let the variable in Eq (6.90) now be Y1
and add another one, say Y2 = X1 − X2, to give:

Y1 = X1 + X2
Y2 = X1 − X2        (6.91)

which is now square, and one-to-one. The inverse transformation is:

x1 = (y1 + y2)/2
x2 = (y1 − y2)/2        (6.92)

and a Jacobian, J = −1/2.
By independence, the joint pdf for X1 and X2 is given by:

fX(x1, x2) = (1/2π) e^{−(x1² + x2²)/2}; −∞ < x1 < ∞; −∞ < x2 < ∞        (6.93)

and from Eq (6.87), the joint pdf for Y1 and Y2 is obtained as:

fY(y1, y2) = (1/2π)(1/2) e^{−[(y1 + y2)² + (y1 − y2)²]/8}; −∞ < y1 < ∞; −∞ < y2 < ∞        (6.94)

which rearranges easily to:

fY(y1, y2) = (1/4π) e^{−y1²/4} e^{−y2²/4}; −∞ < y1 < ∞; −∞ < y2 < ∞        (6.95)

And now, either by inspection (this is a product of two clearly identifiable, separate and distinct functions of y1 and y2, indicating that the
two variables are independent), or by integrating out y2 in Eq (6.95),
one easily obtains the required marginal pdf for Y1 as:

f1(y1) = (1/(2√π)) e^{−y1²/4}; −∞ < y1 < ∞        (6.96)
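A quick numerical check (purely illustrative, not from the text; assumes NumPy) confirms that Eq (6.96) is the N(0, 2) density, i.e., the sum of two independent standard normals has variance 2:

```python
# Check that Y1 = X1 + X2 has variance 2 and that Eq (6.96) matches an
# empirical density estimate.
import numpy as np

rng = np.random.default_rng(7)
y1 = rng.standard_normal(500_000) + rng.standard_normal(500_000)

print(y1.var())                                 # approximately 2
# Eq (6.96) at y1 = 1, versus a narrow-window empirical density there:
print(np.exp(-1.0 / 4) / (2 * np.sqrt(np.pi)))  # 0.2197
print(np.mean(np.abs(y1 - 1.0) < 0.05) / 0.1)   # a similar value
```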

In the next example we derive one more important result and illustrate the
seriousness of the requirement that the Jacobian of the inverse transformation
not vanish anywhere in VY .


Example 6.9 RATIO OF TWO STANDARD NORMAL RANDOM VARIABLES

Given two stochastically independent random variables X1 and X2 with
pdfs:

f1(x1) = (1/√(2π)) e^{−x1²/2}; −∞ < x1 < ∞        (6.97)

f2(x2) = (1/√(2π)) e^{−x2²/2}; −∞ < x2 < ∞        (6.98)

determine the pdf of the random variable Y obtained from their ratio,

Y = X1/X2        (6.99)

Solution:
Again, because this is an underdetermined transformation, we must first
augment it with another one, say Y2 = X2, to give:

Y1 = X1/X2
Y2 = X2        (6.100)

which is now square, one-to-one, and with the inverse transformation:

x1 = y1 y2
x2 = y2        (6.101)

The Jacobian,

J = | y2   y1 |
    | 0    1  |  = y2        (6.102)

vanishes at the single point y2 = 0, however; and even though this is a
point of probability measure zero, the observation is worth keeping in
mind.
From Example 6.8 above, the joint pdf for X1 and X2 is given by:

fX(x1, x2) = (1/2π) e^{−(x1² + x2²)/2}; −∞ < x1 < ∞; −∞ < x2 < ∞        (6.103)

from where we now obtain the joint pdf for Y1 and Y2 as:

fY(y1, y2) = (1/2π) |y2| e^{−(y1²y2² + y2²)/2};  −∞ < y1 < ∞;
                                                −∞ < y2 < 0; 0 < y2 < ∞        (6.104)

The careful reader will notice two things: (i) the expression for fY involves not just y2, but its absolute value |y2|; and (ii) that we have
excluded the troublesome point y2 = 0 from the space VY. These two
points are related: to the left of the point y2 = 0, |y2| = −y2; to the
right, |y2| = y2, so that these two regions must be treated differently in
evaluating the integral.
To obtain the marginal pdf for y1 we now integrate out y2 in Eq
(6.104) over the appropriate region in VY as follows:

f1(y1) = (1/2π) [ ∫_{−∞}^{0} (−y2) e^{−(y1² + 1)y2²/2} dy2 + ∫_{0}^{∞} y2 e^{−(y1² + 1)y2²/2} dy2 ]        (6.105)

which simplifies to:

f1(y1) = 1/[π(1 + y1²)]; −∞ < y1 < ∞        (6.106)

as the required pdf. It is important to note that in carrying out the
integration implied in (6.105), the nature of the absolute value function, |y2|, naturally forced us to exclude the point y2 = 0 because it
made it impossible for us to carry out the integration from −∞ to ∞
under a single integral. (Had the integral involved not |y2|, but y2, as
an instructive exercise, the reader should try to evaluate the resulting
integral from −∞ to ∞. See Exercise 6.9.)
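As a purely illustrative aside (not from the text), the heavy-tailed Cauchy result of Eq (6.106) can be observed directly in simulation; the cdf corresponding to Eq (6.106) is F(y) = 1/2 + arctan(y)/π. The sketch below assumes NumPy:

```python
# The ratio of two independent standard normals follows the Cauchy pdf of
# Eq (6.106); compare empirical and exact cdf values at a few points.
import numpy as np

rng = np.random.default_rng(8)
y = rng.standard_normal(500_000) / rng.standard_normal(500_000)

for q in (-1.0, 0.0, 1.0):
    print(q, np.mean(y <= q), 0.5 + np.arctan(q) / np.pi)
# note: y.mean() is meaningless here; the Cauchy distribution has no mean
```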

6.4.3 Non-Monotone Transformations

In general, when the multivariate transformation y = Φ(x) may be non-monotone but has a countable number of roots k, when written as the matrix
version of Eq (6.33), i.e.,

xi = Φi^{−1}(y) = Ψi(y); i = 1, 2, 3, . . . , k        (6.107)

if each inverse transformation Ψi is square, with a non-zero Jacobian Ji, then
it can be shown that:

fY(y) = Σ_{i=1}^{k} fX[Ψi(y)]|Ji|        (6.108)

which is a multivariate extension of Eq (6.35).

6.5 Summary and Conclusions

We have focussed attention in this chapter on the single problem of determining the pdf, fY(y), of a random variable Y that has been defined as
a function of another random variable, X, whose pdf fX(x) is known. As is
common with problems of such general construct, the approach used to determine the desired pdf depends on the nature of the random variable, as
well as the nature of the problem itself; in this particular case, the problem
is generally more straightforward to solve for discrete random variables
than for continuous ones. When the transformation involves random variable
sums, it is much easier to employ the method of characteristic functions, regardless of whether the random variables involved are discrete or continuous.
But beyond the special care that must be taken for continuous non-monotone
transformations, the underlying principle is the same for all cases and is fairly
straightforward.
The primary importance of this chapter lies in the fact that it provides one
of the tools (and much of the foundational results) employed routinely in deriving probability models for some of the more complex random phenomena.
We will therefore rely on much of this chapter's material in subsequent chapters, especially in Chapters 8 and 9 where we derive models for a wide variety
of specific randomly varying phenomena. As such, the reader is encouraged
to tackle a good number of the exercises and problems found at the end of
this chapter; solving these problems will make the upcoming discussions much
easier to grasp at a fundamental level.
Here are some of the main points of the chapter again.
• Given a random variable X with pdf fX(x), and the random variable
transformation, Y = φ(X), the corresponding pdf fY(y) for the random
variable Y is obtained directly from the inverse transformation, X =
ψ(Y), for the discrete random variable; for continuous random variables,
the Jacobian of the inverse transformation, J = d/dy [ψ(Y)], is required in
addition.
• When the transformation φ(X) involves sums, it is more convenient to
employ the characteristic function of X to determine fY(y).
• When the transformation φ(X) is non-monotone, fY(y) will consist of a
sum of k components, where k is the total number of roots of the inverse
transformation.
• When multivariate transformations are represented in matrix form, the
required results are matrix versions of the results obtained for single
variable transformations.

REVIEW QUESTIONS
1. State, in mathematical terms, the problem of primary interest in this chapter.
2. What are the results of this chapter useful for?
3. In single variable transformations, where Y = φ(X) is given along with fX(x),
and fY(y) is to be determined, what is the difference between the discrete case of
this problem and the continuous counterpart?
4. What is the Jacobian of the single variable inverse transformation?
5. In determining fY(y) given fX(x) and the transformation Y = φ(X), how does
one handle the case where the transformation is not strictly monotone?
6. Which two approaches were presented in this chapter for finding pdfs of random
variable sums? Which of the two is more convenient?
7. What is meant by the convolution of two functions, f1(x1) and f2(x2)?
8. Upon what property of characteristic functions is the characteristic function approach to the determination of the pdf of random variable sums based?
9. What is the Jacobian of a multivariate inverse transformation?
10. How are non-square transformations handled?

EXERCISES
6.1 The pdf of a random variable X is given as:

f(x) = p(1 − p)^{x−1}; x = 1, 2, 3, . . .        (6.109)

(i) Obtain the pdf for the random variable Y defined as

Y = 1/X        (6.110)

(ii) Given that E(X) = 1/p, obtain E(Y) and compare it to E(X).
6.2 Given the pdf shown in Eq (6.18) for the transformed variable, Y, i.e.,

fY(y) = e^{−(2^β)} y^β/(log₂ y)!; y = 1, 2, 4, 8, . . .

show that E(Y) = e^λ and hence confirm Eq (6.21).


6.3 Consider the random variable, X, with the following pdf:

fX(x) = { (1/β) e^{−x/β};  0 < x < ∞
        { 0;               elsewhere        (6.111)

Determine the pdf for the random variable Y obtained via the transformation

Y = (1/β) e^{−X/β}        (6.112)

Compare this result to the one obtained in Example 6.2 in the text.
6.4 Given a random variable, X, with the following pdf:

fX(x) = { (1/2)(x + 1);  −1 < x < 1
        { 0;             elsewhere        (6.113)

(i) Determine the pdf for the random variable Y obtained via the transformation

Y = X²        (6.114)

(ii) Determine E(X) and E(Y).


6.5 Given the pdf for two stochastically independent random variables X1 and X2
as

f(xi) = e^{−λi} λi^{xi}/xi!; xi = 0, 1, 2, . . .        (6.115)

for i = 1, 2, and given the corresponding characteristic function as:

φXi(t) = e^{[λi(e^{jt} − 1)]}        (6.116)

(i) Obtain the pdf fY(y) of the random variable Y defined as the sum of these two
random variables, i.e.,

Y = X1 + X2

(ii) Extend the result to a sum of n such random variables, i.e.,

Y = X1 + X2 + ⋯ + Xn

with each distribution given in Eq (6.115). Hence, establish that the random variable
X also possesses the reproductive property illustrated in Example 6.6 in the text.
(iii) Obtain the pdf fZ(z) of the random variable Z defined as the average of n such
random variables, i.e.,

Z = (1/n)(X1 + X2 + ⋯ + Xn)

6.6 In Example 6.3 in the text, it was established that if the random variable X has
the following pdf:

fX(x) = (1/√(2π)) e^{−x²/2}; −∞ < x < ∞        (6.117)

then the pdf for the random variable Y = X² is:

fY(y) = (1/√(2π)) e^{−y/2} y^{−1/2}; 0 < y < ∞        (6.118)

Given that the characteristic function of this random variable Y is:

φY(t) = 1/(1 − j2t)^{1/2}        (6.119)

by re-writing √(2π) as 2^{1/2}√π, and √π as Γ(1/2) (or otherwise), obtain the pdf fZ(z) of
the random variable defined as:

Z = X1² + X2² + ⋯ + Xr²        (6.120)

where the random variables, Xi, are all mutually stochastically independent, and
each has the distribution shown in Eq (6.117).


6.7 Revisit Example 6.8 in the text, but this time, instead of Eq (6.91), use the
following alternative squaring transformation,

Y2 = X2        (6.121)

You should obtain the same result.

6.8 Revisit Example 6.9 in the text, but this time, instead of Eq (6.100), use the
following alternative squaring transformation,

Y2 = X1        (6.122)

Which augmenting squaring transformation leads to an easier problem: this one, or
the one in Eq (6.100) used in Example 6.9?

6.9 Revisit Eq (6.104); this time, replace |y2| with y2, and integrate the resulting
joint pdf fY(y1, y2) with respect to y2 over the entire range −∞ < y2 < ∞. Compare
your result with Eq (6.106) and comment on the importance of making sure to use
the absolute value of the Jacobian of the inverse transformation in deriving
pdfs of continuous transformed variables.

APPLICATION PROBLEMS
6.10 In a commercial process for manufacturing the extruded polymer film Mylar®,
each roll of the product is characterized in terms of its "gage," the film thickness,
X. For a series of rolls that meet the desired mean thickness target of 350 μm, the
thickness of a section of film sampled randomly from a particular roll has the pdf

f(x) = (1/(σi√(2π))) exp[−(x − 350)²/(2σi²)]        (6.123)

where σi² is the variance associated with the average thickness for each roll, i. In
reality, the product property that is of importance to the end-user is not so much
the film thickness, or even the average film thickness, but a roll-to-roll consistency,
quantified in terms of a relative thickness variability measure defined as

Y = [(X − 350)/σi]²        (6.124)

Obtain the pdf fY(y) that is used to characterize the roll-to-roll variability observed
in this product quality variable.
6.11 Consider an experimental, electronically controlled, mechanical tennis ball
launcher designed to be used to train tennis players. One such machine is positioned at a fixed launch point, L, located a distance of 1 m from a wall as shown in
Fig 6.2. The launch mechanism is programmed to launch the ball in an essentially
straight line, at an angle Θ that varies randomly according to the pdf:

f(θ) = { c;  −π/2 < θ < π/2
       { 0;  elsewhere        (6.125)

FIGURE 6.2: Schematic diagram of the tennis ball launcher of Problem 6.11

where c is a constant. The point of impact on the wall, at a distance y from the
center, will therefore be a random variable whose specific value depends on θ. First
show that c = 1/π, and then obtain fY(y).
6.12 The distribution of residence times in a single continuous stirred tank reactor
(CSTR), whose volume is V liters and through which reactants flow at rate F
liters/hr, was established in Chapter 2 as the pdf:

f(x) = (1/τ) e^{−x/τ}; 0 < x < ∞        (6.126)

where τ = V/F.
(i) Find the pdf fY(y) of the residence time, Y, in a reactor that is 5 times as large,
given that in this case,

Y = 5X        (6.127)

(ii) Find the pdf fZ(z) of the residence time, Z, in an ensemble of 5 reactors in
series, given that:

Z = X1 + X2 + ⋯ + X5        (6.128)

where each reactor's pdf is as given in Eq (6.126), with parameter, τi; i = 1, 2, . . . , 5.
(Hint: Use results of Examples 6.5 and 6.6).
(iii) Show that even if τ1 = τ2 = ⋯ = τ5 = τ for the ensemble of 5 reactors in series,
fZ(z) will still not be the same as fY(y).
6.13 The total number of flaws (dents, scratches, paint blisters, etc.) found on the
various sets of doors installed on brand new minivans in an assembly plant is a
random variable with the pdf:

f(x) = e^{−λ} λ^x/x!; x = 0, 1, 2, . . .        (6.129)

The value of the pdf parameter, λ, depends on the door in question as follows:
λ = 0.5 for the driver and front passenger doors; λ = 0.75 for the two bigger mid-section passenger doors, and λ = 1.0 for the fifth, rear trunk/tailgate door. If the
total number of flaws per completely assembled minivan is Y, obtain the pdf fY(y)
and from it, compute the probability of assembling a minivan with more than a total
number of 2 flaws on all its doors.
6.14 Let the fluorescence signals obtained from a test spot and the reference spot
on a microarray be represented as random variables X1 and X2 respectively. Within
reason, these variables can be assumed to be independent, with the following pdfs:

f1(x1) = (1/Γ(α)) x1^{α−1} e^{−x1}; 0 < x1 < ∞        (6.130)

f2(x2) = (1/Γ(β)) x2^{β−1} e^{−x2}; 0 < x2 < ∞        (6.131)

It is customary to analyze such microarray data in terms of the "fold change" ratio,

Y = X1/X2        (6.132)

indicative of the fold increase (or decrease) in the signal intensity between test
and reference conditions. Show that the pdf of Y is given by:

f(y) = [Γ(α + β)/(Γ(α)Γ(β))] y^{α−1}/(1 + y)^{α+β}; y > 0; α > 0; β > 0        (6.133)

6.15 The following expression is used to calibrate a thermocouple whose natural
output is V volts; X is the corresponding temperature, in degrees Celsius:

X = 0.4V + 100        (6.134)

in a range from 50 to 500 volts and 100 to 250 °C. If the voltage output is subject
to random variability around the true value V̄, such that

f(v) = (1/(σV√(2π))) exp[−(v − V̄)²/(2σV²)]        (6.135)

where the mean (i.e., expected) value for voltage, E(V) = V̄, and the variance,
Var(V) = σV²,
(i) Show that:

E(X) = 0.4V̄ + 100        (6.136)
Var(X) = 0.16σV²        (6.137)

(ii) In terms of E(X) = μX and Var(X) = σX², obtain an expression for the pdf
fX(x) representing the variability propagated to the temperature values.

6.16 "Propagation-of-errors" studies are concerned with determining how the errors from one variable are transmitted to another when the two variables are related
according to a known expression. When the relationships are linear, it is often possible to obtain complete probability distribution functions for the dependent variable
given the pdf for the independent variable (see Problem 6.15). When the relationships are nonlinear, closed form expressions are not always possible; in terms of
general results, the best one can hope for are approximate expressions for the expected value and variance of the dependent variable, typically in a local region,
upon linearizing the nonlinear expression. The following is an application of these
principles.
One of the best known laws of bioenergetics, Kleiber's law, states that the Resting Energy Expenditure of an animal, Q0 (essentially the animal's metabolic rate,
in kcal/day), is proportional to M^{3/4}, where M is the animal's mass (in kg). Specifically for mature homeotherms, the expression is:

Q0 = 70 M^{3/4}        (6.138)

Consider a particular population of homeotherms for which the variability in mass
is characterized by the random variable M with the distribution:

f(m) = (1/(σM√(2π))) exp[−(m − μM)²/(2σM²)]        (6.139)

with a mean value, μM, and variance, σM². The pdf representing the corresponding
variation in Q0 can be obtained using the usual transformation techniques, but the
result does not have a convenient, recognizable, closed form. However, it is possible
to obtain approximate values for E(Q0) and Var(Q0) in the neighborhood around
the mean mass, μM, and the corresponding metabolic rate, Q̄0.
Given that a first-order (linear) Taylor series approximation of the expression in
Eq (6.138) is defined as:

Q0 ≈ Q̄0 + (∂Q0/∂M)|_{M=μM} (M − μM)        (6.140)

first obtain the approximate linearized expression for Q0 when μM = 75 kg, and
then determine E(Q0) and Var(Q0) for a population with σM = 12.5 kg under
these conditions.


Chapter 7
Application Case Studies I:
Probability

7.1 Introduction ....................................................... 198
7.2 Mendel and Heredity ................................................ 199
    7.2.1 Background and Problem Definition ............................ 199
    7.2.2 Single Trait Experiments and Results ......................... 200
    7.2.3 Single trait analysis ........................................ 201
          The First Generation Traits ................................. 203
          Probability and The Second Generation Traits ................ 204
    7.2.4 Multiple Traits and Independence ............................. 205
          Pairwise Experiments ........................................ 205
    7.2.5 Subsequent Experiments and Conclusions ....................... 208
7.3 World War II Warship Tactical Response Under Attack ................ 209
    7.3.1 Background and Problem Definition ............................ 209
    7.3.2 Approach and Results ......................................... 209
    7.3.3 Final Comments ............................................... 212
7.4 Summary and Conclusions ............................................ 212

But to us, probability is the very guide of life.
Bishop Butler (1692-1752)

To many scientists and engineers, a first encounter with the theory of probability in its modern axiomatic form often leaves the impression of a subject
matter so abstract and esoteric in nature as to be entirely suited to nothing
but the most contrived applications. Nothing could be further from the truth.
In reality, the application of probability theory features prominently in many
modern fields of study: from finance, economics, sociology and psychology to
various branches of physics, chemistry, biology and engineering, providing a
perfect illustration of the aphorism that "there is nothing so practical as a
good theory."
This chapter showcases the applicability of probability theory through
two specific case studies involving real-world problems whose practical importance can hardly be overstated. The first, Mendel's deduction of the laws
of heredity (the basis for the modern science of genetics), shows how Mendel
employed probability (and the concept of stochastic independence) to establish the principles underlying a phenomenon which, until then, was considered
essentially unpredictable and hence not susceptible to systematic analysis.
The second is from a now-declassified US Navy study during World War
II and involves decision-making in the face of uncertainty, using past data. It
illustrates the application of frequency-of-occurrence information, viewed as
approximate total and conditional probabilities, to solve an important tactical
military problem.

7.1 Introduction

The elegant, well-established and fruitful tree we now see as modern probability theory has roots that reach back to 16th and 17th century gamblers
and the very real, and very practical, need for reliable solutions to numerous
gambling problems. Referring to these gambling problems by the somewhat
less morally questionable term "problems on games of chance," some of the
most famous and most gifted mathematicians of the day devoted considerable
energy first to solving specific problems (most notably the Italian mathematician, Cardano, in the 16th century), and later to developing the foundational
basis for systematic mathematical analysis (most notably the Dutch scientist,
Huygens, and the French mathematicians, Pascal and Fermat, in the 17th
century). However, despite subsequent major contributions in the 18th century from the likes of Jakob Bernoulli (1654-1705) and Abraham de Moivre
(1667-1754), it was not until the 19th century, with the publication in 1812
of Laplace's book, Théorie Analytique des Probabilités, that probability theory moved beyond the mathematical analysis of games of chance to become
recognized as an important branch of mathematics in its own right, one with
broader applications to other scientific and practical problems such as statistical mechanics and characterization of experimental error.
The final step in the ascent of probability theory was taken in the 20th
century with the development of the axiomatic approach. First expounded in
Kolmogorov's celebrated 1933 monograph (the English translation, Foundations of Probability Theory, was published in 1950), this approach, once and
for all, provided a rigorous and mathematically precise definition of probability that sufficiently generalized the theory and formalized its applicability to a
wide variety of random phenomena. Paradoxically, that probability theory in
its current modern form enjoys applications in such diverse areas as actuarial
science, economics, finance; genetics, medicine, psychology; engineering, manufacturing, and strategic military decisions, is attributable to Kolmogorov's
rigorous theoretical and precise formalism. Thus, even though it would be considered overly hyperbolic today (too much "embroidery"), placed in its proper
historic context, the following statement in Laplace's book is essentially true:

It is remarkable that a science which began with the consideration
of games of chance should have become the most important object
of human knowledge.

The two example case studies presented here illustrate just how important
probability theory and its application have become since the time of Laplace.

7.2 Mendel and Heredity

Heredity, how traits are transmitted from parent to offspring, has always fascinated and puzzled mankind. This phenomenon, central to the propagation of life itself, with serious implications for the health, viability and survival of living organisms, remained mysterious and poorly understood until the ground-breaking work by Gregor Mendel (1822-1884). Mendel, an Augustinian monk, arrived at his amazing conclusions by studying variations in pea plants, using the garden of his monastery as his laboratory. As stated in an English translation of the original paper published in 1866¹: ". . . The object of the experiment was to observe these variations in the case of each pair of differentiating characters, and to deduce the law according to which they appear in successive generations." The experiments involved the careful cultivation and testing of nearly 30,000 plants, lasting almost 8 years, from 1856 to 1863. The experimental results, and their subsequent probabilistic analysis, paved the way for the modern science of genetics, but it was not recognized as such right away. The work, and its monumental import, languished in obscurity until the early 20th century when it was rediscovered and finally accorded its well-deserved recognition.

What follows is an abbreviated discussion of the essential elements of Mendel's work and the probabilistic reasoning that led to the elucidation of the mechanisms behind heredity and genetics.

7.2.1 Background and Problem Definition

"The value and utility of any experiment are determined by the fitness of the material to the purpose for which it is used, and thus in the case before us it cannot be immaterial what plants are subjected to experiment and in what manner such experiment is conducted."
So wrote Mendel in motivating his choice of pea plants as the subject of
his now-famous set of experiments. The two primary factors that made pea
plants an attractive choice are:
1. Relatively fast rates of reproduction; and
2. Availability of many varieties, each producing definite and easy-to-characterize traits.

¹Mendel, Gregor, 1866. "Versuche über Pflanzenhybriden," Verhandlungen des naturforschenden Vereines in Brünn, Bd. IV für das Jahr 1865, Abhandlungen, 3-47; first translated into English by William Bateson in 1901 as "Experiments in Plant Hybridization"; see http://www.netspace.org./MendelWeb/.
Before enumerating the specific traits that Mendel studied, it is important to note, in hindsight, that the choice of peas was remarkably fortuitous because the genetic structure of peas is now known to be relatively simple. A more complex genetic structure could have further obscured the fundamental principles with additional distracting details; and the deductive analysis required to derive general laws governing heredity and genetics from this specific set of results would have been far more difficult.
By tracking the following seven specific trait characteristics (with the variations manifested in each trait indicated in square brackets):

1. Seed Shape [Round/Wrinkled];
2. Seed Albumen Color [Yellow/Green];
3. Seed Coat (same as Flower) Color [Reddish/White];
4. Pod Form (or Texture) [Inflated/Constricted];
5. Unripe Pod (or Stalk) Color [Green/Yellow];
6. Flower Position (on the stem) [Axial/Terminal];
7. Stem Length [Tall/Dwarfed];
Mendel sought to answer the following specific questions:

1. How are these seven traits transmitted from parents to offspring, generation after generation?
2. Are there discernible patterns, and can they be generalized?

Our discussion here is limited to two out of the many sets of experiments in the original study:

1. Single trait experiments, in which individual traits and how they are transmitted from parent to offspring in subsequent generations are tracked one at a time;
2. Multiple trait experiments, in which several traits and their transmission are tracked simultaneously, specifically focusing on pairwise experiments involving two traits.

7.2.2 Single Trait Experiments and Results

In this set of experiments Mendel focused on individual traits (such as seed shape) one at a time, and tracked how they are transmitted from one generation to the next. In all, seven different sets of experiments were performed, each set devoted to a single trait; for the sake of the current presentation, we will use seed shape as a representative example.
The experimental procedure and results are summarized as follows:

1. Initialization: Generation of pure traits parents (P). This was done by fertilizing pollen from round-seed-shaped plants for many generations until they stabilized and produced only round seed offspring consistently. The same procedure was repeated for wrinkled seeds. (The other 6 traits received the same treatment for each pair of associated trait variations.)

2. First generation hybrids (f1): Cross-fertilization of pure traits. Pure parent round seed plants were cross-fertilized with wrinkled ones to produce first generation hybrids. Result: from parents with separate and distinct pure traits, every single seed in the first generation of hybrids was round; not a single one was wrinkled! (Identical corresponding results were obtained for the other 6 traits, with one variant manifesting preferentially over the complementary pair.)

3. Second generation hybrids (f2) from the first generation. The first generation plants were cross-fertilized among themselves. Result: approximately 3/4 of the seeds were round, with 1/4 wrinkled. (Identical corresponding results were obtained for the other 6 traits: the variant that was exclusively manifested in the first generation hybrids retained its preferential status, and did so in the same approximate 3:1 ratio.)

The entire collection of results from all seven sets of experiments is summarized in Table 7.1.

7.2.3 Single trait analysis

To make sense of these surprising and somewhat counterintuitive results required convincing answers to the following fundamental questions raised by the data:

1. First generation visible trait uniformity: How does the cross-fertilization of round seed plants with wrinkled seed ones produce a first generation of hybrid seeds that are all round? Or, in general, how does the cross-fertilization of two very different, pure, and distinct traits produce first generation hybrid offspring that all, without exception, preferentially display only one of these parental traits?
2. Second generation visible trait variety: How does the cross-fertilization of first generation plants (with only one trait uniformly on display) produce second generation plants displaying a variety entirely absent in the homogeneous first generation? Or, alternatively, how did the missing trait in the first generation reappear in the second?

3. Second generation visible trait composition: What law governs the apparently constant numerical ratio with which the original parental traits appear in the second generation? What is the "true" theoretical value of this numerical ratio?

TABLE 7.1: Summary of Mendel's single trait experiment results

                                       1st        |------------------- 2nd Generation -------------------|
Characteristics                        Generation   Total    Dominant   Recessive   Prop. (D)   Prop. (r)   D:r Ratio
Seed Shape (Round/Wrinkled)            Round         7,324    5,474      1,850       0.747       0.253       2.96:1
Seed Alb. Color (Yellow/Green)         Yellow        8,023    6,022      2,001       0.751       0.249       3.01:1
Seed Coat/Flower (Reddish/White)       Reddish         929      705        224       0.759       0.241       3.15:1
Pod Form (Inflated/Constricted)        Inflated      1,181      882        299       0.747       0.253       2.95:1
Unripe Pod Color (Green/Yellow)        Green           580      428        152       0.738       0.262       2.82:1
Flower Position (Axial/Terminal)       Axial           858      651        207       0.759       0.241       3.14:1
Stem Length (Tall/Dwarfed)             Tall          1,064      787        277       0.740       0.260       2.84:1
Totals                                              19,959   14,949      5,010       0.749       0.251       2.98:1

Here "Prop. (D)" and "Prop. (r)" denote, respectively, the proportion Dominant and the proportion Recessive in the second generation.
To answer these questions and elucidate the principles governing single trait selection, Mendel developed the following concepts and demonstrated that they were consistent with his experimental data:

1. The concept of Hereditary Factors:
   • The inheritance of each trait is determined by "units" or "factors" (now called genes); these factors do not amalgamate, but are passed on to offspring intact and unchanged;
   • An individual has two sets of such units or factors, inheriting one set from each parent; thus each parent transmits only half of its hereditary factors to each offspring;
   • Which of the two parental factors is inherited by an offspring is purely a matter of chance.

2. The concept of Dominance/Recessiveness:
   • In heredity, one trait is always dominant over the other, this other trait being the recessive one;
   • To show up, a dominant trait needs only one trait "factor" from the parent; the recessive needs two;
   • A trait may not show up in an individual but its factor can still be transmitted to the next generation.

Mendel's postulate was that if these concepts are true, then one must obtain the observed results; conversely, one will obtain these results only if these concepts are valid.
The First Generation Traits

To see how these concepts help resolve the first problem, consider first the specific case of the seed shape: Let the factors possessed by the pure round-shaped parent be represented as RR (each R representing one "round" trait factor); similarly, let the factors of the pure wrinkled-shaped parent be represented as ww. In cross-fertilizing the round-seed plants with the wrinkled-seed ones, each first generation hybrid will have factors that are either Rw or wR. And now, if the round trait is dominant over the wrinkled trait, then observe that the entire first generation will be all round, precisely as in Mendel's experiment.

In general, when a pure dominant trait with factors DD is cross-fertilized with a pure recessive trait with factors rr, the first generation hybrid will have factors Dr or rD, each one displaying uniformly the dominant trait, but carrying the recessive trait. The concepts of hereditary factors (genes) and of dominance thus enabled Mendel to resolve the problem of the uniform display of traits in the first generation; just as important, they also provided the foundation for elucidating the principles governing trait selection in the second and subsequent generations. This latter exercise is what would require probability theory.
Probability and The Second Generation Traits

The key to the second generation trait manifestation is a recognition that while each seed of the first generation plants looks like the dominant round-seed type in the parental generation, there are some fundamental, but invisible, differences: the parental generation has pure trait factors RR and ww; the first generation has two distinct trait factors, Rw (or wR), one visible (phenotype) because it is dominant, the other not visible but inherited nonetheless (genotype). The hereditary but otherwise invisible trait is the key.

To analyze the composition of the second generation, the following is a "modernization" of the probabilistic arguments Mendel used. First note that the collection of all possible outcomes when cross-fertilizing two plants, each with a trait factor set Rw, is given by:

    Ω = {RR, Rw, wR, ww}                                    (7.1)

From here, according to the theory of hereditary factors and dominance, it should now be clear that there will be a mixture of round seeds as well as wrinkled seeds. But because it is purely a matter of chance which factor is passed on from the first generation to the next, this set is rightly considered the sample space of the experiment. (In fact, the phenomenon in question is precisely akin to the idealized experiment in which a coin is tossed twice and the number of heads and tails are recorded, for example with "H" as R, and "T" as w.)

To determine the ratio in which these phenotypic traits will be displayed, let the random variable of interest (in this case the manifested phenotypic trait) be defined as follows:

    X = \begin{cases} 0, & \text{Wrinkled} \\ 1, & \text{Round} \end{cases}       (7.2)

in which case,

    V_X = {0, 1}                                            (7.3)

If the theory of dominance is valid, and if there is an equiprobable chance for each trait combination, then from V_X and its pre-image in Ω, Eq. (7.1), the probability distribution function of the phenotypic manifestation random variable, X, is given by

    P(X = 0) = 1/4                                          (7.4)
    P(X = 1) = 3/4                                          (7.5)

The second generation composition will therefore be a 3:1 ratio of round to wrinkled seeds. (The same arguments presented above apply to all the other single traits.)

Placed side-by-side with the experimental results shown in Table 7.1, these probabilistic arguments are now seen to confirm all the postulated concepts and theories. The fact that the dominant-to-recessive trait ratios observed experimentally did not come out to be precisely 3:1 in all the traits is of course a consequence of the random fluctuations intrinsic to all random phenomena. Note also how the ratios determined from larger experimental samples (≈ 7,000 and ≈ 8,000 respectively for shape and albumen color) are closer to the theoretical value than the ratios obtained from much smaller samples (≈ 600 and ≈ 800 respectively for pod color and flower position). These facts illustrate the fundamental difference between empirical frequencies and theoretical probabilities: the former will not always match the latter exactly, but the difference will dwindle to nothingness as the sample size increases, with the two coinciding in the limit of infinite sample size. The observed results are akin to the idealized experiment of tossing a fair coin twice and determining the number of times one obtains at least one "Head." Theoretically, this event should occur 3/4 of the time, but there will be fluctuations.
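This interplay between chance transmission and the stable 3:1 ratio is easy to watch in a simple simulation. The following short Python sketch (purely illustrative, and of course not part of Mendel's original analysis; the function name and sample sizes are our own choices) mimics the random transmission of the R/w factors in an Rw × Rw cross and tabulates the observed dominant-to-recessive ratio for increasing numbers of offspring:

    import random

    def cross_rw_hybrids(n_offspring, seed=0):
        """Simulate offspring of an Rw x Rw cross; each parent passes on
        'R' or 'w' with equal probability (Mendel's 'matter of chance'
        mechanism). Returns the number of phenotypically round seeds."""
        rng = random.Random(seed)
        n_round = 0
        for _ in range(n_offspring):
            genotype = rng.choice("Rw") + rng.choice("Rw")
            if "R" in genotype:          # dominance: one R factor suffices
                n_round += 1
        return n_round

    for n in (100, 1000, 7324):          # 7,324 is Mendel's seed-shape total
        d = cross_rw_hybrids(n)
        print(f"n = {n:5d}: D:r = {d}:{n - d} = {d / (n - d):.2f}:1 (theory 3:1)")

For small n the simulated ratio wanders noticeably; by n = 7,324 it settles close to 3:1, mirroring the pattern of Table 7.1.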

7.2.4 Multiple Traits and Independence

The discussion thus far has been concerned with single traits and the principles governing their hereditary transmission. Mendel's next task was to determine whether these principles applied equally to trait pairs, and then in general "when several diverse characters are united in the hybrid by crossing." The key question to be answered (does the transmission of one trait interfere with another, or are they wholly independent?) required a series of carefully designed experiments on a large number of plants.
Pairwise Experiments

The first category of multiple trait experiments involved cross-fertilization of plants in which the differentiating characteristics were considered in pairs. For the purpose of illustration, we will consider here only the very first in this series, in which the parental plants differed in seed shape and seed albumen color. Specifically, the seed plants were round and yellow (R,Y), while the pollen plants were wrinkled and green (w,g). (To eliminate any possible systematic "pollen" or "seed" effect, Mendel also performed a companion series of experiments in which the roles of seed and pollen were reversed.) The specific question the experiments were designed to answer is this: will the transmission of the shape trait interfere with color, or will they be independent?

As in the single trait experiments, the first generation of hybrids was obtained by cross-fertilizing the pure round-and-yellow seed plants with pure wrinkled-and-green ones; the second generation plants were obtained by cross-fertilizing first generation plants, and so on, with each succeeding generation similarly obtained from the immediately preceding one.
First generation results: The first generation of fertilized seeds (f1) were all round and yellow like the seed parents. These results are definitely reminiscent of the single trait experiments and appeared to confirm that the principle of dominance extended to pairwise traits independently: i.e., that the round shape trait dominance over the wrinkled, and the yellow color trait dominance over the green, held true in the pairwise experiments just as they did in the single trait experiments. Shape did not seem to interfere with color, at least in the first generation. But how about the second generation? How will the bivariate shape/color traits manifest, and how will this influence the composition of the second generation hybrids? The circumstances are clearly more complicated and require more careful analysis.

Second generation: Postulate, Theoretical Analysis and Results: Rather than begin with the experimental results and then wend our way through the theoretical analysis required to explain the observations, we find it rather more instructive to begin from a postulate, and consequent theoretical analysis, and proceed to compare the theoretical predictions with experimental data.
As with the single trait case, let us define the following random variables: for shape,

    X_1 = \begin{cases} 0, & \text{Wrinkled} \\ 1, & \text{Round} \end{cases}     (7.6)

and for color,

    X_2 = \begin{cases} 0, & \text{Green} \\ 1, & \text{Yellow} \end{cases}       (7.7)

As obtained previously, the single trait marginal pdfs for second generation hybrid plants are given by:

    f_1(x_1) = \begin{cases} 1/4; & x_1 = 0 \\ 3/4; & x_1 = 1 \end{cases}         (7.8)

for shape, and, similarly, for color,

    f_2(x_2) = \begin{cases} 1/4; & x_2 = 0 \\ 3/4; & x_2 = 1 \end{cases}         (7.9)

We now desire the joint pdf f(x_1, x_2).

TABLE 7.2: Theoretical distribution of shape-color traits in second generation hybrids under the independence assumption

    Shape (w/R)   Color (g/Y)   Prob. Dist.    Phenotype
    X_1           X_2           f(x_1, x_2)    Trait
    0             0             1/16           (w,g)
    0             1             3/16           (w,Y)
    1             0             3/16           (R,g)
    1             1             9/16           (R,Y)

Observe that the set of possible trait combinations is as follows:

    Ω = {(w,g), (w,Y), (R,g), (R,Y)}                        (7.10)

giving rise to the 2-dimensional random variable space:

    V_X = {(0,0), (0,1), (1,0), (1,1)}                      (7.11)

Consider first the simplest postulate: that multiple trait transmissions are independent. If this is true, then by definition of stochastic independence the joint pdf will be given by:

    f(x_1, x_2) = f_1(x_1) f_2(x_2)                          (7.12)

so that the theoretical distribution of the second generation hybrids will be as shown in Table 7.2, predicting that the observed traits will be in the proportion 9:3:3:1, with the round-and-yellow variety being the most abundant, the wrinkled-and-green the least abundant, and the wrinkled-and-yellow and the round-and-green in the middle, appearing in equal numbers.
Mendel's experimental results were summarized as follows:

    The fertilized seeds appeared round and yellow like those of the
    seed parents. The plants raised therefrom yielded seeds of four
    sorts, which frequently presented themselves in one pod. In all,
    556 seeds were yielded by 15 plants, and of these there were:
    315 round and yellow,
    101 wrinkled and yellow,
    108 round and green,
    32 wrinkled and green.

and a side-by-side comparison of the theoretical with the experimentally observed distribution is now shown in Table 7.3.

TABLE 7.3: Theoretical versus experimental results for second generation hybrid plants

    Phenotype    Theoretical     Experimental
    Trait        Distribution    Frequencies
    (R,Y)        0.56            0.57
    (w,Y)        0.19            0.18
    (R,g)        0.19            0.19
    (w,g)        0.06            0.06

Since the experimental results matched the theoretical predictions remarkably well, the conclusion is that, indeed, the transmission of color is independent of the transmission of the shape trait.
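The comparison in Table 7.3 is easy to reproduce. The following Python sketch (illustrative only; the dictionaries simply encode Eqs (7.8)-(7.12) and the counts quoted above) computes the joint distribution under the independence postulate and sets it against the observed frequencies:

    # Marginal pdfs for shape (w = 0, R = 1) and color (g = 0, Y = 1), Eqs (7.8)-(7.9)
    f1 = {0: 0.25, 1: 0.75}
    f2 = {0: 0.25, 1: 0.75}

    # Joint pdf under the independence postulate, Eq (7.12)
    theoretical = {(x1, x2): f1[x1] * f2[x2] for x1 in f1 for x2 in f2}

    # Mendel's observed counts for the 556 second-generation seeds
    observed = {(1, 1): 315, (0, 1): 101, (1, 0): 108, (0, 0): 32}
    total = sum(observed.values())       # 556

    labels = {(1, 1): "(R,Y)", (0, 1): "(w,Y)", (1, 0): "(R,g)", (0, 0): "(w,g)"}
    for key in [(1, 1), (0, 1), (1, 0), (0, 0)]:
        print(f"{labels[key]}: theoretical {theoretical[key]:.2f}, "
              f"observed {observed[key] / total:.2f}")

The printed values match the two columns of Table 7.3.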

7.2.5 Subsequent Experiments and Conclusions

A subsequent series of experiments, first on other pairwise traits (yielding results similar to those shown above), and then on various simultaneous combinations of three, four and more multiple traits, led Mendel to conclude as follows:

    There is therefore no doubt that for the whole of the characters
    involved in the experiments the principle applies that the offspring
    of the hybrids in which several essentially different characters are
    combined exhibit the terms of a series of combinations, in which
    the developmental series for each pair of differentiating characters
    are united. It is demonstrated at the same time that the relation
    of each pair of different characters in hybrid union is independent
    of the other differences in the two original parental stocks.

Almost a century and a half after, with probability theory now a familiar fixture in the scientific landscape, and with the broad principles and consequences of genetics part of popular culture, it may be difficult for modern readers to appreciate just how truly revolutionary Mendel's experiments and his application of probability theory were. Still, it was the application of probability theory that made it possible for Mendel to predict, ahead of time, the ratio between phenotypic (visible) occurrences of dominant traits and recessive traits that will arise from a given set of parent genotype (hereditary) traits (although by today's standards all this may now appear routine and straightforward). By thus unraveling the mysteries of a phenomenon that was essentially considered unpredictable and due to chance alone plus some vague "averaging" process, Mendel is rightfully recognized for laying the foundation for modern mathematical biology, influencing the likes of R. A. Fisher, and almost single-handedly changing the course of biological research permanently.

7.3 World War II Warship Tactical Response Under Attack

Unlike in the previous example, where certain systematic underlying biological mechanisms were responsible for the observations, one often must deal with random phenomena for which there are no such easily discernible mechanisms behind the observations. Such is the case with the problem we are about to discuss, involving Japanese suicide bomber plane attacks during World War II. It shows how historical data sets were used to estimate empirical conditional probabilities, and these probabilities subsequently used to answer a very important question with significant consequences for US Naval operations at that crucial moment during the war.

7.3.1 Background and Problem Definition

During World War II, US warships attacked by Japanese kamikaze pilots had two mutually exclusive tactical options: sharp evasive maneuvers to elude the attacker and confound his aim, making a direct hit more difficult to achieve; or offensive counterattack using anti-aircraft artillery. The two options are mutually exclusive because the effectiveness of the counterattacking anti-aircraft guns required maintaining a steady course, presenting an easier target for the incoming kamikaze pilot; sharply maneuvering warships, on the other hand, are entirely unable to aim and deploy their anti-aircraft guns with much effectiveness. A commitment to one option therefore immediately precludes the other.

Since neither tactic was perfectly effective in foiling kamikaze attacks, and since different types of warships (cruisers, aircraft carriers, destroyers, etc.) appeared to experience varying degrees of success with the different options, naval commanders needed a definitive, rational system for answering the question: When attacked by Japanese suicide planes, what is the appropriate tactic for a US warship, evasive maneuvers or offensive counterattack?

TABLE 7.4: Attacks and hits on US WW II Naval Warships in 1943

                              Large Ships (L)    Small Ships (S)       Total
    TACTIC                    Attacks   Hits     Attacks   Hits    Attacks   Hits
    Evasive maneuvers            36       8        144      52       180      60
    Offensive counterattack      61      30        124      32       185      62
    Total                        97      38        268      84       365     122

7.3.2 Approach and Results

The question was answered in a Naval department study commissioned in 1943 and published in 1946², although it was classified until about 1960³.

Data: In the summer of 1943, 365 warships that had been attacked by kamikaze pilots provided the basis for the study and its recommendations. The data record on these ships showed warship type (aircraft carriers, battleships, cruisers, destroyers and auxiliaries), the tactic employed (evasive or offensive) and whether or not the ship was hit.

As in most cases, the raw data remains largely uninformative until appropriately reorganized. In this case, the warships were divided into two categories, "Large" (aircraft carriers, battleships, cruisers) and "Small" (destroyers and auxiliaries), and the data sorted according to the tactic employed, the corresponding number of attacks, and the resulting number of hits suffered. The results are shown in Table 7.4.

Analysis: Assume that the data set is large enough so that frequencies can be considered as reasonable estimates of probabilities. Now consider an experiment consisting of selecting a warship at random; the various possible outcomes are as follows. Ship type: L, ship is large; S, ship is small. Naval tactic: E, ship made an evasive maneuver; C, ship counterattacked. Ship final status: H, ship was hit. The problem may now be cast as that of computing probabilities of various appropriate events using the data as presented in Table 7.4, and interpreting the results accordingly.
Among the various probabilities that can be computed from this table, the following are the most important from the perspective of the Naval commanders:

1. P(H), P(H|E), and P(H|C): respectively, the unconditional probability of any warship getting hit (i.e., the overall effectiveness of kamikaze attacks), the probability of getting hit when taking evasive measures, and the probability of getting hit when counterattacking, all regardless of size;

2. P(H|L); P(H|S): respectively, the probability of a large ship or a small ship getting hit, regardless of tactics employed;

3. P(H|L ∩ E); P(H|L ∩ C): the probability of getting hit when a large ship is taking evasive maneuvers versus when counterattacking;

4. P(H|S ∩ E); P(H|S ∩ C): the probability of getting hit when a small ship is taking evasive maneuvers versus when counterattacking.

²P.M. Morse and G.E. Kimball, Methods of Operations Research, Office of the Chief of Naval Operations, Navy Department, Washington DC, 1946, Chapter 5.
³cf. R. Coughlin and D.E. Zitarelli, The Ascent of Mathematics, McGraw-Hill, NY, 1984, p. 396.
These probabilities are easily computed from the table, yielding the following results:

    P(H) = 122/365 = 0.334                                  (7.13)

indicating that, in general, about one in three attacked ships was hit, regardless of size or survival tactic employed. Similarly,

    P(H|E) = P(H ∩ E)/P(E) = (60/365)/(180/365) = 0.333     (7.14)

(or simply directly from the rightmost column in the table). Also,

    P(H|C) = 62/185 = 0.335                                 (7.15)

The obvious conclusion: overall, there appears to be no difference between the effectiveness of evasive maneuvers as opposed to offensive counterattacks when all the ships are considered together regardless of size. But does size matter?

Taking size into consideration (regardless of survival tactics), the probabilities are as follows:

    P(H|L) = 38/97 = 0.392                                  (7.16)
    P(H|S) = 84/268 = 0.313                                 (7.17)

so that it appears as if small ships have a slight edge in surviving the attacks, regardless of tactics employed. But it is possible to refine these probabilities further by taking both size and tactics into consideration, as follows.

For large ships, we obtain

    P(H|L ∩ E) = 8/36 = 0.222                               (7.18)
    P(H|L ∩ C) = 30/61 = 0.492                              (7.19)

where we see the first clear indication of an advantage: large ships making evasive maneuvers are more than twice as effective in avoiding hits as their counterattacking counterparts.

For small ships,

    P(H|S ∩ E) = 52/144 = 0.361                             (7.20)
    P(H|S ∩ C) = 32/124 = 0.258                             (7.21)

and while the advantage is not nearly as dramatic as with large ships, it is still quite clear that small ships are more effective when counterattacking.
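All of these conditional probabilities are simple ratios of the counts in Table 7.4. A minimal Python sketch of the computation (illustrative only; the helper p_hit and the data layout are our own constructions) is:

    # Counts from Table 7.4: (attacks, hits) for each (size, tactic) cell
    data = {
        ("L", "E"): (36, 8),   ("L", "C"): (61, 30),
        ("S", "E"): (144, 52), ("S", "C"): (124, 32),
    }

    def p_hit(size=None, tactic=None):
        """Empirical P(H | size, tactic); None marginalizes over that factor."""
        cells = [(a, h) for (s, t), (a, h) in data.items()
                 if (size is None or s == size) and (tactic is None or t == tactic)]
        attacks = sum(a for a, _ in cells)
        hits = sum(h for _, h in cells)
        return hits / attacks

    print(f"P(H)     = {p_hit():.3f}")           # 0.334, Eq (7.13)
    print(f"P(H|L,E) = {p_hit('L', 'E'):.3f}")   # 0.222, Eq (7.18)
    print(f"P(H|L,C) = {p_hit('L', 'C'):.3f}")   # 0.492, Eq (7.19)
    print(f"P(H|S,E) = {p_hit('S', 'E'):.3f}")   # 0.361, Eq (7.20)
    print(f"P(H|S,C) = {p_hit('S', 'C'):.3f}")   # 0.258, Eq (7.21)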
The final recommendation is now clear:

    When attacked by Japanese suicide planes, the appropriate tactic
    for small ships is to hold a steady course and counterattack; for
    large ships, sharp evasive maneuvers are more effective.

7.3.3 Final Comments

In hindsight these conclusions from the Naval study are perfectly logical, almost obvious; but at the time, the stakes were high, time was of the essence, and nothing was clear or obvious. It is gratifying to see the probabilistic analysis bring such clarity and yield results that are in perfect keeping with common sense, after the fact.

7.4 Summary and Conclusions

This chapter, the first of a planned trilogy of case studies (see Chapters 11 and 20 for the others), has been concerned with demonstrating the application of probability theory to two specific historical problems, each with its own significant practical implication that was probably not evident at the time the work was being done. The first, Mendel's ground-breaking work on genetics, is well-structured, and showed how a good theory can help clarify confusing experimental data. The second, the US Navy's analysis of kamikaze attack data during WW II, is less structured. It demonstrates how data, converted to empirical probabilities, can be used to make appropriate decisions. Viewed from this distance in time, and from the generally elevated heights of scientific sophistication of today, it will be all too easy to misconstrue these applications as quaint, if not trivial. But that will be a gross mistake. The significance of these applications must be evaluated within the context of the time in history when the work was done, vis-à-vis the tools available to the researchers at the time. The US Naval application saved lives and irreversibly affected the course of the war in the Pacific theater. Mendel's result did not just unravel a vexing 19th century mystery; it changed the course of biological research for good, even though it was not obvious at the time. All these were made possible by the appropriate application of probability theory.

Part III

Distributions

Modeling Random Variability

    From a drop of water, a logician could infer
    the possibility of an Atlantic or a Niagara . . .
    So all life is a great chain, the nature of which is known
    whenever we are shown a single link of it.
    Sherlock Holmes, A Study in Scarlet
    Sir Arthur Conan Doyle (1859-1930)

Chapter 8: Ideal Models of Discrete Random Variables
Chapter 9: Ideal Models of Continuous Random Variables
Chapter 10: Information, Entropy and Probability Models
Chapter 11: Application Case Studies II: In-Vitro Fertilization

Chapter 8

Ideal Models of Discrete Random Variables

8.1   Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  218
8.2   The Discrete Uniform Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . .  219
      8.2.1  Basic Characteristics and Model . . . . . . . . . . . . . . . . . . . . . . . . . . .  219
      8.2.2  Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  220
8.3   The Bernoulli Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  220
      8.3.1  Basic Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  221
      8.3.2  Model Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  221
      8.3.3  Important Mathematical Characteristics . . . . . . . . . . . . . . . . . . . .  222
8.4   The Hypergeometric Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . .  222
      8.4.1  Basic Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  222
      8.4.2  Model Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  223
      8.4.3  Important Mathematical Characteristics . . . . . . . . . . . . . . . . . . . .  223
      8.4.4  Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  224
8.5   The Binomial Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  224
      8.5.1  Basic Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  225
      8.5.2  Model Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  225
      8.5.3  Important Mathematical Characteristics . . . . . . . . . . . . . . . . . . . .  226
             Relation to other random variables . . . . . . . . . . . . . . . . . . . . . . . .  226
      8.5.4  Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  227
             Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  228
8.6   Extensions and Special Cases of the Binomial Random Variable . . . . . . .  230
      8.6.1  Trinomial Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  230
             Basic Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  230
             The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  230
             Some Important Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  231
      8.6.2  Multinomial Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  231
      8.6.3  Negative Binomial Random Variable . . . . . . . . . . . . . . . . . . . . . . . .  231
             Basic Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  232
             Model Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  232
             Important Mathematical Characteristics . . . . . . . . . . . . . . . . . . .  233
      8.6.4  Geometric Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  234
             The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  234
             Important Mathematical Characteristics . . . . . . . . . . . . . . . . . . .  234
             Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  234
8.7   The Poisson Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  235
      8.7.1  The Limiting Form of a Binomial Random Variable . . . . . . . . . . . .  236
             Model Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  236
      8.7.2  First Principles Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  237
             Basic Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  237
             Model Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  237
      8.7.3  Important Mathematical Characteristics . . . . . . . . . . . . . . . . . . . .  239
      8.7.4  Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  239
             Standard Poisson Phenomena . . . . . . . . . . . . . . . . . . . . . . . . . . . .  240
             Overdispersed Poisson-like Phenomena and the Negative
             Binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  242
8.8   Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  243
REVIEW QUESTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
EXERCISES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
APPLICATION PROBLEMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
All these constructions and the laws connecting them
can be arrived at by the principle of looking for
the mathematically simplest concepts and the link between them.
Albert Einstein (1879-1955)

Having presented the probability distribution function, f(x), as our mathematical function of choice for representing the ensemble behavior of random phenomena, and having examined the properties and characteristics of the generic pdf extensively in the last four chapters, it now remains to present specific probability distribution functions for some actual real-world phenomena of practical importance. We do this in each case by starting with all the relevant information about the phenomenological mechanism behind the specific random variable, X; and, in much the same way as for deterministic phenomena, we derive the expression for the pdf f(x) appropriate to the random phenomenon in question. The end result is several ideal models of random variability, presented as a collection of probability distribution functions, each derived directly from, and hence explicitly linked to, the underlying random phenomenological mechanisms.

This chapter and the next one are devoted to the development and analysis of such models for some important random variables that are commonly encountered in practice, beginning here with discrete random variables.

8.1 Introduction

As articulated briefly in the prelude chapter (Chapter 0), it is entirely possible to develop, from first-principles phenomenological considerations, appropriate theoretical characterizations of the variability inherent to random phenomena. Two primary benefits accrue from this first-principles approach:

1. It acquaints the reader with the mechanistic underpinnings of the random variable and the genesis of its pdf, making it less likely that the reader will inadvertently misapply the pdf to a problem to which it is unsuited. The single most insidious trap into which unsuspecting engineers and scientists often fall is that of employing a pdf inappropriately to try to solve a problem requiring a totally different pdf: for example, attempting to use the (continuous) Gaussian pdf, simply out of familiarity, to tackle a problem involving the (discrete) phenomenon of the number of occurrences of safety incidents in a manufacturing site, a natural Poisson random variable.

2. It demonstrates the principles and practice of how one goes about developing such probability models, so that should it become necessary to deal with a new random phenomenon with no pre-existing "canned" model, the reader is able to fall back on first principles to derive, with confidence, the required model.

The modeling exercise begins with a focus on discrete random variables first in this chapter, and continuous random variables next in the following chapter. In developing these models, we will draw on ideas and concepts discussed in earlier chapters about random variables, probability, probability distributions, the calculus of probability, etc., and utilize the following model development and analysis strategy:

1. Identify basic characteristics of the problem;
2. Identify elementary events and the phenomenological mechanism for combining elementary events into complex phenomena;
3. Combine components into a probability model. In each case, the resulting model will be an expression for computing P(X = x), where x takes on discrete values 0, 1, 2, . . .;
4. Analyze, characterize and illustrate application of the model.

We start from the simplest possible random variable and build up from there, presenting some results without proof, or else leaving such proofs as exercises to the reader where appropriate.

8.2 The Discrete Uniform Random Variable

8.2.1 Basic Characteristics and Model

The phenomenon underlying the discrete uniform random variable is as follows:

1. The experiment has k mutually exclusive outcomes;
2. Each outcome is equiprobable; and
3. The random variable X assigns the k distinct values x_i; i = 1, 2, . . . , k, to each respective outcome.

The model in this case is quite straightforward:

    f(x_i) = \begin{cases} \frac{1}{k}; & i = 1, 2, \ldots, k \\ 0; & \text{otherwise} \end{cases}       (8.1)

with the random variable earning its name because f(x) is uniform across the valid range of admissible values. Thus, Eq. (8.1) is the pdf for the discrete uniform random variable, U_D(k). The only characteristic parameter is k, the total number of elements in the sample space. Sometimes the k elements are indexed to include 0, i.e., i = 0, 1, 2, . . . , k − 1 (allowing easier connection to the case where k = 2 and the only two outcomes are the binary numbers 0, 1). Under these circumstances, the mean and variance are:

    μ = E(X) = (k − 1)/2                                     (8.2)

and

    σ² = (k + 1)(k − 1)/12                                   (8.3)

8.2.2 Applications

We have already encountered in Chapters 3 and 4 several examples of the discrete uniform random variable: the tossing of an unbiased coin (with k = 2); the tossing of a fair die (with k = 6 and x_i = i; i = 1, 2, . . . , 6); or, in general, the selection of an item at random from a well-defined population of size k (students from a peer group; balls from a bag; marbles from an urn, etc.). This model is therefore most useful in practice for phenomena in which there is no justifiable reason to expect one outcome to be favored over another (see Chapter 10).

In the event that: (i) we restrict the number of outcomes to just two, i.e., k = 2 (as in the single coin toss; or in the determination of the sex of a newborn selected at random from a hospital; or, in an in-vitro fertilization treatment cycle, the success or failure of a single transferred embryo to result in a live birth); and (ii) we relax the equiprobable outcome condition, thereby allowing the probability of the occurrence of each of the two possible outcomes to differ (while necessarily respecting the constraint requiring the two probabilities to sum to 1); the resulting random variable is known as a Bernoulli random variable.

8.3 The Bernoulli Random Variable

8.3.1 Basic Characteristics

The phenomenon underlying the Bernoulli random variable is as follows:

1. The experiment has only 2 possible mutually exclusive outcomes, S (for "success") or F (for "failure"). (Other possible designations are Defective/Non-defective, Pass/Fail, Head/Tail, etc.);
2. The probability p, (0 < p < 1), is assigned to the outcome S;
3. The random variable X assigns the number 1 to the outcome S and 0 to the other outcome, F.

A random variable defined as above is a Bernoulli random variable; in fact, an experiment characterized as in item 1 above is known as a Bernoulli trial.

8.3.2 Model Development

This is a very straightforward case in which every aspect of the problem is simple and has been specified explicitly. From characteristic 1 above, the sample space is

    Ω = {F, S}                                               (8.4)

consisting of only two elements; and since P(S) = p, and Ω contains exactly two elements, P(F) must be (1 − p). Finally, from characteristic 3, V_X = {0, 1}, so that:

    P_X(X = 0) = P(F) = (1 − p)                              (8.5)
    P_X(X = 1) = P(S) = p                                    (8.6)

The desired probability model is therefore given by:

    f(x) = \begin{cases} (1 − p); & x = 0 \\ p; & x = 1 \end{cases}               (8.7)

or, in tabular form:

    x     f(x)
    0     (1 − p)
    1     p

This model can be made more compact as follows: introduce two "indicator" variables, the "success" indicator, I_S, defined as:

    I_S = \begin{cases} 1; & \text{for } x = 1 \\ 0; & \text{for } x = 0 \end{cases}      (8.8)

and its complement, the "failure" indicator, I_F:

    I_F = \begin{cases} 1; & \text{for } x = 0 \\ 0; & \text{for } x = 1 \end{cases}      (8.9)

The pdf for the Bernoulli random variable is then given by the more compact:

    f(x) = p^{I_S} (1 − p)^{I_F}                             (8.10)

The Bernoulli random variable, Bn(p), is therefore a binary variable; it takes on only two values: 0, with probability (1 − p), or 1, with probability p.

8.3.3 Important Mathematical Characteristics

The following are important characteristics of the Bernoulli random variable, Bn(p), and its pdf:

1. Characteristic parameter: p, the probability of "success";
2. Mean: μ = E(X) = p;
3. Variance: σ² = p(1 − p); or σ = \sqrt{p(1 − p)};
4. Moment generating function: M(t) = [pe^t + (1 − p)];
5. Characteristic function: φ(t) = [pe^{jt} + (1 − p)].

These characteristics are easily established (see Exercise 8.1). For example, by definition,

    E(X) = \sum_i x_i f(x_i) = 0 · (1 − p) + 1 · p = p       (8.11)

8.4 The Hypergeometric Random Variable

8.4.1 Basic Characteristics

The hypergeometric random variable arises naturally from problems in acceptance sampling, and similar problems involving drawing samples randomly from a finite-sized population; the basic phenomenon underlying it is as follows:

1. The population (or "lot") is dichotomous, in the sense that its elements can be divided into two mutually exclusive groups;
2. The total number of units in the lot (equivalently, elements in the population) is N;
   • N_d of these share a common attribute of interest (e.g., "defective"); the remaining (N − N_d) do not have this attribute;
   • the population proportion of "defective" items, p, is therefore N_d/N;
3. The experiment: Draw a total of n items; test them all for the presence of said attribute;
4. The random variable X: the number of "defective" items in the sample, n;
5. Assumption: The sample is drawn such that each unit in the lot has an equal chance of being drawn.

8.4.2 Model Development

The Sample Space: After each experiment, the outcome ω_i is the n-tuple

    ω_i = [a_1, a_2, . . . , a_n]_i                           (8.12)

where each a_j; j = 1, 2, . . . , n, is the attribute of the j-th item drawn (a_j is therefore either "D" for defective, or "F" for defect-free).

Now the total number of ways of choosing n items from a lot of size N is:

    N^* = \binom{N}{n} = \frac{N!}{n!(N − n)!}               (8.13)

The sample space, the collection of all such possible outcomes ω_i, therefore contains i = 1, 2, . . . , N^* elements.

The Model: Since the total number of "defectives" contained in each ω_i is the random variable of interest in this case, first observe that obtaining x defectives from a sample of size n arises from choosing x from N_d and (n − x) from (N − N_d); on the assumption of equiprobable drawings for each item, we obtain

    P(X = x) = (Total number of "favorable" ways for X = x)/(Total number of possible choices)        (8.14)

Thus,

    f(x) = \frac{\binom{N_d}{x}\binom{N − N_d}{n − x}}{\binom{N}{n}}              (8.15)

is the pdf for a hypergeometric random variable, H(n, N_d, N). Alternatively,

    f(x) = \frac{1}{c}\,\frac{N_d!}{x!(N_d − x)!}\,\frac{(N − N_d)!}{(n − x)!(N − N_d − n + x)!}       (8.16)

where the constant c is \binom{N}{n}.

8.4.3 Important Mathematical Characteristics

The following are important characteristics of the hypergeometric random variable, H(n, N_d, N), and its pdf:

1. Characteristic parameters: n, N_d and N, respectively the sample size, the total number of "defectives," and the population (or lot) size;
2. Mean: μ = E(X) = nN_d/N = np;
3. Variance: σ² = np(1 − p)(N − n)/(N − 1).

8.4.4 Applications

This random variable and its pdf model find application mostly in acceptance sampling. The following are a few examples of such applications.
Example 8.1 APPLICATION OF THE HYPERGEOMETRIC MODEL
A batch of 20 electronic chips contains 5 defectives. Find the probability that out of 10 selected for inspection (without replacement), 2 will be found defective.

Solution:
In this case, x = 2, N_d = 5, N = 20, n = 10, and therefore:

    f(2) = \frac{\binom{5}{2}\binom{15}{8}}{\binom{20}{10}} = 0.348               (8.17)

Example 8.2 APPLICATION OF THE HYPERGEOMETRIC MODEL
An order of 25 high-reliability electron tubes has been shipped to your company. Your acceptance sampling protocol calls for selecting 5 at random to put through a destructive accelerated life-test; if fewer than 2 fail the test, the remaining 20 are accepted, otherwise the shipment is rejected. What is the probability of accepting the lot if truly 4 out of the 25 tubes are defective?

Solution:
In this case, N_d = 4, N = 25, n = 5; and we require P(X = 0) + P(X = 1):

    P(Lot acceptance) = f(0) + f(1) = 0.834                  (8.18)

so that there is a surprisingly high probability that the lot will be accepted even though 16% of the lot is defective. Perhaps the acceptance sampling protocol needs to be re-examined.
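Both computations are easy to verify numerically. The following sketch (illustrative only) evaluates the hypergeometric pmf of Eq. (8.15) using nothing beyond Python's standard library:

    from math import comb

    def hypergeom_pmf(x, n, Nd, N):
        """Hypergeometric pmf, Eq (8.15): probability of x defectives in a
        sample of n drawn without replacement from a lot of N containing
        Nd defectives."""
        return comb(Nd, x) * comb(N - Nd, n - x) / comb(N, n)

    # Example 8.1: P(X = 2) with N = 20, Nd = 5, n = 10
    print(round(hypergeom_pmf(2, 10, 5, 20), 3))                       # 0.348

    # Example 8.2: P(lot acceptance) = f(0) + f(1) with N = 25, Nd = 4, n = 5
    print(round(sum(hypergeom_pmf(x, 5, 4, 25) for x in (0, 1)), 3))   # 0.834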

8.5 The Binomial Random Variable

8.5.1 Basic Characteristics

The basic phenomenon underlying the binomial random variable is as follows:

1. Each experiment consists of n independent Bernoulli trials under identical conditions; i.e., each trial produces exactly two mutually exclusive outcomes, nominally labeled S ("success") and F ("failure");
2. The probability of "success" in each trial, P(S) = p;
3. The random variable, X, is the number of "successes" in the n trials.

It should be clear from these characteristics that the binomial random variable is related to the Bernoulli random variable and also to the hypergeometric random variable. We will identify the explicit relationships shortly.

More importantly, a good number of real-life problems can be idealized in the form of the simple conceptual experiment of tossing a coin n times and observing x, the total number of heads: for example, in in-vitro fertilization (IVF) procedures, the total number of live births resulting from the transfer of n embryos in a patient's womb, given that p is the probability of a successful pregnancy from a single embryo, is a binomial random variable (see case study in Chapter 11). Similarly, in characterizing the reliability of a system consisting of n components, given that p is the probability of a single component functioning, then x, the total number of components functioning at any specific time, is also a binomial random variable. The binomial random variable is therefore important in its own right; but, as we will soon see, along with its probability model, it also serves as the launching pad for the development of other probability models of important phenomena.

8.5.2 Model Development

The Sample Space: Each outcome of a binomial experiment may be represented as the n-tuple:

    ω_i = [s_1, s_2, . . . , s_n]_i                           (8.19)

a string of n letters that are either "S" or "F." Because each trial results in exactly one of two mutually exclusive results (S or F), there are precisely 2^n such elements ω_i in the sample space (recall Example 3.8 in Chapter 3, especially Eqn. (3.28)); i.e.,

    Ω = {ω_i}_{i=1}^{2^n}                                    (8.20)

The Random Variable X: The total number of occurrences of "S" contained in each ω_i is the random variable of interest in this case. The most primitive elementary events, the observation of an "S" or an "F" in each trial, are mutually exclusive; and the observation of a total number of x occurrences of "S" in each experiment corresponds to the compound event E_x, where

    E_x = {x occurrences of "S" and (n − x) occurrences of "F"}                   (8.21)

The Model: Given that the probability of "success" in each trial, P(S) = p, and by default P(F) = (1 − p) = q, then by the independence of the n trials in each experiment, the probability of the occurrence of the compound event E_x defined above is:

    P(E_x) = p^x (1 − p)^{n−x}                               (8.22)

However, in the original sample space Ω, there are \binom{n}{x} different such events in which the sequence in ω_i contains x "successes" and (n − x) "failures," where

    \binom{n}{x} = \frac{n!}{x!(n − x)!}                     (8.23)

Thus, P(X = x) is the sum of all events contributing to the pre-image in Ω of the event that the random variable X takes on the value x; i.e.,

    P(X = x) = \binom{n}{x} p^x (1 − p)^{n−x}                (8.24)

Thus, the pdf for the binomial random variable, Bi(n, p), is:

    f(x) = \frac{n!}{x!(n − x)!}\, p^x (1 − p)^{n−x}         (8.25)

Important Mathematical Characteristics

The following are important characteristics of the binomial random variable, Bi(n, p) and its pdf:
1. Characteristic parameters: n, p; respectively, the number of independent trials in each experiment, and the probability of success in each
trial;
2. Mean: = E(X) = np;
3. Variance: 2 = np(1 p);
4. Moment generating function: M (t) = [pet + (1 p)]n ;
5. Characteristic function: (t) = [pejt + (1 p)]n ;

Ideal Models of Discrete Random Variables

227

Relation to other random variables


The binomial random variable is intimately related to other random variables we are yet to encounter (see later); it is also related as follows to the
two random variables we have discussed thus far:
1. Hypergeometric: It is easy to show that the hypergeometric random
variable, H(n, Nd , N ), and the binomial random variable Bi(n, p) are related
as follows:
lim H(n, Nd , N ) = Bi(n, p)
(8.26)
N

where p = Nd /N remains constant.


2. Bernoulli: If X1 , X2 , . . . , Xn are n independent Bernoulli random variables, then:
n

Xi
(8.27)
X=
i=1

is a binomial random variable. This is most easily established by computing


the characteristic function for X as dened in (8.27) from the characteristic
function of the Bernoulli random variable. (See Exercise 8.5.)
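Relation (8.27) is also easy to see by simulation. The following illustrative Python sketch (with arbitrarily chosen parameter values of our own) builds X as a sum of n independent Bernoulli draws and compares the resulting empirical frequencies with the Bi(n, p) pmf of Eq. (8.25):

    import random
    from collections import Counter
    from math import comb

    n, p, trials = 10, 0.3, 100_000
    rng = random.Random(1)

    # X = X1 + ... + Xn, each Xi a Bernoulli(p) draw, as in Eq (8.27)
    counts = Counter(sum(rng.random() < p for _ in range(n))
                     for _ in range(trials))

    for x in range(n + 1):
        empirical = counts[x] / trials
        theoretical = comb(n, x) * p**x * (1 - p)**(n - x)   # Eq (8.25)
        print(f"x = {x:2d}: empirical {empirical:.4f}, pmf {theoretical:.4f}")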

8.5.4 Applications

The following are a few examples of practical applications of the binomial model.

Example 8.3 APPLICATION OF THE BINOMIAL MODEL: ANALYSIS
From experience with a battery manufacturing process, it is known that 5% of the products from a particular site are defective. Find the probability of obtaining 2 defective batteries in a batch of 20 drawn from a large lot manufactured at this particular site, and show that we are more likely to find 1 defective battery in the sample of 20 than 2.

Solution:
This problem may be idealized as that of computing P(X = 2), where X is a binomial random variable, Bi(20, 0.05), i.e., where n, the number of trials, is 20, and the probability of "success" is 0.05. In this case, then:

    P(X = 2) = f(2) = \binom{20}{2} (0.05)^2 (0.95)^{18} = 0.189                  (8.28)

On the other hand,

    P(X = 1) = f(1) = \binom{20}{1} (0.05)(0.95)^{19} = 0.377                     (8.29)

so that we are almost twice as likely to find only 1 defective battery in the sample of 20 than 2.
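As a quick numerical check (an illustrative sketch using only the standard library):

    from math import comb

    def binom_pmf(x, n, p):
        """Binomial pmf, Eq (8.25)."""
        return comb(n, x) * p**x * (1 - p)**(n - x)

    print(round(binom_pmf(2, 20, 0.05), 3))   # 0.189, Eq (8.28)
    print(round(binom_pmf(1, 20, 0.05), 3))   # 0.377, Eq (8.29)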

Example 8.4 APPLICATION OF THE BINOMIAL MODEL: DESIGN
From the sales record of an analytical equipment manufacturing company, it is known that their sales reps typically make, on average, one sale of a top-of-the-line near-infrared device for every 3 attempts. In preparing a training manual for future sales reps, the company would like to specify n, the smallest number of sales attempts each sales rep should make (per week) such that the probability of scoring an actual sale (per week) is greater than 0.8. Find n.

Solution:
This problem may also be idealized as involving a binomial Bi(n, p) random variable in which p = 1/3, but for which n is an unknown to be determined to satisfy a design criterion. Finding the probability of the event of interest, (X ≥ 1), is easier if we consider the complement, the event of making no sale at all, (X = 0), i.e.:

    P(X ≥ 1) = 1 − P(X = 0)                                  (8.30)

In this case, since

    f(0) = (2/3)^n                                           (8.31)

then we want

    1 − (2/3)^n > 0.8                                        (8.32)

from where we obtain the smallest n satisfying this inequality to be 4. Thus, the sales brochure should recommend that each sales rep make at least 4 sales attempts to meet the company goals.
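The design calculation generalizes readily. Assuming the same Bi(n, p) idealization, the short sketch below (the helper smallest_n is our own illustrative construction) solves Eq. (8.32) for any single-trial success probability p and target probability:

    from math import floor, log

    def smallest_n(p, target):
        """Smallest n with P(X >= 1) = 1 - (1 - p)**n strictly greater than
        target; from Eq (8.32), this requires n > log(1 - target)/log(1 - p)."""
        return floor(log(1 - target) / log(1 - p)) + 1

    print(smallest_n(1/3, 0.8))   # 4, as in Example 8.4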

Inference
A fundamental question about binomial random variables, and indeed all
random variables, centers around how the parameters indicated in the pdfs
may be determined from data. This is a question that will be considered later
in greater detail and in a broader context; for now, we consider the following
specic question as an illustration: Given data, what can we say about p?
In the particular case of a coin toss, this is a question about determining
the true probability of obtaining a head (or tail) given data from actual coin toss experiments; in the case of predicting the sex of babies, it is
about determining the probability of having a boy or girl given hospital birth
records; and, as discussed extensively in Chapter 11, in in-vitro fertilization,
it is determining from appropriate fertility clinic data, the probability that a
particular single embryo will lead to a successful pregnancy. The answer to
this specic question is one of a handful of important fundamental results of
probability theory.
Let X be the random variable representing the number of successes in n
independent trials, each with an equal probability of success, p, so that X/n
is the relative frequency of success.


Since it seems intuitive that the relative frequency of success should be a reasonable estimate of the true probability of success, we are interested in computing P(|X/n − p| ≥ ε) for some ε > 0; i.e., in words,

what is the probability that the relative frequency of success will differ from the true probability by more than an arbitrarily small number, ε?

Alternatively, this may be restated as:
$$P(|X - np| \geq n\epsilon) \qquad (8.33)$$

Cast in terms of Chebyshev's inequality, which we recall as:
$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}, \qquad (8.34)$$
the bound on the probability we seek is given by:
$$P(|X - np| \geq n\epsilon) \leq \frac{\sigma^2}{\epsilon^2 n^2} \qquad (8.35)$$
since, by comparison of the left hand sides of these two equations, we obtain kσ = nε from which k is easily determined, giving rise to the RHS of the inequality above. And now, because we are particularly concerned with the binomial random variable for which μ = np and σ² = np(1 − p), we have:
$$P(|X - np| \geq n\epsilon) \leq \frac{p(1-p)}{n\epsilon^2} \qquad (8.36)$$
For every ε > 0, as n → ∞,
$$\lim_{n \to \infty} \frac{p(1-p)}{n\epsilon^2} = 0 \qquad (8.37)$$
giving the important result:
$$\lim_{n \to \infty} P\left(\left|\frac{X}{n} - p\right| \geq \epsilon\right) = 0 \qquad (8.38)$$
or the complementary result:
$$\lim_{n \to \infty} P\left(\left|\frac{X}{n} - p\right| < \epsilon\right) = 1 \qquad (8.39)$$

Together, these two equations constitute one form of the Law of Large Numbers indicating, in this particular case, that the relative frequency of success


(the number of successes observed per n trials) approaches the actual probability of success, p, as n → ∞, with probability 1. Thus, for a large number of trials:
$$\frac{x}{n} \approx p \qquad (8.40)$$
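This convergence is easy to observe numerically. The following sketch (assuming NumPy; p = 0.5 is an illustrative choice) simulates increasingly long runs of Bernoulli trials and prints the relative frequency of success, which settles toward p as Eq (8.40) indicates:

    import numpy as np

    rng = np.random.default_rng(1)
    p = 0.5
    for n in [10, 100, 10_000, 1_000_000]:
        x = rng.binomial(n, p)  # total number of successes in n trials
        print(n, x / n)         # relative frequency x/n approaches p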

8.6 Extensions and Special Cases of the Binomial Random Variable

We now consider a series of random variables that are either direct extensions of the binomial random variable (the trinomial and general multinomial random variables), or are special cases (the negative binomial and the geometric random variables).

8.6.1 Trinomial Random Variable

Basic Characteristics
In direct analogy to the binomial random variable, the following basic phenomenon underlies the trinomial random variable:

1. Each experiment consists of n independent trials under identical conditions;
2. Each trial produces exactly three mutually exclusive outcomes, ω₁, ω₂, ω₃ (Good, Average, Poor; A, B, C; etc.);
3. In each single trial, the probability of obtaining outcome ω₁ is p₁; the probability of obtaining ω₂ is p₂; and the probability of obtaining ω₃ is p₃, with p₁ + p₂ + p₃ = 1;
4. The random variable of interest is the two-dimensional, ordered pair (X₁, X₂), where X₁ is the number of times that outcome 1, ω₁, occurs in the n trials; and X₂ is the number of times that outcome 2, ω₂, occurs in the n trials. (The third random variable, X₃, the complementary number of times that outcome 3, ω₃, occurs in the n trials, is constrained to be given by X₃ = n − X₁ − X₂; it is not independent.)
The Model
It is easy to show, following the same arguments employed in deriving the binomial model, that the trinomial model is:
$$f(x_1, x_2) = \frac{n!}{x_1! \, x_2! \, (n - x_1 - x_2)!} \, p_1^{x_1} p_2^{x_2} p_3^{n - x_1 - x_2}, \qquad (8.41)$$
the joint pdf for the two-dimensional random variable (X₁, X₂).


Some Important Results
The moment generating function (mgf) for the trinomial random variable is:
$$M(t_1, t_2) = \sum_{x_1=0}^{n} \sum_{x_2=0}^{n-x_1} \frac{n!}{x_1! \, x_2! \, (n - x_1 - x_2)!} \left(p_1 e^{t_1}\right)^{x_1} \left(p_2 e^{t_2}\right)^{x_2} p_3^{n - x_1 - x_2} \qquad (8.42)$$
which simplifies to:
$$M(t_1, t_2) = \left(p_1 e^{t_1} + p_2 e^{t_2} + p_3\right)^n \qquad (8.43)$$
We now note the following important results (since p₂ + p₃ = 1 − p₁, and p₁ + p₃ = 1 − p₂):
$$M(t_1, 0) = \left[(1 - p_1) + p_1 e^{t_1}\right]^n \qquad (8.44)$$
$$M(0, t_2) = \left[(1 - p_2) + p_2 e^{t_2}\right]^n \qquad (8.45)$$
indicating marginal mgfs, which, when compared with the mgf obtained earlier for the binomial random variable, show that:

1. The marginal distribution of X₁ is that of the Bi(n, p₁) binomial random variable;
2. The marginal distribution of X₂ is that of the Bi(n, p₂) binomial random variable.

8.6.2 Multinomial Random Variable

It is now a straightforward exercise to extend the discussion above to the multinomial case where there are k mutually exclusive outcomes in each trial, each with the respective probabilities of occurrence, pᵢ; i = 1, 2, 3, ..., k, such that:
$$\sum_{i=1}^{k} p_i = 1 \qquad (8.46)$$

The resulting model is the pdf:
$$f(x_1, x_2, \ldots, x_k) = \frac{n!}{x_1! \, x_2! \cdots x_k!} \, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k} \qquad (8.47)$$
along with
$$\sum_{i=1}^{k} x_i = n \qquad (8.48)$$
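As a quick illustration of Eq (8.47), the following sketch (assuming SciPy; the trinomial numbers chosen are illustrative) evaluates the joint pdf for one particular outcome:

    from scipy import stats

    # Trinomial case of Eq (8.47): n = 10 trials with probabilities p1, p2, p3
    n, probs = 10, [0.5, 0.3, 0.2]
    # P(X1 = 5, X2 = 3, X3 = 2) = 10!/(5! 3! 2!) (0.5**5)(0.3**3)(0.2**2)
    print(stats.multinomial.pmf([5, 3, 2], n=n, p=probs))  # ~0.085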

8.6.3 Negative Binomial Random Variable

Basic Characteristics
The basic phenomenon underlying the negative binomial random variable is very similar to that for the original binomial random variable dealt with earlier:

1. Like the binomial random variable, each trial produces exactly two mutually exclusive outcomes, S (success) and F (failure); the probability of success in each trial is P(S) = p;
2. The experiment consists of as many trials as are needed to obtain k successes, with each trial considered independent, and carried out under identical conditions;
3. The random variable, X, is the number of failures before the kth success. (Since the labels S and F are arbitrary, this could also be considered as the number of "successes" before the kth "failure" if it is more logical to consider the problem in this fashion, so long as we are consistent with what we refer to as a "success" and its complement that is referred to as the "failure".)
Model Development
From the definition of the random variable, X, n, the total number of trials required to obtain exactly k successes, is X + k; and mechanistically, the event X = x occurs as a combination of two independent events: (i) obtaining x failures and k − 1 successes in the first x + k − 1 trials, and (ii) obtaining a success in the (x + k)th trial. Thus:
$$P(X = x) = P(x \text{ failures and } k-1 \text{ successes in the first } x+k-1 \text{ trials}) \times P(\text{success in the } (x+k)\text{th trial}) \qquad (8.49)$$
and from the binomial pdf, we obtain:
$$P(X = x) = \left[\binom{x+k-1}{k-1} p^{k-1} (1-p)^x\right] p = \binom{x+k-1}{k-1} p^k (1-p)^x \qquad (8.50)$$

Thus, the model for the negative binomial random variable NBi(k, p) is:
$$f(x) = \binom{x+k-1}{k-1} p^k (1-p)^x; \quad x = 0, 1, 2, \ldots \qquad (8.51)$$
which is also sometimes written in the entirely equivalent form (see Exercise 8.10):
$$f(x) = \binom{x+k-1}{x} p^k (1-p)^x; \quad x = 0, 1, 2, \ldots \qquad (8.52)$$


(In some instances, the random variable is defined as the total number of trials required to obtain exactly k successes; the discussion above is easily modified for such a definition of X. See Exercise 8.10.)
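In computational work, note that SciPy's nbinom object follows the failure-counting convention of Eqs (8.51) and (8.52); a minimal sketch comparing a direct evaluation of Eq (8.51) against the library (the values k = 3, p = 0.4, x = 5 are illustrative):

    from math import comb
    from scipy import stats

    k, p, x = 3, 0.4, 5
    # Eq (8.51): f(x) = C(x+k-1, k-1) p**k (1-p)**x
    manual = comb(x + k - 1, k - 1) * p**k * (1 - p) ** x
    print(manual, stats.nbinom.pmf(x, k, p))  # the two values agree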
In the most general sense, the parameter k of the negative binomial random variable in fact does not have to be an integer. In most engineering applications, however, k is almost always an integer. In honor of the French mathematician and philosopher, Blaise Pascal (1623-1662), in whose work one will find the earliest mention of this distribution, the negative binomial distribution with integer k is often called the Pascal distribution. When the parameter k is real-valued, the pdf is known as the Polya distribution, in honor of the Hungarian mathematician, George Pólya (1887-1985), and written as:
$$f(x) = \frac{\Gamma(x+k)}{\Gamma(k)\, x!} \, p^k (1-p)^x; \quad x = 0, 1, 2, \ldots \qquad (8.53)$$
where Γ(α) is the Gamma function defined as:
$$\Gamma(\alpha) = \int_0^\infty e^{-y} y^{\alpha - 1} \, dy \qquad (8.54)$$
This function satisfies the recursive expression:
$$\Gamma(\alpha + 1) = \alpha \Gamma(\alpha) \qquad (8.55)$$
so that if α is a positive integer, then
$$\Gamma(\alpha) = (\alpha - 1)! \qquad (8.56)$$
and the pdf in Eq (8.53) will coincide with that in Eq (8.51) or Eq (8.52).
Important Mathematical Characteristics
The following are important characteristics of the negative binomial random variable, NBi(k, p), and its pdf:

1. Characteristic parameters: k, p; respectively, the target number of successes, and the probability of success in each trial;
2. Mean: μ = E(X) = k(1 − p)/p = kq/p;
3. Variance: σ² = k(1 − p)/p² = kq/p²;
4. Moment generating function: M(t) = p^k (1 − qe^t)^{−k};
5. Characteristic function: φ(t) = p^k (1 − qe^{jt})^{−k}.
An alternative form of the negative binomial pdf arises from the following re-parameterization: Let the mean value be represented as μ, i.e.,
$$\mu = \frac{k(1-p)}{p} = k\left(\frac{1}{p} - 1\right) \qquad (8.57)$$


so that
$$p = \frac{k}{k + \mu} \qquad (8.58)$$
in which case, Eq (8.51) becomes
$$f(x) = \frac{(x+k-1)!}{(k-1)!\, x!}\, \frac{(\mu/k)^x}{\left(1 + \frac{\mu}{k}\right)^{x+k}} \qquad (8.59)$$
We shall have cause to refer to this form of the pdf shortly.

8.6.4 Geometric Random Variable

Consider the special case of the negative binomial random variable with k = 1, where the resulting random variable X is the number of failures before the first success. It follows immediately from Eq (8.51) that the required pdf in this case is:
$$f(x) = pq^x; \quad x = 0, 1, 2, \ldots \qquad (8.60)$$

The Model
It is more common to consider the geometric random variable as the number of trials (not failures) required to obtain the first success. It is easy to see that this definition of the geometric random variable merely requires a shift by one in the random variable discussed above, so that the pdf for the geometric random variable is given by:
$$f(x) = pq^{x-1}; \quad x = 1, 2, \ldots \qquad (8.61)$$

Important Mathematical Characteristics
The following are important characteristics of the geometric random variable, G(p), and its pdf:

1. Characteristic parameter: p, the probability of success in each trial;
2. Mean: μ = E(X) = 1/p;
3. Variance: σ² = q/p²;
4. Moment generating function: M(t) = pe^t (1 − qe^t)^{−1};
5. Characteristic function: φ(t) = pe^{jt} (1 − qe^{jt})^{−1}.


Applications
One of the most important applications of the geometric random variable is in free radical polymerization where, upon initiation, monomer units add to a growing chain, with each subsequent addition propagating the chain until a termination event stops the growth. After initiation, each trial involves either a "propagation" event (the successful addition of a monomer unit to the growing chain), or a "termination" event, where the polymer chain is "capped" to yield a "dead" polymer chain that can no longer add another monomer unit. Because the outcome of each trial (propagation or termination) is random, the resulting polymer chains are of variable length; in fact, the physical properties and performance characteristics of the polymer are related directly to the chain length distribution. It is therefore of primary interest to characterize polymer chain length distributions appropriately.
Observe that as described above, the phenomenon underlying free-radical polymerization is such that each polymer chain length is precisely the total number of monomer units added until the occurrence of the termination event. Thus, if termination is considered a "success," then the chain length is a geometric random variable. In polymer science textbooks (e.g., Williams, 1971¹), chemical kinetics arguments are often used to establish what is referred to as the "most probable chain length distribution"; the result is precisely the geometric pdf presented here. In Chapter 10, we use maximum entropy considerations to arrive at the same results.
Example 8.5 APPLICATION OF THE GEOMETRIC DISTRIBUTION MODEL
From their prior history, it is known that the probability of a building construction company recording an accident (minor or major) on any day during construction is 0.2. (a) Find the probability of going 7 days (since the last occurrence) before recording the 1st accident. (b) What is the expected number of days before recording the 1st accident?
Solution:
This problem clearly involves the geometric random variable with p = 0.2. Thus,
(a) the required P(X = 7) is obtained as:
$$f(7) = 0.2(0.8)^6 = 0.05 \qquad (8.62)$$
so that, because of the relatively high probability of the occurrence of an accident, it is highly unlikely that this company can go 7 days before recording the first accident.
(b) The expected value of days in between accidents is:
$$E(X) = \frac{1}{0.2} = 5 \text{ days} \qquad (8.63)$$
so that one would expect, on average, 5 days before recording an accident.
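Both results are reproduced by SciPy's geom object, which uses precisely the trials-to-first-success convention of Eq (8.61); a brief sketch:

    from scipy import stats

    p = 0.2
    print(stats.geom.pmf(7, p))  # 0.2 * 0.8**6 ~ 0.05, cf. Eq (8.62)
    print(stats.geom.mean(p))    # 1/p = 5.0, cf. Eq (8.63)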


¹D.J. Williams, Polymer Science and Engineering, Prentice Hall, NJ, 1971, pp. 58-59.

8.7 The Poisson Random Variable

The Poisson random variable is encountered in a wide variety of practical applications, ranging from industrial manufacturing to physics and biology, and even in such military problems as the historic study of deaths by horse-kicks in the Prussian army, and the German bombardment of London during World War II.
We present here two approaches to the development of the probability model for this important random variable: (i) as a limiting form of the binomial (and negative binomial) random variable; (ii) from first principles.

8.7.1 The Limiting Form of a Binomial Random Variable

Consider a binomial random variable under the following conditions:

1. The number of trials is very large (n → ∞);
2. But as the number of trials becomes very large, the probability of success becomes very small in the same proportion, such that np = λ remains constant;

i.e., we wish to consider the binomial random variable in the limit as n → ∞ but with np remaining constant at the value λ. The underlying phenomenon is therefore that of the occurrence of rare events (with very small probabilities of occurrence) in a large number of trials.
Model Development
From Eq (8.25), the pdf we seek is given by:
$$f(x) = \lim_{\substack{n \to \infty \\ np = \lambda}} \frac{n!}{x!(n-x)!} \, p^x (1-p)^{n-x} \qquad (8.64)$$

which may be written as:
$$f(x) = \lim_{n \to \infty} \left[\frac{n(n-1)(n-2)\cdots(n-x+1)(n-x)!}{x!(n-x)!} \left(\frac{\lambda}{n}\right)^x \left(1 - \frac{\lambda}{n}\right)^{n-x}\right]$$
$$= \lim_{n \to \infty} \left[1\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{x-1}{n}\right) \frac{\lambda^x}{x!} \, \frac{\left(1 - \frac{\lambda}{n}\right)^n}{\left(1 - \frac{\lambda}{n}\right)^x}\right] \qquad (8.65)$$
Now, because:
$$\lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^n = e^{-\lambda} \qquad (8.66)$$
$$\lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^x = 1 \qquad (8.67)$$


the latter being the case because x is fixed, f(x) therefore reduces to:
$$f(x) = \frac{e^{-\lambda} \lambda^x}{x!}; \quad x = 0, 1, 2, \ldots \qquad (8.68)$$
This is the pdf of the Poisson random variable, P(λ), with the parameter λ.
It is also straightforward to show (see Exercise 8.14) that the Poisson pdf arises in the limit as k → ∞ for the negative binomial random variable, but with the mean kq/p = λ remaining constant; i.e., from Eq (8.59),
$$f(x) = \lim_{k \to \infty} \left[\frac{(x+k-1)!}{(k-1)!\, x!}\, \frac{(\lambda/k)^x}{\left(1 + \frac{\lambda}{k}\right)^{x+k}}\right] = \frac{e^{-\lambda} \lambda^x}{x!}; \quad x = 0, 1, 2, \ldots \qquad (8.69)$$
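The quality of this limiting approximation can be inspected numerically; a sketch (assuming SciPy; λ = 2 is an illustrative choice) that holds np = λ fixed while n grows:

    import numpy as np
    from scipy import stats

    lam, x = 2.0, np.arange(6)
    for n in [10, 100, 10_000]:
        # binomial pmf with p = lam/n, so that np = lam throughout
        print(n, np.round(stats.binom.pmf(x, n, lam / n), 4))
    print("P(lam):", np.round(stats.poisson.pmf(x, lam), 4))  # the limit, Eq (8.68)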

8.7.2 First Principles Derivation

Basic Characteristics
Considered from first principles, the basic phenomenon underlying the Poisson random variable is as follows:

1. The experiment consists of observing the number of occurrences of a particular event (a "success") in a given fixed interval (of time, length, space), or area, or volume, etc., of size z;
2. The probability, p, of observing exactly one "success" in a sub-interval of size Δz units (where Δz is small), is proportional to Δz; i.e.
$$p = \eta \Delta z \qquad (8.70)$$
Here, η, the rate of occurrence of the "success" per unit interval, is constant. The probability of not observing the "success" is (1 − p);
3. The probability of observing more than 1 "success" within the sub-interval is negligible, by choice of Δz. (The mathematical implication of this statement is that the indicated probability is O(Δz), a quantity that goes to zero faster than Δz, such that lim_{Δz→0} {O(Δz)/Δz} = 0);
4. Homogeneity: The location of any particular sub-interval of size Δz is immaterial and does not affect the probability of "success" or "failure";
5. Independence: The occurrence of a "success" in one sub-interval is independent of the occurrence in any other sub-interval;
6. The random variable, X, is the total number of "successes" observed in the entire interval of size z.

Observe that these characteristics fit that of a binomial random variable with a large number of trials, n = (z/Δz), each trial having two mutually exclusive outcomes, "success" or "failure," with a small probability of "success" (ηΔz).


Model Development
We start by defining Px(z),
$$P_x(z) = P(X = x \text{ in an interval of size } z) \qquad (8.71)$$
Then the probability of observing x successes in an interval of size z + Δz is given by
$$P_x(z + \Delta z) = P(E_1 \cup E_2 \cup E_3) \qquad (8.72)$$
where E₁, E₂ and E₃ are mutually exclusive events defined as follows:

E₁: exactly x successes are observed in the interval of size z and none are observed in the adjacent sub-interval of size Δz;
E₂: exactly x − 1 successes are observed in the interval of size z and exactly one more is observed in the adjacent sub-interval of size Δz;
E₃: exactly x − i successes are observed in the interval of size z and exactly i more are observed in the adjacent sub-interval of size Δz, with i = 2, 3, ..., x.

From the phenomenological description above, we have the following results:
$$P(E_1) = P_x(z)(1 - \eta\Delta z) \qquad (8.73)$$
$$P(E_2) = P_{x-1}(z)\,\eta\Delta z \qquad (8.74)$$
$$P(E_3) = O(\Delta z) \qquad (8.75)$$
Hence,
$$P_x(z + \Delta z) = P_x(z)(1 - \eta\Delta z) + P_{x-1}(z)\,\eta\Delta z + O(\Delta z); \quad x = 1, 2, \ldots \qquad (8.76)$$
In particular, for x = 0, we have:
$$P_0(z + \Delta z) = P_0(z)(1 - \eta\Delta z) \qquad (8.77)$$

since in this case, both E₂ and E₃ are impossible events. Dividing by Δz in both Eqs (8.76) and (8.77) and rearranging gives
$$\frac{P_x(z + \Delta z) - P_x(z)}{\Delta z} = -\eta\left[P_x(z) - P_{x-1}(z)\right] + \frac{O(\Delta z)}{\Delta z} \qquad (8.78)$$
$$\frac{P_0(z + \Delta z) - P_0(z)}{\Delta z} = -\eta P_0(z) \qquad (8.79)$$
from where taking limits as Δz → 0 produces the following series of differential equations:
$$\frac{dP_x(z)}{dz} = -\eta\left[P_x(z) - P_{x-1}(z)\right] \qquad (8.80)$$
$$\frac{dP_0(z)}{dz} = -\eta P_0(z) \qquad (8.81)$$


To solve these equations requires the following initial conditions: P₀(0), the probability of finding no success in the interval of size 0 (a certain event), is 1; Px(0), the probability of finding x successes in the interval of size 0 (an impossible event), is 0. With these initial conditions, we obtain, first for P₀(z), that
$$P_0(z) = e^{-\eta z} \qquad (8.82)$$
which we may now introduce into Eq (8.80) and solve recursively for x = 1, 2, ... to obtain, in general (after some tidying up),
$$P_x(z) = \frac{(\eta z)^x e^{-\eta z}}{x!} \qquad (8.83)$$
Thus, from first principles, the model for the Poisson random variable is given by:
$$f(x) = \frac{(\eta z)^x e^{-\eta z}}{x!} = \frac{\lambda^x e^{-\lambda}}{x!} \qquad (8.84)$$
where λ = ηz.
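The recursive solution can be verified numerically by integrating the system of Eqs (8.80) and (8.81) directly; a sketch (assuming SciPy's solve_ivp; η = 2 and truncation at x = 5 are illustrative choices, and the truncation is harmless because each Px depends only on Px and Px−1):

    import numpy as np
    from scipy import stats
    from scipy.integrate import solve_ivp

    eta, xmax = 2.0, 5

    def rhs(z, P):
        # dP0/dz = -eta*P0;  dPx/dz = -eta*(Px - P_{x-1}) for x >= 1
        dP = -eta * P
        dP[1:] += eta * P[:-1]
        return dP

    P0 = np.zeros(xmax + 1)
    P0[0] = 1.0  # initial conditions: P0(0) = 1, Px(0) = 0 for x >= 1
    sol = solve_ivp(rhs, (0.0, 1.0), P0, rtol=1e-10, atol=1e-12)
    print(np.round(sol.y[:, -1], 4))                                 # integrated Px(1)
    print(np.round(stats.poisson.pmf(np.arange(xmax + 1), eta), 4))  # Eq (8.83) at z = 1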

8.7.3 Important Mathematical Characteristics

The following are important characteristics of the Poisson random variable, P(λ), and its pdf:

1. Characteristic parameters: λ, or ηz, the mean number of successes in the entire interval of size z; η, the mean number of successes per unit interval size, is sometimes called the "intensity";
2. Mean: μ = E(X) = λ;
3. Variance: σ² = λ;
4. Moment generating function: M(t) = e^{λ(eᵗ − 1)};
5. Characteristic function: φ(t) = e^{λ(e^{jt} − 1)};

6. Reproductive Properties: The Poisson random variable (as with a few others to be discussed later) possesses the following useful property: If Xᵢ, i = 1, 2, ..., n, are n independent Poisson random variables each with parameter λᵢ, i.e. Xᵢ ~ P(λᵢ), then the random variable Y defined as:
$$Y = \sum_{i=1}^{n} X_i \qquad (8.85)$$
is also a Poisson random variable, with parameter λ = Σᵢ₌₁ⁿ λᵢ. Because a sum of Poisson random variables begets another Poisson random variable, this characteristic is known as a "reproductive property." This result is easily established using the method of characteristic functions discussed in Chapter 6. (See Exercise 8.17.)
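A quick simulation sketch of this reproductive property (assuming NumPy; the three λᵢ values are illustrative): the sampled sum should have mean and variance both close to Σλᵢ, as befits a Poisson variable.

    import numpy as np

    rng = np.random.default_rng(2)
    lams = [0.5, 1.2, 2.3]  # sum = 4.0
    y = sum(rng.poisson(lam, 100_000) for lam in lams)
    print(y.mean(), y.var())  # both should be close to 4.0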


8.7.4 Applications

Standard Poisson Phenomena
The Poisson random variable and its pdf find application in a wide variety of practical problems. Recall, for example, the problem of Quality Assurance in a Glass Sheet Manufacturing Process considered in Chapter 1, involving the number of inclusions per square meter of a manufactured glass sheet. The pdf we simply stated in that chapter for the random variable in question, without justification, is now recognizable as the Poisson pdf. The same is true of the application to the number of cell divisions in a fixed time interval Δt discussed in Chapter 6. Following the preceding discussion, it should now be clear to the reader why this was in fact the appropriate pdf to use in each of these applications: the random variable of concern in each example (the number of inclusions in the Chapter 1 example; the number of times each cell in the cell culture divides in the Chapter 6 application) possesses all the characteristics noted above for the Poisson random variable.
Not surprisingly, the Poisson model also finds application in the analysis of annual/monthly/weekly occurrences of safety incidents in manufacturing sites; the number of yarn breaks per shift in fiber spinning machines, and other such phenomena involving counts of occurrences of rare events in a finite interval. The pdf is also used as an approximation to binomial random variables with large n and small p (with np ≤ 7), where the binomial pdf would have been quite tedious to use. The following are a few illustrative examples:
Example 8.6 APPLICATION OF THE POISSON DISTRIBUTION MODEL
Silicon wafers of a particular size made by a chip manufacturer are known to have an average of two contaminant particles each. Determine the probability of finding more than 2 contaminant particles on any such wafer chosen at random.
Solution:
This problem involves a Poisson random variable with λ = 2. Thus, since
$$P(X > 2) = 1 - P(X \leq 2) \qquad (8.86)$$
the required probability is obtained as:
$$P(X > 2) = 1 - (f(0) + f(1) + f(2)) \qquad (8.87)$$
where f(x) is given by:
$$f(x) = \frac{e^{-2} 2^x}{x!} \qquad (8.88)$$
so that:
$$P(X > 2) = 1 - (0.135 + 0.271 + 0.271) = 0.323 \qquad (8.89)$$
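The complement used in Eq (8.86) corresponds to the "survival function" in most statistical libraries; a brief sketch with SciPy:

    from scipy import stats

    # X ~ P(2): P(X > 2) via the complement of the cdf
    print(1 - stats.poisson.cdf(2, 2))  # ~0.323, cf. Eq (8.89)
    print(stats.poisson.sf(2, 2))       # same value, via the survival function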

Example 8.7 APPLICATION OF THE POISSON DISTRIBUTION MODEL
Given that the probability of finding 1 blemish in a foot-long length of a fiber optics wire is 1/1000, and that the probability of finding more than one blemish in this foot-long unit is 0, determine the probability of finding 5 blemishes in a 3,000 ft roll of wire.
Solution:
This problem involves a Poisson random variable with Δz = 1 foot, and the intensity η = 1/1000 per foot. For z = 3,000 ft,
$$\lambda = \eta z = 3.0 \qquad (8.90)$$
and the required probability is obtained as:
$$f(5) = \frac{e^{-3} 3^5}{5!} = 0.101 \qquad (8.91)$$

TABLE 8.1: Theoretical versus empirical frequencies for inclusions data

 x   Theoretical f(x)   Empirical Frequency
 0        0.3679              0.367
 1        0.3679              0.383
 2        0.1839              0.183
 3        0.0613              0.017
 4        0.0153              0.033
 5        0.0031              0.017
 6        0.0005              0.000

The next example is a follow-up to the illustrative example in Chapter 1.

Example 8.8 APPLICATION OF THE POISSON MODEL TO QUALITY ASSURANCE IN GLASS MANUFACTURING
If X, the number of inclusions found on glass sheets produced in the manufacturing process discussed in Chapter 1, can be considered as a Poisson random variable with theoretical value λ = 1, (1) compute the theoretical probabilities of observing x = 0, 1, 2, ..., 6 inclusions, and compare these with the empirical frequencies generated from the data in Table 1.2 and shown in Table 1.5. (2) If, as stated in Chapter 1, only sheets with 3 or fewer inclusions are acceptable and can be sold unconditionally, theoretically what percentage of the product made in this process can be expected to fall into this desired category? (3) What is the theoretical probability of this process producing sheets with more than 5 inclusions?
Solution:
(1) From Eq (8.68) with λ = 1, we obtain the values of f(x) shown in Table 8.1, along with the empirical frequencies computed from the data given in Chapter 1. (We have deliberately included an extra significant figure in the computed f(x) to facilitate comparison.) Observe how close the theoretical probabilities are to the empirical frequencies, especially for x = 0, 1, 2 and 6. Rigorously and quantitatively determining whether the discrepancies observed for values of x = 3, 4 and 5 are significant or not is a matter taken up in Part IV.
(2) The probability of obtaining 3 or fewer inclusions is computed as follows:
$$P(X \leq 3) = F(3) = \sum_{x=0}^{3} f(x) = 0.981 \qquad (8.92)$$
implying that 98.1% of glass sheets manufactured in this process can theoretically be expected to be acceptable, according to the stated criterion of 3 or fewer inclusions.
(3) The required probability, P(X > 5), is obtained as follows:
$$P(X > 5) = 1 - P(X \leq 5) = 1 - 0.9994 = 0.0006, \qquad (8.93)$$
indicating a very small probability that this process will produce sheets with more than 5 inclusions.
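All three parts of this example reduce to pmf, cdf, and survival-function evaluations; a brief sketch with SciPy for λ = 1:

    from scipy import stats

    # (1) Theoretical Poisson(1) probabilities for Table 8.1
    print([round(stats.poisson.pmf(k, 1), 4) for k in range(7)])
    # (2) P(X <= 3), cf. Eq (8.92)
    print(stats.poisson.cdf(3, 1))  # ~0.981
    # (3) P(X > 5), cf. Eq (8.93)
    print(stats.poisson.sf(5, 1))   # ~0.0006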

Overdispersed Poisson-like Phenomena and the Negative Binomial Distribution
The equality of mean and variance is a unique distinguishing characteristic of the Poisson random variable. As such, for true Poisson phenomena, it will always be the case that μ = σ² = λ in theory, and, within limits of data variability, in practice. There are cases of practical importance, however, where the Poisson-like phenomenon (e.g., the number of occurrences of rare events with small probabilities of occurrence) possesses a rate (or intensity) parameter that is not uniformly constant. Under these circumstances, the variance of the random variable will exceed the mean, giving rise to what is generally known as "overdispersion," for which the Poisson model will no longer be strictly valid. Examples include such phenomena as counts of certain species of insects found in sectors of a farmland; the number of accidents reported per month to an insurance company; or the number of incidents of suicide per year in various counties in a state. These phenomena clearly involve the occurrences of rare events, but in each case, the characteristic Poisson parameter is not uniform across the entire domain of interest. With the insects, for example, the spatial aggregation per unit area is not likely to be uniform across the entire farm area because certain areas may be more attractive to the insects than others; the susceptibility to accidents is not constant across all age groups; and with human populations, not everyone in the region of interest is subject to the same risk for suicide.
For such problems, the negative binomial model is more appropriate. First, observe from Eq (8.69) that the negative binomial pdf, with two parameters, k and p, provides more flexibility (with finite k) than the limiting Poisson case; furthermore, the variance, k(1 − p)/p², is always different from (in fact, always larger than) the mean, k(1 − p)/p, a pre-requisite for overdispersion. More fundamentally, however, it can be shown from first principles that the negative binomial model is in fact the appropriate model for such phenomena.


We postpone until Section 9.1 of Chapter 9 the establishment of this result


because it requires the use of a probability model that will not be discussed
until then. In the meantime, Application Problem 8.28 presents an abbreviated
version of the original historical application.
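In practice, the two negative binomial parameters are often obtained by matching the sample mean m and sample variance s² to the theoretical moments of Eqs (8.57) and following; a method-of-moments sketch (valid only for overdispersed data, s² > m; the numbers passed in are purely illustrative):

    def nbinom_moment_fit(m, s2):
        """Match m = k(1-p)/p and s2 = k(1-p)/p**2; requires s2 > m."""
        p = m / s2                 # from s2/m = 1/p
        k = m ** 2 / (s2 - m)      # equivalently m*p/(1-p)
        return k, p

    # e.g., an overdispersed count data set with mean 0.47 and variance 0.69
    print(nbinom_moment_fit(0.47, 0.69))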

8.8 Summary and Conclusions

If proper analysis of randomly varying phenomena must begin with an appropriate model of the phenomenon in question, then the importance of this chapter's techniques and results to such analysis cannot be overstated. This chapter is where the generic pdf characterized extensively in Chapters 4 and 5 first begins to take on specific and distinct "personalities" in the form of unique and identifiable probability models for various discrete randomly varying phenomena. In developing these probability models, we began, in each case, with a description of the underlying phenomena; and to turn these descriptions into mathematical models, we invoked, to varying degrees, the ideas of the sample space and probability discussed in Chapter 3, the random variable and pdf of Chapters 4 and 5, and, in a limited number of cases, the techniques of random variable transformations of Chapter 6. The end result has been a wide array of probability models for various discrete random variables, the distinguishing characteristics associated with each model, and the application to which each model is most naturally suited.
The modeling exercises of this chapter have also provided insight into how the various models, and the random phenomena they represent, are related. For example, we now know that the geometric distribution, applicable to problems of chain length distribution in free-radical polymerization (among other applications), is a special case of the negative binomial distribution, itself a variation of the binomial distribution, which arises from a sum of n Bernoulli random variables. Similarly, we also know that random variables representing the number of occurrences of rare events in a finite interval of length, area, volume, or time, are mostly Poisson distributed, except when they are "overdispersed," in which case the negative binomial distribution is more appropriate.
Finally, we note that what began in this chapter for discrete random variables continues in the next chapter, employing the same approach, for continuous random variables. In fact, Chapter 9 picks up precisely where this chapter leaves off: with the Poisson random variable. It is advisable, therefore, before engaging with the material in the next chapter, that what has been learned from this chapter be consolidated by tackling as many of the exercises and application problems found at the end of this chapter as possible.
Here, and in Table 8.2, is a summary of the main characteristics, models, and other important features of the discrete random variables of this chapter.


Discrete uniform random variable, UD(k): a variable with k equiprobable outcomes.

Bernoulli random variable, Bn(p): the outcome of a Bernoulli trial, i.e., a trial with only two mutually exclusive outcomes, 0 or 1; "Success" or "Failure," etc., with respective probabilities p and 1 − p.

Hypergeometric random variable, H(n, Nd, N): the number of "defective" items found in a sample of size n drawn from a population of size N, of which the total number of "defectives" is Nd.

Binomial random variable, Bi(n, p): the total number of "successes" in n independent Bernoulli trials with probability of "success," p.

Multinomial random variable, MN(n, p₁, p₂, ..., pk): the total number of times each mutually exclusive outcome i; i = 1, 2, ..., k, occurs in n independent trials, with probability of a single occurrence of outcome i, pᵢ.

Negative binomial random variable, NBi(k, p): the total number of "failures" before the kth "success," with probability of "success," p.

Geometric random variable, G(p): the total number of "failures" before the 1st "success," with probability of "success," p.

Poisson random variable, P(λ): the total number of (rare) events occurring in a finite interval of length, area, volume, or time, with mean rate of occurrence, λ.

REVIEW QUESTIONS
1. What are the two primary benefits of the first principles approach to probability modeling as advocated in this chapter?
2. What are the four components of the model development and analysis strategy outlined in this chapter?
3. What are the basic characteristics of the discrete uniform random variable?
4. What is the probability model for the discrete uniform random variable?
5. What are some examples of a discrete uniform random variable?
6. What are the basic characteristics of the Bernoulli random variable?
7. What is the connection between the discrete uniform and the Bernoulli random variables?

TABLE 8.2: Summary of probability models for discrete random variables
(C(n, x) denotes the binomial coefficient n!/[x!(n − x)!]; q = 1 − p throughout.)

Uniform, UD(k):
  Model: f(xᵢ) = 1/k; i = 1, 2, ..., k
  Parameters: k;  Mean: (k + 1)/2;  Variance: (k − 1)(k + 1)/12
  Relations: UD(2) = Bn(0.5)

Bernoulli, Bn(p):
  Model: f(0) = 1 − p; f(1) = p
  Parameters: p;  Mean: p;  Variance: pq
  Relations: sum of n independent Bn(p) variables is Bi(n, p)

Hypergeometric, H(n, Nd, N):
  Model: f(x) = C(Nd, x) C(N − Nd, n − x)/C(N, n)
  Parameters: n, Nd, N;  Mean: nNd/N;  Variance: npq(N − n)/(N − 1), with p = Nd/N
  Relations: lim (N → ∞) H(n, Nd, N) = Bi(n, p)

Binomial, Bi(n, p):
  Model: f(x) = C(n, x) pˣ(1 − p)ⁿ⁻ˣ; x = 0, 1, ..., n
  Parameters: n, p;  Mean: np;  Variance: np(1 − p)
  Relations: lim (n → ∞) Bi(n, p) = P(λ), with np = λ

Multinomial, MN(n, pᵢ):
  Model: f(x₁, ..., xk) = [n!/(x₁! x₂! ··· xk!)] p₁^{x₁} p₂^{x₂} ··· pk^{xk}; Σxᵢ = n; Σpᵢ = 1
  Parameters: n, pᵢ;  Mean: npᵢ;  Variance: npᵢ(1 − pᵢ)
  Relations: marginal fᵢ(xᵢ) is Bi(n, pᵢ)

Negative Binomial, NBi(k, p):
  Model: f(x) = C(x + k − 1, k − 1) pᵏ(1 − p)ˣ; x = 0, 1, 2, ...
  Parameters: k, p;  Mean: k(1 − p)/p;  Variance: k(1 − p)/p²
  Relations: NBi(1, p) = G(p); lim (k → ∞) NBi(k, p) = P(λ), with λ = k(1 − p)/p

Geometric, G(p):
  Model: f(x) = p(1 − p)^{x−1}; x = 1, 2, ...
  Parameters: p;  Mean: 1/p;  Variance: (1 − p)/p²
  Relations: G(p) = NBi(1, p)

Poisson, P(λ):
  Model: f(x) = e^{−λ}λˣ/x!; x = 0, 1, 2, ...
  Parameters: λ;  Mean: λ;  Variance: λ
  Relations: lim (n → ∞) Bi(n, p) = P(λ); lim (k → ∞) NBi(k, p) = P(λ)


8. What is a Bernoulli trial?
9. What are the two versions of the probability model for the Bernoulli random variable?
10. What are the basic characteristics of the hypergeometric random variable?
11. What is the probability model for the hypergeometric random variable?
12. What do the parameters, n, Nd and N represent for the hypergeometric, H(n, Nd, N) random variable?
13. What are the basic characteristics of the binomial random variable?
14. What is the probability model for the binomial random variable?
15. What is the relationship between the hypergeometric and binomial random variables?
16. What is the relationship between the Bernoulli and binomial random variables?
17. Chebyshev's inequality was used to establish what binomial random variable result?
18. What are the basic characteristics of the trinomial random variable?
19. What is the probability model for the trinomial random variable?
20. What is the relationship between the trinomial and binomial random variables?
21. What is the probability model for the multinomial random variable?
22. What are the basic characteristics of the negative binomial random variable?
23. What is the probability model for the negative binomial random variable?
24. What is the connection between the negative binomial, Pascal and Polya distributions?
25. What is the relationship between the negative binomial and the geometric random variables?
26. What is the probability model for the geometric random variable?
27. The Poisson random variable can be obtained as a limiting case of which random variables, and in what specific ways?


29. What is the probability model for the Poisson random variable?
30. The Poisson model is most appropriate for what sort of phenomena?
31. What about the mean and the variance of the Poisson random variable is a distinguishing characteristic of this random variable?
32. What is an overdispersed Poisson-like phenomenon? Give a few examples.
33. What probability model is more appropriate for overdispersed Poisson-like phenomena, and why?

EXERCISES
8.1 Establish the results given in the text for the variance, MGF and CF of the Bernoulli random variable.
8.2 Given a hypergeometric H(n, Nd, N) random variable X for which n = 5, Nd = 2 and N = 10:
(i) Determine and plot the entire pdf, f(x), for x = 0, 1, 2, ..., 5
(ii) Determine P(X > 1) and P(X < 2)
(iii) Determine P(1 ≤ X ≤ 3)
8.3 A crate of 100 apples contains 5 that are rotten. A grocer purchasing the crate selects a sample of 10 apples at random and accepts the entire crate only if this sample contains no rotten apples. Determine the probability of accepting the crate. If the sample size is increased to 20, find the new probability of accepting the crate.
8.4 From the expression for the pdf of a binomial Bi(n, p) random variable, X, establish that E(X) = np and Var(X) = npq, where q = (1 − p).
8.5 Establish that if X₁, X₂, ..., Xₙ are n independent Bernoulli random variables, then:
$$X = \sum_{i=1}^{n} X_i$$
is a binomial random variable.


8.6 Given that X is a hypergeometric random variable with n = 10, Nd = 5, and N = 20, compute f(x) for x = 0, 1, 2, ..., 10, and compare with the corresponding f(x), x = 0, 1, 2, ..., 10, for a binomial random variable with n = 10, p = 0.25.
8.7 Obtain the recursion formula
$$f(x+1) = \phi(n, x, p) f(x)$$
for the binomial pdf, and show that
$$\phi(n, x, p) = \frac{(n-x)}{(x+1)} \frac{p}{(1-p)} \qquad (8.94)$$


Use this to determine the value x* for which the pdf attains a maximum. (Keep in mind that because f(x) is not a continuous function of x, the standard calculus approach of finding optima by taking derivatives and setting to zero is invalid. Explore the finite difference f(x+1) − f(x) instead.)
8.8 Given the joint pdf for the two-dimensional ordered pair (X₁, X₂) of the trinomial random variable (see Eq (8.41)), obtain the conditional pdfs f(x₁|x₂) and f(x₂|x₁).
8.9 Consider a chess player participating in a two-game pre-tournament qualification series. From past records in such games, it is known that the player has a probability pW = 0.75 of winning, a probability pD = 0.2 of drawing, and a probability pL = 0.05 of losing. If X₁ is the number of wins and X₂ is the number of draws, obtain the complete joint pdf f(x₁, x₂) for this player. From this, compute the marginal pdfs, f₁(x₁) and f₂(x₂), and finally obtain the conditional pdfs f(x₁|x₂) and f(x₂|x₁).
8.10 (i) Establish the equivalence of Eq (8.51) and Eq (8.52), and also the equivalence of Eq (8.53) and Eq (8.52) when k is a positive integer.
(ii) If the negative binomial random variable is defined as the total number of trials (not failures) required to obtain exactly k successes, obtain the probability model in this case and compare it to the model given in Eq (8.51) or Eq (8.52).
8.11 Obtain the recursion formula
$$f(x+1) = \psi(k, x, p) f(x) \qquad (8.95)$$
for the negative binomial pdf, showing an explicit expression for ψ(k, x, p). Use this expression to determine the value x* for which the pdf attains a maximum. (See comments in Exercise 8.7.) From this expression, confirm that the geometric distribution is monotonically decreasing.
8.12 (i) Establish that E(X) for the geometric random variable is 1/p and that Var(X) = q/p², where q = 1 − p.
(ii) Given that for a certain geometric random variable, P(X = 2) = 0.0475 and P(X = 10) = 0.0315, determine P(2 ≤ X ≤ 10).
(iii) The average chain length of a polymer produced in a batch reactor is given as 200 units, where chain length itself is known to be a geometric random variable. What fraction of the polymer product is expected to have chains longer than 200 units?
8.13 The logarithmic series random variable possesses the distribution
$$f(x) = \frac{\alpha p^x}{x}; \quad 0 < p < 1; \; x = 1, 2, \ldots \qquad (8.96)$$
First show that the normalizing constant is given by:
$$\alpha = \frac{-1}{\ln(1-p)} \qquad (8.97)$$
and then establish the following mathematical characteristics of this random variable and its pdf:

Mean: E(X) = αp/(1 − p)
Variance: Var(X) = αp(1 − αp)(1 − p)^{−2}
Moment generating function: M(t) = ln(1 − peᵗ)/ln(1 − p)
Characteristic function: φ(t) = ln(1 − pe^{jt})/ln(1 − p)
8.14 Establish that in the limit as k → ∞, the pdf for the negative binomial NBi(k, k/(k + λ)) random variable becomes the pdf for the Poisson P(λ) random variable.
8.15 Obtain the recursion formula
$$f(x+1) = \eta(\lambda, x) f(x) \qquad (8.98)$$
for the Poisson pdf, showing an explicit expression for η(λ, x). Use this expression to confirm that for all values of 0 < λ < 1, the Poisson pdf is always monotonically decreasing. Find the value x* for which the pdf attains a maximum for λ > 1. (See comments in Exercise 8.7.)
8.16 (i) Obtain the complete pdf, f(x), for the binomial random variable with n = 10, p = 0.05, for x = 0, 1, ..., 10, and compare it to the corresponding pdf, f(x), for a Poisson random variable with λ = 0.5.
(ii) Repeat (i) for n = 20, p = 0.5 for the binomial random variable, and λ = 10 for the Poisson random variable.
8.17 Show that if Xᵢ, i = 1, 2, ..., n, are n independent Poisson random variables each with parameter λᵢ, then the random variable Y defined as:
$$Y = \sum_{i=1}^{n} X_i$$
is also a Poisson random variable, with parameter λ = Σᵢ₌₁ⁿ λᵢ.

8.18 The number of yarn breaks per shift in a commercial fiber spinning machine is a Poisson variable with λ = 3. Determine the probability of not experiencing any yarn break in a particular shift. What is the probability of experiencing more than 3 breaks per shift?
8.19 The probability of finding a single "fish-eye" gel particle (a solid blemish) on a sq cm patch of a clear adhesive polymer film is 0.0002; the probability of finding more than one is essentially zero. Determine the probability of finding 3 or more such blemishes on a 1 square meter roll of film.
8.20 For a Poisson P(λ) random variable, determine P(X ≤ 2) for λ = 0.5, 1, 2, 3. Does the observed behavior of P(X ≤ 2) as λ increases make sense? Explain.
8.21 The number of eggs laid by a particular bird per mating season is a Poisson random variable X, with parameter λ. The probability that any such egg successfully develops into a hatchling is p (the probability that it does not survive is (1 − p)). Assuming mutual independence of the development of each egg, if Y is the random variable representing the number of surviving hatchlings, establish that its probability distribution function is given by:
$$f(y) = \frac{e^{-\lambda p} (\lambda p)^y}{y!} \qquad (8.99)$$

APPLICATION PROBLEMS
8.22 (i) A batch of 15 integrated-circuit chips contains 4 of an irregular type. If from this batch 2 chips are selected at random, and without replacement, find the probability that: (a) both are irregular; (b) none is irregular; (c) only one of the two is irregular.
(ii) If the random variable in problem (i) above was mistakenly taken to be a binomial random variable (which it is not), recalculate the three probabilities and compare the corresponding results.
8.23 A pump manufacturer knows from past records that, in general, the probability of a certain specialty pump working continuously for fewer than 2 years is 0.3; the probability that it will work continuously for 2 to 5 years is 0.5; and the probability that it will work for more than 5 years is 0.2. An order of 8 such pumps has just been sent out to a customer: find the probability that two will work for fewer than 2 years, five will work for 2 to 5 years, and one will work for more than 5 years.
8.24 The following strategy was adopted in an attempt to determine the size, N, of the population of an almost extinct population of rare tigers in a remote forest in southeast Asia. At the beginning of the month, 50 tigers were selected from the population, tranquilized, tagged and released; assuming that a month is sufficient time for the tagged sample to become completely integrated with the entire population, at the end of the month, a random sample of n = 10 tigers was selected, two of which were found to have tags.
(i) What does this suggest as a reasonable estimate of N? Identify two key potential sources of error with this strategy.
(ii) If X is the random variable representing the total number of tagged tigers found in the sample of n taken at the end of the month, clearly, X is a hypergeometric random variable. However, given the comparatively large size we would expect of N (the unknown tiger population size), it is entirely reasonable to approximate X as a binomial random variable with a "probability of success" parameter p. Compute, for this (approximately) binomial random variable, the various probabilities that X = 2 out of the sampled n = 10 when p = 0.1, p = 0.2 and p = 0.3. What does this indicate to you about the more likely value of p for the tiger population?
(iii) In general, for the binomial random variable X in (ii) above, given data that x = 2 "successes" were observed in n = 10 trials, show that the probability that X = 2 is maximized if p = 0.2.
8.25 The number of contaminant particles (flaws) found on each standard size silicon wafer produced at a certain manufacturing site is a random variable, X, that a quality control engineer wishes to characterize. A sample of 30 silicon wafers was selected and examined for flaws; the result (the number of flaws found on each wafer) is displayed in the table below.


4 1 2 3 2 1 2 4 0 1
3 0 0 2 3 0 3 2 1 2
3 4 1 1 2 2 5 3 1 1

(i) From this data set, obtain an empirical frequency distribution function, fE(x), and compute E(X), the expected value of the number of flaws per wafer.
(ii) Justifying your choice adequately but succinctly, postulate an appropriate theoretical probability model for this random variable. Using the result obtained in (i) above for E(X), rounded to the nearest integer, compute from your theoretical model the probability that X = 0, 1, 2, 3, 4, 5 and 6, and compare these theoretical probabilities to the empirical ones from (i).
(iii) Wafers with more than 2 flaws cannot be sold to customers, resulting in lost revenue; and the manufacturing process ceases to be economically viable if more than 30% of the produced wafers fall into this category. From your theoretical model, determine whether or not the particular process giving rise to this data set is still economically viable.
8.26 An ensemble of ten identical pumps arranged in parallel is used to supply water to the cooling system of a large, exothermic batch reactor. The reactor (and hence the cooling system) is operated for precisely 8 hrs every day; and the data set shown in the table below is the total number of pumps functioning properly (out of the ten) on any particular 8-hr operating-day, for the entire month of June.
Generate a frequency table for this data set and plot the corresponding histogram. Postulate an appropriate probability model. Obtain a value for the average number of pumps (out of 10) that are functioning every day; use this value to obtain an estimate of the model parameters; and from this compute a theoretical pdf. Compare the theoretical pdf with the relative frequency distribution obtained from the data and comment on the adequacy of the model.
Day (in June):     1   2   3   4   5   6   7   8   9  10
Available Pumps:   9  10   9   8   7   7   9   7   7   8

Day (in June):    11  12  13  14  15  16  17  18  19  20
Available Pumps:   6   8   9   9   8   9   7   7   7   5

Day (in June):    21  22  23  24  25  26  27  28  29  30
Available Pumps:   8   9   4   9   9  10   8   5   8   8

8.27 In a study of the failure of pumps employed in the standby cooling systems of commercial nuclear power plants, Atwood (1986)² determined as 0.16 the probability that a pump selected at random in any of these power plants will fail. Consider a system that employs 8 of these pumps but which, for full effectiveness, really requires only 4 to be functioning at any time.

²Atwood, C.L., (1986). "The binomial failure rate common cause model," Technometrics, 28, 139-148.
(i) Determine the probability that this particular cooling system will function with full effectiveness.
(ii) If a warning alarm is set to go off when there are five or fewer pumps functioning at any particular time, what is the number of times this alarm is expected to go off in a month of 30 days? State any assumptions you may need to make.
(iii) If the probability of failure increases to 0.2 for each pump, what is the percentage increase in the probability that four or more pumps will fail?
8.28 The table below contains data from Greenwood and Yule, 1920³, showing the frequency of accidents occurring, over a five-week period, to 647 women making high explosives during World War I.

Number of Accidents   Observed Frequency
        0                    447
        1                    132
        2                     42
        3                     21
        4                      3
        5+                     2

(i) For this clearly Poisson-like phenomenon, let X be the random variable representing the number of accidents. Determine the mean and the variance of X. What does this indicate about the possibility that this may in fact not be a true Poisson random variable?
(ii) Use the value computed for the data average as representative of λ for the Poisson distribution and obtain the theoretical Poisson model prediction of the frequency of occurrences. Compare with the observed frequency.
(iii) Now consider representing this phenomenon as a negative binomial random variable. Determine k and p from the computed data average and variance; obtain a theoretical prediction of the frequency of occurrence based on the negative binomial model and compare with the observed frequency.
(iv) To determine, objectively, which probability model provides a better fit to this data set, let fᵢᵒ represent the observed frequency associated with the ith group, and let φᵢ represent the corresponding theoretical (expected) frequency. For each model, compute the index,
$$C^2 = \sum_{i=1}^{m} \frac{(f_i^o - \phi_i)^2}{\phi_i} \qquad (8.100)$$
(For reasons discussed in Chapter 17, it is recommended that group frequencies should not be smaller than 5; as such, the last two groups should be lumped into one group, x ≥ 4.) What do the results of this computation suggest about which model provides a better fit to the data? Explain.
8.29 Sickle-cell anemia, a serious condition in which the body makes sickle-shaped
³Greenwood, M. and Yule, G.U. (1920). "An enquiry into the nature of frequency distributions representative of multiple happenings with particular reference of multiple attacks of disease or of repeated accidents," Journal of the Royal Statistical Society, 83, 255-279.


red blood cells, is an inherited disease. People with the disease inherit two copies of the sickle cell gene, one from each parent. On the other hand, those who inherit only one sickle cell gene from one parent and a normal gene from the other parent have a condition called "sickle cell trait"; such people are sometimes called "carriers" because while they do not have the disease, they nevertheless "carry" one of the genes that cause it and can pass this gene to their children.
Theoretically, if two "sickle-cell carriers" marry, the probability of producing an offspring with the sickle-cell disease is 0.25, while the probability of producing offsprings who are themselves carriers is 0.5; the probability of producing children with a full complement of normal genes is 0.25.
(i) If a married couple who are both carriers have four children, what is the joint probability distribution of the number of children with sickle cell anemia and the number of carriers?
(ii) From this joint probability distribution, determine the probabilities of having (a) no children with the disease and 2 carriers; (b) 1 child with the disease and 2 carriers; (c) two children with the disease and 2 carriers.
(iii) On the condition that exactly one of the 4 children is a carrier, determine the probability of having (a) no children with the disease; (b) 1 child with the disease; (c) 2 children with the disease; and (d) 3 children with the disease.
8.30 Revisit Application Problem 8.29 above and now consider that a married couple, both of whom are carriers, lives in a country where there is no health care coverage, so that each family must cover its own health care costs. The couple, not knowing that they are both carriers, proceed to have 8 children. A child with the sickle-cell disease will periodically experience episodes called "crises" that will require hospitalization and medication for the symptoms (there are no cures yet for the disease). Suppose that it costs the equivalent of US$2,000 a year in general hospital costs and medication to treat a child with the disease.
(i) What annual sickle-cell disease-related medical cost can this family expect to incur?
(ii) If these "crisis" episodes occur infrequently at an average rate of 1.5 per year in this country (3 every two years), and these occurrences are well-modeled as a Poisson-distributed random variable, what is the probability that this family will have to endure a total of 3 crisis episodes in one year? Note that only a child with the disease can have a crisis episode. (Hint: See Exercise 8.21.)
8.31 When a rare respiratory disease with a long incubation period infects a population of people, there is only a probability of 1/3 that an infected patient will show the symptoms within the first month. When five such symptomatic patients showed up in the only hospital in a small town, the astute doctor who treated these patients knew immediately that more will be coming in the next few months as the remaining infected members of the population begin to show symptoms. Assume that all symptomatic patients will eventually come to this one hospital.
(i) Postulate an appropriate probability model and use it to determine the most "likely" number of infected but not yet symptomatic patients, where x* is considered the most "likely" number if P(X = x*) is the highest for all possible values of x.
(ii) Because of the nature of the disease, if a total of more than 15 people are infected, the small town will have to declare a state of emergency. What is the probability of

this event happening?


8.32 According to the 1986 Statistical Abstracts of the United States, in the five-year period from 1977-1981, failures among banks insured by the Federal Deposit Insurance Corporation (FDIC) averaged approximately 8.5 per year. Specifically, 10 failures were reported in 1980. If FDIC-insured bank failures are considered rare events,
(i) Postulate an appropriate model for the random variable X representing the total number of FDIC-insured bank failures per year, and use this model to compute the probability of observing the number of failures reported in 1980.
(ii) What is the most "likely" number of failures in any one year, if this number, x*, is so designated because f(x*) is the highest probability of all possible values of x? Determine the probability of having more failures in one year than this most "likely" number of failures.
(iii) An FDIC quality control inspector suggested that the occurrence of 13 or more failures in one year should be considered "cause for alarm." What is the probability of such an event occurring, and why do you think that such an event should truly be a cause for alarm?
8.33 According to the Welders Association of America, during the decade from 1980-1989, 40% of injuries incurred by its members were to the eye; 22% were to the hand; 20% to the back; and the remaining 18% of injuries were categorized as "others." Stating whatever assumptions are necessary,
(i) Determine the probability of recording 4 eye injuries, 3 hand injuries, 2 back injuries, and 1 injury of the "other" variety.
(ii) Of 5 total recorded injuries, what is the probability that fewer than 2 are eye injuries?
(iii) Because eye injuries are the most prevalent, and the most costly to treat, it is desired to reduce their occurrences by investing in eye-safety training programs for the association's members. What target probability of a single occurrence of an eye injury should the program aim for in order to achieve an overall objective of increasing to approximately 0.9 the probability of observing fewer than 2 eye injuries?
8.34 On January 28, 1986, on what would be the space shuttle program's 25th mission, the space shuttle Challenger exploded. The cause of the accident has since been identified as a failure in the O-ring seal in the solid-fuel booster rocket. A 1983 study commissioned by the Air Force concluded, among other things, that the probability of a catastrophic space shuttle accident due to booster rocket failure is 1/35. Stating whatever assumptions are necessary,
(i) Determine the probability of attempting 25 missions before the first catastrophe attributable to a booster rocket failure occurs.
(ii) Determine the probability that the first catastrophe attributable to a booster rocket failure will occur within the first 25 mission attempts (i.e. on or before the 25th mission). What does this result imply about the plausibility of the occurrence of this catastrophe at that particular point in time in the history of the space shuttle program?
(iii) An independent NASA study published in 1985 (just before the accident) claimed that the probability of such a catastrophe happening was 1/60,000. Repeat (ii) above using this value instead of 1/35. In light of the historical fact of the

Jan 28, 1986 incident, discuss which estimate of the probability of the catastrophe's occurrence is more believable.

8.35 A study in Kalbeisch et al., 19914 reported that the number of warranty
claims for one particular system on a particular automobile model within a year
of purchase is well-modeled as a Poisson distributed random variable, X, with an
average rate of = 0.75 claims per car.
(i) Determine the probability that there are two or fewer warranty claims on this
specic system for a car selected at random.
(ii) Consider a company that uses the warranty claims on the various systems of
the car within the first year of purchase to rate cars for their initial quality. This
company wishes to use the Kalbfleisch et al. results to set an upper limit, x_u, on
the number of warranty claims whereby a car is declared of poor initial quality
if the number of claims equals or exceeds this number. Determine the value of x_u
such that, given λ = 0.75, the probability of purchasing a car which, by pure chance
alone, will generate more than x_u warranty claims, is 0.05 or less.
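For readers who wish to verify their answer to (ii) numerically, a minimal Python sketch (assuming the scipy library is available; the variable names are illustrative) scans the Poisson survival function for the smallest threshold meeting the 0.05 target:

    from scipy.stats import poisson

    lam = 0.75          # average warranty-claim rate per car
    x_u = 0
    # Find the smallest x_u with P(X > x_u) <= 0.05
    while poisson.sf(x_u, lam) > 0.05:
        x_u += 1
    print(x_u, poisson.sf(x_u, lam))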

8.36 In the 1940s, the entomologist S. Corbet catalogued the butterflies of Malaya
and obtained the data summarized in the table below. The data table shows x, a
count of the number of species, x = 1, 2, . . . , 24, and the associated actual number
of butterflies caught in light-traps in Malaya that have x number of species. For
example, there were 118 single-species butterflies; 74 two-species butterflies (for a
total of 148 of such butterflies), etc., and the last entry indicates that there were 3
categories of 24-species butterflies. Corbet later approached the celebrated R. A.
Fisher for assistance in analyzing the data. The result, presented in Fisher et al.,
1943⁵, is a record of how the logarithmic series distribution (see Exercise 8.13) was
developed as a model for describing species abundance.
Given the characteristics of the logarithmic series distribution in Exercise 8.13,
obtain an average for the number of species, x̄, and use this to obtain a value for
the parameter p (and hence also α). Obtain a predicted frequency, Φ̂(x), and compare
with the values observed in the Corbet data. Comment on the adequacy of this
probability model that is now widely used by entomologists for characterizing the
distribution of species abundance.

⁴Kalbfleisch, J.D., Lawless, J.F., and Robinson, J.A. (1991). "Methods for the analysis
and prediction of warranty claims." Technometrics, 33, 273–285.
⁵Fisher, R. A., S. Corbet, and C. B. Williams. (1943). "The relation between the number
of species and the number of individuals in a random sample of an animal population."
Journal of Animal Ecology, 12, 42–58.

x (No of species)            1    2    3    4    5    6    7    8
Φ(x) (Observed Frequency)  118   74   44   24   29   22   20   19

x (No of species)            9   10   11   12   13   14   15   16
Φ(x) (Observed Frequency)   20   15   12   14    6   12    6    9

x (No of species)           17   18   19   20   21   22   23   24
Φ(x) (Observed Frequency)    9    6   10   10   11    5    3    3
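For readers who wish to attempt this numerically, the following Python sketch (assuming numpy and scipy are available; all names are illustrative) fits the logarithmic series model of Exercise 8.13, f(x) = αp^x/x with α = −1/ln(1 − p), by matching its mean, αp/(1 − p), to the sample mean of the Corbet data:

    import numpy as np
    from scipy.optimize import brentq

    x = np.arange(1, 25)
    freq = np.array([118, 74, 44, 24, 29, 22, 20, 19, 20, 15, 12, 14,
                     6, 12, 6, 9, 9, 6, 10, 10, 11, 5, 3, 3])

    xbar = (x * freq).sum() / freq.sum()          # sample mean of x

    # Mean of the logarithmic series distribution as a function of p
    mean_ls = lambda p: -p / ((1 - p) * np.log(1 - p))

    # Solve mean_ls(p) = xbar for p on (0, 1)
    p = brentq(lambda q: mean_ls(q) - xbar, 1e-9, 1 - 1e-9)
    alpha = -1 / np.log(1 - p)

    phi_hat = freq.sum() * alpha * p**x / x       # predicted frequencies
    print(p, np.round(phi_hat, 1))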

Chapter 9
Ideal Models of Continuous Random Variables

9.1 Gamma Family Random Variables
    9.1.1 The Exponential Random Variable
          Basic Characteristics and Model Development
          The Model and Some Remarks
          Important Mathematical Characteristics
          Applications
    9.1.2 The Gamma Random Variable
          Basic Characteristics and Model Development
          The Model and Some Remarks
          Important Mathematical Characteristics
          Applications
    9.1.3 The Chi-Square Random Variable
          Basic Characteristics and Model
          Important Mathematical Characteristics
          Applications
    9.1.4 The Weibull Random Variable
          Basic Characteristics and Model Development
          The Model and Some Remarks
          Important Mathematical Characteristics
          Applications
    9.1.5 The Generalized Gamma Model
    9.1.6 The Poisson-Gamma Mixture Distribution
9.2 Gaussian Family Random Variables
    9.2.1 The Gaussian (Normal) Random Variable
          Background and Model Development
          The Model and Some Remarks
          Important Mathematical Characteristics
          Applications
    9.2.2 The Standard Normal Random Variable
          Important Mathematical Characteristics
    9.2.3 The Lognormal Random Variable
          Basic Characteristics and Model Development
          Important Mathematical Characteristics
          Applications
    9.2.4 The Rayleigh Random Variable
          Important Mathematical Characteristics
          Applications
    9.2.5 The Generalized Gaussian Model
9.3 Ratio Family Random Variables
    9.3.1 The Beta Random Variable
          Basic Characteristics and Model Development
          The Model and Some Remarks
          Important Mathematical Characteristics
          The Many Shapes of the Beta Distribution
          Applications
    9.3.2 Extensions and Special Cases of the Beta Random Variable
          Generalized Beta Random Variable
          Inverted Beta Random Variable
    9.3.3 The (Continuous) Uniform Random Variable
          Basic Characteristics, Model and Remarks
          Important Mathematical Characteristics
          Applications
    9.3.4 Fisher's F Random Variable
          Basic Characteristics, Model and Remarks
          Important Mathematical Characteristics
          Applications
    9.3.5 Student's t Random Variable
          Basic Characteristics, Model and Remarks
          Important Mathematical Characteristics
          Applications
    9.3.6 The Cauchy Random Variable
          Basic Characteristics, Model and Remarks
          Important Mathematical Characteristics
          Applications
9.4 Summary and Conclusions
REVIEW QUESTIONS
EXERCISES
APPLICATION PROBLEMS

Facts which at first seem improbable
will, even on scant explanation,
drop the cloak which has hidden them
and stand forth in naked and simple beauty.
Galileo Galilei (1564–1642)

The presentation of ideal models of randomly varying phenomena that began
with discrete random variables in the last chapter concludes in this chapter
with the same first principles approach applied to continuous random variables.
The random variables encountered in this chapter include some of the
most celebrated in applied statistics, celebrated because of the central role
they play in the theory and practice of statistical inference. Some of these
random variables and their pdfs may therefore already be familiar, even to
the reader with only rudimentary prior knowledge. But our objective is not
merely to familiarize the reader with continuous random variables and their
pdfs; it is to reveal the subterraneous roots from which these pdfs sprang,
especially the familiar ones. To this end, and as was the case with the discrete
random variables of the previous chapter, each model will be derived from the
underlying phenomenological characteristics, beginning with the simplest and
building up to models for more complex random variables that are themselves
functions of several simpler random variables.
The upcoming discussion, even though not exhaustive, is still quite extensive
in scope. Because we will be deriving probability models for more than
15 continuous random variables of practical importance, to facilitate the
discussion and also promote fundamental understanding, these random variables
and their pdfs will be presented in "families": cohort groups that share
common structural characteristics. That the starting point of the derivations
for the most basic of these families of continuous random variables is a discrete
random variable may be somewhat surprising at first, but this is merely
indicative of the sort of intriguing connections (some obvious, others not)
between these random variables, both continuous and discrete. A chart included
at the end of the chapter summarizes these connections and places in context
how all the random variables discussed in these two chapters are related to
one another.

9.1 Gamma Family Random Variables

Our discussion of continuous random variables and their probability models
begins with the Gamma Family whose 4 distinct members, from the
simplest (in terms of underlying phenomena) to the most complex, are:
• The Exponential random variable,
• The Gamma random variable,
• The Chi-square random variable, and
• The Weibull random variable.
These random variables are grouped together because they share many
common structural characteristics, the most basic being non-negativity: they
all take values restricted to the positive real line, i.e., 0 < x < ∞. Not
surprisingly, they all find application in problems involving intrinsically
non-negative entities. Specifically, 3 of the 4 (Exponential, Gamma, and Weibull)
frequently find application in system reliability and lifetime studies, which
involve "waiting times" until some sort of "failure." Structurally, these three
are much closer together than the fourth, the Chi-square, which finds application
predominantly in problems involving a different class of non-negative variables:
mostly "squared" variables, including variances. Its membership in the family is
by virtue of being a highly specialized (and somewhat unusual) case of the gamma
random variable.

9.1.1 The Exponential Random Variable

Basic Characteristics and Model Development


Let us pick up where we left off in Chapter 8 by considering Poisson events
occurring at a constant intensity η (the mean number of "successes" per
unit interval size). We wish to consider the random variable, X, representing
the total interval size (length of time, spatial length, area, volume, etc.) until we
observe the first occurrence of such Poisson events since the last observation,
i.e., the inter-event interval size. (For example, the lifetime of a light bulb,
the elapsed time until the filament burns out; or the elapsed time in between
the arrival of successive customers at a small-town post office counter; the
distance between successive flaws on a piece of fibre-optic cable; etc.) The
random variable, X, defined in this manner is an exponential random variable,
and its model may be derived from this simple description of its fundamental
characteristics as follows.
First, without loss of generality, and simply to help maintain focus on
essentials, we shall consider the interval over which the events are happening
as time, even though it could equally well be length or area or volume. Let us
then consider the random variable, T, the waiting time until we observe the
first occurrence of these Poisson events.
Now, let Y(t) be the random variable representing the total number of
occurrences in the interval (0, t), by definition a discrete Poisson random
variable, P(ηt). If, as stated above, T is the time to the first occurrence, then
observe that the following probability statement is true:

P[Y(t) < 1] = P[T > t]        (9.1)

because the two mathematical events, E1 = {y : Y(t) < 1} and E2 = {t : T > t},
are equivalent. In words: if Y(t) is not yet 1 (event E1), i.e., if we have not
yet observed 1 occurrence, then it is because the current time, t, is less than
the waiting time until the first occurrence (event E2); or equivalently, we have
not waited "long enough" because T, the waiting time to the first occurrence,
is longer than current time t.
Since Y is a Poisson random variable with intensity η, we know that:

f(y) = (ηt)^y e^{−ηt} / y!;  y = 0, 1, 2, . . . ,        (9.2)

and since P[Y(t) < 1] = P[Y(t) = 0], we obtain from Eq (9.2) that the
expression in Eq (9.1) immediately becomes:

P[T > t] = e^{−ηt}, or
1 − F_T(t) = e^{−ηt}        (9.3)

where F_T is the cumulative distribution function of the random variable T,
so that:

F_T(t) = 1 − e^{−ηt}        (9.4)


Upon differentiating once with respect to t, we obtain the required pdf as

f(t) = ηe^{−ηt};  0 < t < ∞,        (9.5)

the pdf for the exponential random variable, T, the waiting time until the first
occurrence of Poisson events occurring at a constant mean rate, η. This result
generalizes straightforwardly from time to spatial intervals, areas, volumes,
etc.
The Model and Some Remarks
In general, the expression

f(x) = ηe^{−ηx};  0 < x < ∞        (9.6)

or, for β = 1/η,

f(x) = (1/β)e^{−x/β};  0 < x < ∞        (9.7)

is the pdf for an exponential random variable, E(β).


Recall that we had encountered this same pdf earlier in Chapter 2 as a
model for the distribution of residence times in a perfectly mixed continuous
stirred tank reactor (CSTR). That model was derived strictly from chemical
engineering principles of material balance, with no appeal to probability; this
chapter's model arose directly from probabilistic arguments originating from a
discrete random variable model, with no appeal to physics or engineering. This
connection between the highly specialized chemical engineering model and
the generic exponential pdf emphasizes the "waiting time" phenomenological
attribute of the exponential random variable.
It is also important to note that the geometric distribution discussed in
Chapter 8 is a discrete analog of the exponential distribution. The former's
phenomenological attribute (the number of discrete trials until the occurrence
of a "success") is so obviously the discrete equivalent of the continuous
interval size until the occurrence of a Poisson "success" that characterizes the
latter. Readers familiar with process dynamics and control might also see in
the exponential pdf an expression that reminds them of the impulse response
of a linear, first-order system with steady-state gain 1 and time constant, β:
in fact the two expressions are identical (after all, the expression in Chapter
2 was obtained as a response of a first-order ODE model to an impulse stimulus
function). For the purposes of the current discussion, however, the relevant
point is that these same readers may now observe that the discrete-time
(sampled-data) version of this impulse response bears the same comparison to
the expression for the geometric pdf. Thus, the geometric pdf is to the
exponential pdf precisely what the discrete-time (sampled-data) first-order
system impulse response function is to the continuous-time counterpart.


FIGURE 9.1: Exponential pdfs for various values of the parameter β.


Important Mathematical Characteristics
The following are some important mathematical characteristics of the
exponential random variable, E(β), and its pdf:
1. Characteristic parameter: β (or η = 1/β), the scale parameter; it
determines how wide the distribution is, with larger values of β
corresponding to wider distributions (see Fig 9.1).
2. Mean: μ = E(X) = β.
(Other measures of central location: Mode = 0; Median = β ln 2.)
3. Variance: σ²(X) = β²
4. Higher Moments: Coefficient of Skewness: γ₃ = 2; Coefficient of
Kurtosis: γ₄ = 9, implying that the distribution is positively skewed and
sharply peaked. (See Fig 9.1).
5. Moment generating and Characteristic functions:

M(t) = 1/(1 − βt)        (9.8)
φ(t) = 1/(1 − jβt)        (9.9)

6. Survival function: S(x) = e^{−x/β};
Hazard function: h(x) = 1/β = η


Applications
The exponential pdf, not surprisingly, finds application in problems involving
waiting times to the occurrence of simple events. As noted earlier, it
is most recognizable to chemical engineers as the theoretical residence time
distribution function for ideal CSTRs; it also provides a good model for the
distribution of time intervals between arrivals at a post office counter, or
between phone calls at a customer service center.
Since equipment (or system) reliability and lifetimes can be regarded as
waiting times until failure of some sort or another, it is also not surprising
that the exponential pdf is utilized extensively in reliability and life testing
studies. The exponential pdf has been used to model lifetimes of simple devices
and of individual components of more complex ones. In this regard, it is
important to pay attention to the last characteristic shown above: the constant
hazard function, h(x). Recall from Chapter 4 that the hazard function allows
one to compute the probability of surviving beyond time t, given survival up
to time t. The constant hazard function indicates that for the exponential
random variable, the risk of future failure is independent of current time. The
exponential pdf is therefore known as a "memoryless" distribution, the only
distribution with this characteristic.
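The memoryless property is easy to confirm numerically; a minimal Python sketch (assuming numpy is available; names are illustrative) compares P(X > s + t | X > s) with P(X > t):

    import numpy as np

    beta = 2.0            # scale parameter (mean waiting time)
    s, t = 1.5, 3.0       # current "age" s; additional waiting time t

    surv = lambda x: np.exp(-x / beta)   # survival function S(x) = e^(-x/beta)

    # Conditional probability of surviving an additional t, given survival to s
    print(surv(s + t) / surv(s), surv(t))   # the two values coincide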
Example 9.1 WAITING TIME AT SERVICE STATION
The total number of trucks arriving at an all-night service station over
the 10:00 pm to 6:00 am night shift is known to be a Poisson random
variable with an average hourly arrival rate of 5 trucks/hour. The waiting
time between successive arrivals (idle time for the service station
workers) therefore has the exponential distribution:

f(x) = (1/β)e^{−x/β};  0 < x < ∞        (9.10)

where β = 1/5 hours. If the probability is exactly 0.5 that the waiting
time between successive arrivals is less than β̃ hours, find the value of β̃.
Solution:
The problem statement translates to:

P[X < β̃] = ∫_0^{β̃} 5e^{−5x} dx = 0.5        (9.11)

which, upon carrying out the indicated integration, yields:

−e^{−5x} |_0^{β̃} = 0.5        (9.12)

or,

1 − e^{−5β̃} = 0.5        (9.13)

which simplifies to give the final desired result:

β̃ = −(ln 0.5)/5 = 0.139        (9.14)

Note, of course, that by definition, β̃ is the median of the given pdf. The
practical implication of this result is that, in half of the arrivals at the
service station during this night shift, the waiting (idle) time in between
arrivals (on average) will be less than β̃ = 0.139 hours (about 8.3 minutes);
the waiting time will be longer than β̃ for the other half of the arrivals.
Such a result can be used in practice in many different ways: for example,
the owner of the service station may use this to decide when to hire extra help
for the shift, say when the median idle time exceeds a predetermined
threshold.
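This median can be checked with one line of Python (a sketch assuming scipy is available):

    from scipy.stats import expon

    # Exponential with scale beta = 1/5 hours (arrival rate of 5 trucks/hour)
    print(expon.median(scale=1/5))   # 0.1386... hours, i.e., -(ln 0.5)/5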

9.1.2 The Gamma Random Variable

Basic Characteristics and Model Development


As a direct generalization of the exponential random variable, consider
X, the random variable dened as the interval size until the k th occurrence
of Poisson events occurring at a constant intensity (e.g., waiting time
until the occurrence of k independent events occurring in time at a constant
mean rate ).
Again, without loss of generality, and following precisely the same arguments as presented for the exponential random variable, let Y (t) be the total
number of occurrences in the interval (0, t), but this time, let the random
variable, T , be the time to the k th occurrence. In this case, the following
probability statement is true:
P [Y (t) < k] = P [T > t]

(9.15)

In terms of the implied mathematical events, this statement translates as
follows: if Y(t) is not yet k (i.e., if we have not yet observed as many as k
total occurrences in the interval (0, t)), then it is because the current time, t,
is less than the waiting time until the kth occurrence (i.e., we have not waited
"long enough" because T, the waiting time to the kth occurrence, is longer than
current time t).
Again, since Y is a Poisson random variable with intensity η, we know
that:

P[Y(t) < k] = P[Y(t) ≤ (k − 1)] = Σ_{y=0}^{k−1} e^{−ηt} (ηt)^y / y!        (9.16)

The next step in our derivation requires the following result for evaluating the
indicated summation:

Σ_{y=0}^{k−1} e^{−ηt} (ηt)^y / y! = (1/(k − 1)!) ∫_{ηt}^∞ e^{−z} z^{k−1} dz        (9.17)

This result may be obtained (see Exercise 9.6) by first defining:

I(k) = ∫_a^∞ e^{−z} z^{k−1} dz        (9.18)


and then showing that:

I(k) = a^{k−1} e^{−a} + (k − 1)I(k − 1)        (9.19)

from where, by recursion, one may then obtain (with a = ηt):

I(k)/(k − 1)! = Σ_{y=0}^{k−1} e^{−ηt} (ηt)^y / y!        (9.20)

And now, if F_T(t) is the cumulative distribution function of the random
variable, T, then the right hand side of Eq (9.15) may be rewritten as

P[T > t] = 1 − F_T(t)        (9.21)

so that the complete expression in Eq (9.15) becomes

1 − F_T(t) = (1/(k − 1)!) ∫_{ηt}^∞ e^{−z} z^{k−1} dz        (9.22)

Upon differentiating with respect to t, using Leibnitz's formula for
differentiating under the integral sign, i.e.,

d/dx ∫_{A(x)}^{B(x)} f(x, r) dr = ∫_{A(x)}^{B(x)} [∂f(x, r)/∂x] dr + f(x, B) dB/dx − f(x, A) dA/dx        (9.23)

we obtain:

f(t) = (1/Γ(k)) η e^{−ηt} (ηt)^{k−1}        (9.24)

where Γ(k) is the Gamma function (to be defined shortly), and we have used
the fact that, for integer k,

(k − 1)! = Γ(k)        (9.25)

The final result,

f(t) = (η^k/Γ(k)) e^{−ηt} t^{k−1};  0 < t < ∞        (9.26)

is the pdf for the waiting time to the kth occurrence of independent Poisson
events occurring at an average rate η; it is a particular case of the pdf for a
gamma random variable, generalized as follows.
The Model and Some Remarks
The pdf for the gamma random variable, X, is given in general by

f(x) = (1/(β^α Γ(α))) e^{−x/β} x^{α−1};  0 < x < ∞        (9.27)


the same expression as in Eq (9.26), with β = 1/η, and with k replaced by a
general real number α not restricted to be an integer.
Some remarks are in order here. First, the random variable name arises
from the relationship between the pdf in Eq (9.27) and the Gamma function
defined by:

Γ(α) = ∫_0^∞ e^{−y} y^{α−1} dy        (9.28)

If we let y = x/β, we obtain:

Γ(α) = (1/β^α) ∫_0^∞ e^{−x/β} x^{α−1} dx        (9.29)

with the immediate implication that:

1 = (1/(β^α Γ(α))) ∫_0^∞ e^{−x/β} x^{α−1} dx        (9.30)

indicating that the function being integrated on the RHS is a density function.
Note also from Eq (9.28) that, via integration by parts, one can establish the
following well-known recursion property of the gamma function:

Γ(α + 1) = αΓ(α)        (9.31)

from where it is straightforward to see that if α is restricted to be a positive
integer, then

Γ(α) = (α − 1)!        (9.32)

as presented earlier in Eq (9.25).
Second, in the special case when α is an integer (say k, as in the preceding
derivation), the gamma distribution is known as the Erlang distribution, in
honor of the Danish mathematician and engineer, A. K. Erlang (1878–1929),
who developed the pdf while working for the Copenhagen Telephone Company.
In an attempt to determine how many circuits were needed to provide an
acceptable telephone service, Erlang studied the number of telephone calls made
to operators at switching stations, in particular, the time between incoming
calls. The gamma distribution formally generalizes the Erlang distribution by
introducing the real number α in place of Erlang's integer, k, and replacing
(k − 1)! with the gamma function.
Finally, observe that when α = 1, the resulting pdf is precisely the exponential
pdf obtained in the previous subsection. Thus, the exponential random
variable is a special case of the gamma random variable, a result that should
not be surprising given how the pdf for each random variable was derived.
Important Mathematical Characteristics
The following are some important mathematical characteristics of the
gamma random variable, γ(α, β), and its pdf:


FIGURE 9.2: Gamma pdfs for various values of parameters α and β: Note how with
increasing values of α the shape becomes less skewed, and how the breadth of the
distribution increases with increasing values of β.
1. Characteristic parameters: α > 0; β > 0.
α, the shape parameter, determines the overall shape of the distribution
(how skewed or symmetric, peaked or flat);
β, the scale parameter, determines how wide the distribution is, with
larger values of β corresponding to wider distributions (see Fig 9.2).
2. Mean: μ = E(X) = αβ.
Other measures of central location: Mode = β(α − 1); α ≥ 1;
Median: no closed-form analytical expression.
3. Variance: σ²(X) = αβ²
4. Higher Moments: Coefficient of Skewness: γ₃ = 2α^{−1/2};
Coefficient of Kurtosis: γ₄ = 3 + 6/α,
implying that the distribution is positively skewed, but becomes less
so with increasing α, and sharply peaked, approaching the "normal"
reference kurtosis value of 3 as α → ∞. (See Fig 9.2).
5. Moment generating and Characteristic functions:

M(t) = 1/(1 − βt)^α        (9.33)
φ(t) = 1/(1 − jβt)^α        (9.34)


6. Survival function:

S(x) = e^{−x/β} Σ_{i=0}^{α−1} (x/β)^i / i!        (9.35)

valid for the Erlang variable (α = integer);
Hazard function:

h(x) = (x/β)^{α−1} / [βΓ(α) Σ_{i=0}^{α−1} (x/β)^i / i!]        (9.36)

7. Relation to the exponential random variable: If Yᵢ is an exponential
random variable with characteristic parameter β, i.e., Yᵢ ∼ E(β),
then the random variable X defined as follows:

X = Σ_{i=1}^{α} Yᵢ        (9.37)

is the gamma random variable γ(α, β). In words: a sum of α independent
and identically distributed exponential random variables is a gamma
random variable (more precisely an Erlang random variable, because α
will have to be an integer). This result is intuitive from the underlying
characteristics of each random variable as presented in the earlier
derivations; it is also straightforward to establish using results from Chapter
6 (See Exercise 9.8).
8. Reproductive Properties: The gamma random variable possesses
the following useful property: If Xᵢ, i = 1, 2, . . . , n, are n independent
gamma random variables, each with different shape parameters αᵢ but a
common scale parameter β, i.e., Xᵢ ∼ γ(αᵢ, β), then the random variable
Y defined as:

Y = Σ_{i=1}^n Xᵢ        (9.38)

is also a gamma random variable, with shape parameter α* = Σ_{i=1}^n αᵢ
and scale parameter β, i.e., Y ∼ γ(α*, β). Thus, a sum of gamma random
variables with identical scale parameters begets another gamma random
variable with the same scale parameter, hence the term "reproductive."
(Recall Example 6.6 in Chapter 6.) Furthermore, the random variable
Z defined as

Z = c Σ_{i=1}^n Xᵢ        (9.39)

where c is a constant, is also a gamma random variable, with shape
parameter α* = Σ_{i=1}^n αᵢ but with scale parameter cβ, i.e., Z ∼ γ(α*, cβ).
(See Exercise 9.9.)
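Property 7 lends itself to a quick simulation check; the Python sketch below (assuming numpy is available; names are illustrative) sums α independent E(β) variables and compares the sample mean and variance with the theoretical gamma values αβ and αβ²:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, beta = 4, 2.0                       # integer shape and scale

    # Sum of alpha i.i.d. exponential E(beta) variables, replicated many times
    y = rng.exponential(scale=beta, size=(100_000, alpha)).sum(axis=1)

    print(y.mean(), alpha * beta)      # ~8.0 vs 8.0   (Erlang mean)
    print(y.var(), alpha * beta**2)    # ~16.0 vs 16.0 (Erlang variance)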


Applications
The gamma pdf finds application in problems involving system "time-to-failure"
when system failure occurs as a result of α independent subsystem
failures, each occurring at a constant rate 1/β. A standard example is the
so-called "standby redundant system" consisting of n components, where system
function requires only one component, with the others as backup; when one
component fails, another takes over automatically. Complete system failure
therefore does not occur until all n components have failed. For similar reasons,
the gamma pdf is also used to study and analyze time between maintenance
operations. Because of its flexible shape, the gamma distribution is frequently
considered in modeling engineering data of general non-negative phenomena.
As an extension of the application of the exponential distribution in
residence time distribution studies in single ideal CSTRs, the gamma distribution
may be used for the residence time distribution in several identical CSTRs in
series.
The gamma pdf is also used in experimental and theoretical neurobiology,
especially in studies involving "action potentials," the spike trains generated
by neurons as a result of nerve-cell activity. These spike trains and the
dynamic processes that cause them are random; and it is known that the
distribution of interspike intervals (ISI), the elapsed time between the
appearance of two consecutive spikes in the spike train, encodes information about
synaptic mechanisms¹. Because action potentials are generated as a result of a
sequence of physiological events, the ISI distribution is often well-modeled by
the gamma pdf².
Finally, the gamma distribution has been used recently to model the
distribution of distances between DNA replication origins in cells. Fig 9.3, adapted
from Chapter 7 of Birtwistle (2008)³, shows a γ(5.05, 8.14) distribution fit to
data reported in Patel et al. (2006)⁴ on inter-origin distances in the budding
yeast S. cerevisiae. Note the excellent agreement between the gamma
distribution model and the experimental data.
¹Braitenberg, 1965: "What can be learned from spike interval histograms about synaptic
mechanism?" J. Theor. Biol., 8, 419–425.
²H.C. Tuckwell, 1988, Introduction to Theoretical Neurobiology, Vol 2: Nonlinear and
Stochastic Theories, Chapter 9, Cambridge University Press.
³Birtwistle, M. R. (2008). Modeling and Analysis of the ErbB Signaling Network: From
Single Cells to Tumorigenesis, PhD Dissertation, University of Delaware.
⁴Patel, P. K., Arcangioli, B., Baker, S. P., Bensimon, A., and Rhind, N. (2006). "DNA
replication origins fire stochastically in fission yeast." Mol Biol Cell, 17, 308–316.

FIGURE 9.3: Gamma distribution, γ(5.05, 8.14), fit to data on inter-origin distances
in the budding yeast S. cerevisiae genome (x-axis: inter-origin distance, kb;
y-axis: frequency, %).

Example 9.2 REPLACING AUTOMOBILE TIMING BELTS
Automobile manufacturers specify in their maintenance manuals the
recommended mileage at which various components are to be replaced.
One such component, the timing belt, must be replaced before
it breaks, since a broken timing belt renders the automobile entirely
inoperative. An extensive experimental study carried out before the
official launch of a new model of a certain automobile concluded that X,
the lifetimes of the automobile's timing belts (in 10,000 driven miles),
is well-modeled by the following pdf:

f(x) = (1/Γ(10)) e^{−x} x⁹;  0 < x < ∞        (9.40)

The manufacturer wishes to specify as the recommended mileage (in
10,000 driven miles) at which the timing belt should be replaced, a
value x*, which is the most likely of all possible values of X, because
it maximizes the given f(x). Determine x* and compare it with the
expected value, E(X) (i.e., the average mileage at failure).
Solution:
First, observe that the given pdf is that of a gamma random variable
with parameters α = 10 and β = 1. Next, from the problem statement
we see that the desired x* is the mode of this gamma pdf, in this case:

x* = β(α − 1) = 9        (9.41)

with the implication that the timing belt should be changed at or before
90,000 driven miles.
For this specific problem, E(X) = αβ = 10, indicating that the
expected value is 100,000 miles, a value that is longer than the value at
the distribution's mode. It appears therefore as if choosing x* = 90,000
miles as the recommended mileage for the timing belt replacement is a
safe and reasonable, if conservative, choice. However, since the breakage
of a timing belt while the car is in operation on the road is highly
undesirable, recommending replacement at 90,000 miles rather than at
the expected average lifetime of 100,000 miles makes sense.
the expected average lifetime of 100,000 miles makes sense.

9.1.3 The Chi-Square Random Variable

Basic Characteristics and Model


An important special case of the gamma random variable arises when
= 2 and = r/2 with r a positive integer; this is known as the Chisquare random variable, with the following model, obtained directly from
the gamma pdf:
f (x) =

r
1
ex/2 x 2 1 ; 0 < x <
2r/2 (r/2)

(9.42)

This is the pdf for a 2 (r) random variable with r degrees of freedom. In
particular when r = 1, the resulting random variable, 2 (1), has the pdf:
f (x) =

1
ex/2 x1/2 ; 0 < x <
2(1/2)

(9.43)

Important Mathematical Characteristics


The following mathematical characteristics of the χ²(r) random variable
and its pdf derive directly from those of the gamma random variable:
1. Characteristic parameter: r, a positive integer, is the shape parameter,
better known as the "degrees of freedom," for reasons discussed later
in Part IV in the context of the χ²(r) random variable's most significant
application in statistical inference.
2. Mean: μ = r; Mode = (r − 2); r ≥ 2
3. Variance: σ² = 2r
4. Higher Moments: Coefficient of Skewness: γ₃ = 2√(2/r);
Coefficient of Kurtosis: γ₄ = 3 + 12/r.
5. Moment generating and Characteristic functions:

M(t) = 1/(1 − 2t)^{r/2}        (9.44)
φ(t) = 1/(1 − j2t)^{r/2}        (9.45)

272

Random Phenomena

6. Reproductive Properties: The Chi-square random variable inherits
from the gamma random variable the following reproductive properties:
if Xᵢ, i = 1, 2, . . . , k, are k independent χ²(1) random variables, then
the random variable Y defined as Y = Σ_{i=1}^k Xᵢ has a χ²(k) distribution.
Similarly, if Xᵢ, i = 1, 2, . . . , n, are n independent χ²(r) random
variables, then the random variable W defined as W = Σ_{i=1}^n Xᵢ has a
χ²(nr) distribution. These results find important applications in statistical
inference.
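A simulation sketch of the first of these properties (Python, assuming numpy is available, and using the standard fact that a squared standard normal variable is χ²(1); names are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    k = 5

    # Sum k independent chi-square(1) variables (squared standard normals)
    y = (rng.standard_normal(size=(100_000, k)) ** 2).sum(axis=1)

    print(y.mean(), k)       # ~5 vs 5   (mean of chi-square(k))
    print(y.var(), 2 * k)    # ~10 vs 10 (variance of chi-square(k))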
Because this random variable is the one family member that is not used in
reliability and lifetime studies, we do not provide expressions for the survival
and hazard functions.
Applications
The chi-square random variable is one of a handful of random variables
of importance in statistical inference. These applications are discussed more
fully in Part IV.

9.1.4 The Weibull Random Variable

To set the stage for the definition of the Weibull random variable, and a
derivation of its pdf, let us return briefly to the discussion of the exponential
random variable and recall its hazard function, h(t) = η, a constant. According
to the definition given in Chapter 4, the corresponding cumulative hazard
function (chf), H(t), is given by:

H(t) = ηt        (9.46)

Let us now consider a different situation in which the underlying Poisson
random variable's intensity, η, is not constant but dependent on the interval
length, t, in such a way that its chf, H(t), rather than being a linear function
of t as in Eq (9.46), is given instead by:

H(t) = (ηt)^ζ        (9.47)

Thus,

η(t) = h(t) = (d/dt)H(t) = ζη(ηt)^{ζ−1}        (9.48)
(9.48)

Basic Characteristics and Model Development


As a generalization of the exponential random variable, consider the random
variable, T, defined as the waiting time to the first occurrence of events
occurring, this time, at a non-constant, time-dependent mean rate given
in Eq (9.48). Then, either by recalling the relationship between the cumulative
hazard function and the cumulative distribution function of a random
variable, i.e.,

F_T(t) = 1 − e^{−H(t)}        (9.49)

or else, following the derivation given above for the exponential random
variable, we obtain the cumulative distribution function for this random variable,
T, as:

F_T(t) = 1 − e^{−(ηt)^ζ}        (9.50)

Upon differentiating once, we obtain:

f(t) = ζη(ηt)^{ζ−1} e^{−(ηt)^ζ};  0 < t < ∞        (9.51)

This is the pdf of a Weibull random variable, named for the Swedish scientist,
Waloddi Weibull (1887–1979), who derived and introduced this distribution
in a 1939 publication on the analysis of the breaking strength of materials⁵.
It is important now to note that there is something of an "empirical" feel
to the Weibull distribution, in the sense that if one can conceive of any other
reasonable hazard function, h(t), with the corresponding chf, H(t), such a
random variable will have a pdf given by

f(t) = h(t)e^{−H(t)}        (9.52)

There is nothing particularly "phenomenological" about the specific cumulative
hazard function, H(t), introduced in Eq (9.47) which eventually led to the
Weibull distribution; it merely makes the simple linear chf H(t) = ηt of the
exponential random variable more complex by raising it to a generic power ζ,
an additional parameter that is to be chosen to fit observations. Unlike the
parameter η, which has a phenomenological basis, there is no such basis for ζ.
We shall have cause to return to this point a bit later.
The Model and Some Remarks
The pdf for the Weibull random variable, X, is given in general by:

f(x) = ζη(ηx)^{ζ−1} e^{−(ηx)^ζ};  0 < x < ∞        (9.53)

or, for β = 1/η,

f(x) = (ζ/β)(x/β)^{ζ−1} e^{−(x/β)^ζ};  0 < x < ∞        (9.54)

The model is sometimes written in the following alternative form:

f(x) = ζλx^{ζ−1} e^{−λx^ζ}        (9.55)

where

λ = η^ζ        (9.56)

We prefer the form given in Eq (9.54). First, β is the more natural parameter
(as we show shortly); second, its role in determining the characteristics of
the pdf is distinguishable from that of the second parameter, ζ, in Eq (9.54),
whereas λ in Eq (9.55) is a convolution of the two parameters.

FIGURE 9.4: Weibull pdfs for various values of parameters ζ and β (combinations
of ζ = 2, 5 and β = 2, 5): Note how with increasing values of ζ the shape becomes
less skewed, and how the breadth of the distribution increases with increasing
values of β.

⁵An earlier independent derivation, due to Fisher and Tippett (1928), was unknown to
the engineering community until long after Weibull's work had become widely known and
adopted.
In the special case where ζ = 1, the Weibull pdf, not surprisingly, reduces
to the exponential pdf. In general, the Weibull random variable, W(ζ, β), is
related to the exponential random variable E(β) as follows: if Y ∼ E(β), then

X = Y^{1/ζ}        (9.57)

is a W(ζ, β) random variable; conversely, if X ∼ W(ζ, β), then

Y = X^ζ        (9.58)

is an E(β) random variable.


Important Mathematical Characteristics
The following are some important mathematical characteristics of the
Weibull random variable and its pdf:
1. Characteristic parameters: ζ > 0 and β > 0.
β, as with all the other distributions in this family, is the scale parameter.
In this case, it is also known as the "characteristic life," for reasons
discussed shortly;
ζ is the shape parameter. (See Fig 9.4).
2. Mean: μ = βΓ(1 + 1/ζ); Mode = β(1 − 1/ζ)^{1/ζ}, ζ ≥ 1; Median =
β(ln 2)^{1/ζ}.
3. Variance: σ² = β²{Γ(1 + 2/ζ) − [Γ(1 + 1/ζ)]²}
4. Higher Moments: Closed-form expressions for the coefficients of skewness
and kurtosis are very complex, as are the expressions for the MGF
and the Characteristic function.
5. Survival function:

S(x) = e^{−(x/β)^ζ}, or e^{−(ηx)^ζ}        (9.59)

Hazard function:

h(x) = (ζ/β)(x/β)^{ζ−1}, or ζη(ηx)^{ζ−1}        (9.60)

Cumulative Hazard function:

H(x) = (x/β)^ζ, or (ηx)^ζ        (9.61)

Applications
The Weibull distribution naturally finds application predominantly in
reliability and life-testing studies. It is a very versatile pdf that provides a
particularly good fit to time-to-failure data when the mean failure rate is
time-dependent. It is therefore utilized in problems involving lifetimes of complex
electronic equipment and of biological organisms, as well as in characterizing
failure in mechanical systems. While this pdf is sometimes used to describe the
size distribution of particles generated by grinding or crushing operations, it
should be clear from the derivation given in this section that such applications
are not as "natural" as life-testing applications. When used for particle size
characterization, the distribution is sometimes known as the Rosin-Rammler
distribution.
An interesting characteristic of the Weibull distribution arises from the
following result: when x = β,

P(X ≤ β) = 1 − e^{−1} = 0.632        (9.62)

for all values of ζ. The parameter β is therefore known as the "characteristic
life"; it is the 63.2 percentile of the lifetimes of the phenomenon under study.
Readers familiar with process dynamics will recognize the similarity this
parameter bears with the time constant, τ, of the first-order system; in fact, β
is to the Weibull random variable what τ is to first-order dynamic systems.
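That this 63.2% result holds for every value of ζ is quickly confirmed numerically (a Python sketch assuming scipy is available; names are illustrative):

    from scipy.stats import weibull_min

    beta = 3.0                           # scale (characteristic life)
    for zeta in (0.5, 1.0, 2.0, 5.0):    # several shape parameters
        # P(X <= beta) for a Weibull with shape zeta and scale beta
        print(zeta, weibull_min.cdf(beta, c=zeta, scale=beta))  # always ~0.632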

9.1.5 The Generalized Gamma Model

To conclude, we now note that all 4 random variables discussed in this
section can be represented as special cases of the generalized gamma random
variable, X, with the following pdf:

f(x) = [ζ/(β^{αζ} Γ(α))] exp{−[(x − γ)/β]^ζ} (x − γ)^{αζ−1};  0 < γ < x < ∞        (9.63)

clearly a generalization of Eq (9.27), with the location parameter, γ ≥ 0; scale
parameter, β; and shape parameters, α and ζ. Observe that each pdf discussed
in this section may be obtained as a special case of Eq (9.63) as follows:

1. Exponential: ζ = α = 1; γ = 0;
2. Gamma: ζ = 1; γ = 0;
3. Chi-squared, χ²(r): α = r/2, β = 2, ζ = 1; γ = 0;
4. Weibull: α = 1; γ = 0;

highlighting why these four random variables naturally belong together.


Before leaving the Gamma family of distributions we wish to examine a
mixture distribution involving the gamma distribution.

9.1.6 The Poisson-Gamma Mixture Distribution

We begin with a Poisson random variable, X, whose now-familiar pdf is:

f(x) = λ^x e^{−λ} / x!;  x = 0, 1, 2, . . . ,        (9.64)

but this time, consider that the parameter λ is not constant, but is itself a
random variable. This will be the case, for example, if X represents the number
of automobile accidents reported to a company that insures a population of
clients for whom the propensity for accidents varies widely. The appropriate
model for the entire population of clients will then consist of a mixture of
Poisson random variables with different values of λ for the different subgroups
within the population. The two most important consequences of this problem
definition are as follows:
1. Both X and λ are random variables, and are characterized by a joint
pdf f(x, λ);
2. The pdf in Eq (9.64) must now properly be considered as the conditional
distribution of X given λ, i.e., f(x|λ); the expression is a function of x
alone only if the parameter λ is constant and completely specified (as in
Example 8.8 in Chapter 8, where, for the inclusions problem, λ = 1).
Under these circumstances, what is desired is the unconditional (or marginal)
distribution for X, f(x); and for this we need a marginal distribution f(λ) for
the parameter λ.
Let us now consider the specific case where the parameter λ is a gamma-distributed
random variable, i.e.,

f(λ) = (1/(β^α Γ(α))) e^{−λ/β} λ^{α−1};  0 < λ < ∞        (9.65)

In this case, recalling the discussions in Chapter 5, we know that by definition,
the joint pdf is obtained as

f(x, λ) = f(x|λ)f(λ)        (9.66)

from where we may now obtain the desired marginal pdf f(x) by integrating
out λ, i.e.,

f(x) = ∫_0^∞ f(x|λ)f(λ) dλ = ∫_0^∞ [λ^x e^{−λ}/x!] [(1/(β^α Γ(α))) e^{−λ/β} λ^{α−1}] dλ        (9.67)

which is easily rearranged to yield:

f(x) = (1/(x! β^α Γ(α))) ∫_0^∞ e^{−λ/β*} λ^{α*−1} dλ        (9.68)

where

1/β* = (1 + β)/β        (9.69)
α* = x + α        (9.70)

The reason for such a parameterization is that the integral becomes easy to
determine by analogy with the gamma pdf, i.e.,

∫_0^∞ e^{−λ/β*} λ^{α*−1} dλ = Γ(α*)(β*)^{α*}        (9.71)

As a result, Eq (9.68) becomes

f(x) = Γ(α*)(β*)^{α*} / (x! β^α Γ(α))        (9.72)

which, upon introducing the factorial representation of the gamma function,
simplifies to:

f(x) = [(x + α − 1)! / ((α − 1)! x!)] β^x (1 + β)^{−(x+α)}        (9.73)


If we now define

1 + β = 1/p;  β = (1 − p)/p        (9.74)

then Eq (9.73) finally reduces to

f(x) = [(x + α − 1)! / ((α − 1)! x!)] p^α (1 − p)^x        (9.75)

immediately recognized as the pdf for a negative binomial NBi(α, p) random
variable. This pdf is formally known as the Poisson-Gamma mixture (or compound)
distribution, because it is composed from a Poisson random variable
whose parameter λ is gamma distributed; it just happens to be identical to the
negative binomial distribution.
As stated in Section 8.7, therefore, the appropriate model for a Poisson
phenomenon with variable characteristic parameter λ is the negative binomial
distribution where, from the results obtained here, k = α and p = 1/(1 + β).
The relationship between the underlying Poisson model's parameter λ (which
is no longer constant) and the resulting negative binomial model is obtained
from the gamma pdf used for f(λ), i.e.,

E(λ) = αβ = α(1 − p)/p        (9.76)

which is precisely the same as the expected value of X ∼ NBi(α, p).
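This Poisson-Gamma/negative binomial equivalence is easily demonstrated by simulation; the Python sketch below (assuming numpy and scipy are available; names are illustrative) draws λ from a gamma distribution, then X from a Poisson with that λ, and compares the resulting frequencies with the NBi(α, p) pmf:

    import numpy as np
    from scipy.stats import nbinom

    rng = np.random.default_rng(2)
    alpha, beta = 3.0, 1.5
    p = 1 / (1 + beta)         # negative binomial "success" probability

    # Mixture: lambda ~ gamma(alpha, beta), then X ~ Poisson(lambda)
    lam = rng.gamma(shape=alpha, scale=beta, size=200_000)
    x = rng.poisson(lam)

    # Empirical frequencies vs. the NBi(alpha, p) pmf
    for k in range(5):
        print(k, (x == k).mean(), nbinom.pmf(k, alpha, p))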

9.2 Gaussian Family Random Variables

The next family of continuous random variables consists of the following
3 members:
• The Gaussian (Normal) random variable,
• The Lognormal random variable, and
• The Rayleigh random variable,
grouped together, once again, because of shared structural characteristics.
This time, however, what is shared is not the domain of the random variables.
While the first takes values on the entire real line, −∞ < x < ∞, the other two
take only non-negative values, 0 < x < ∞. What is shared is the functional
form of the pdfs themselves.


The Gaussian random variable, the first, and defining member of this family,
is unquestionably the most familiar of all random variables, finding broad
application in a wide variety of problems, most notably in statistical inference.
Unfortunately, by the same token, it also is one of the most "misused,"
a point we shall return to later. The second, as its name (lognormal) suggests,
is derivative of the first in the sense that a logarithmic transformation
converts it to the first random variable. It also finds application in statistical
inference, especially with such strictly non-negative phenomena as household
income, home prices, organism sizes, molecular weight of polymer molecules,
particle sizes in powders, crystals, granules and other particulate material, etc.,
entities whose values can vary over several orders of magnitude. The last
variable, perhaps the least familiar, is ideal for representing random deviations
of "hits" from a target on a plane. It owes its membership in the family to this
very phenomenological characteristic, random fluctuations around a target,
that is shared in one form or another by other family members.
Once again, our presentation here centers around derivations of each random
variable's probability model, to emphasize the phenomena underlying
each one. A generalized model encompassing all family members is shown at
the end to tie everything together.

9.2.1 The Gaussian (Normal) Random Variable

Background and Model Development


The Gaussian random variable, one of the most versatile and most commonly encountered in physical phenomena, is so named for Johann Carl
Friedrich Gauss (17771855), the prodigious German mathematician and scientist, whose application of the pdf in his analysis of astronomical data helped
popularize the probability model.
However, while Gausss name is now forever associated with this random
variable and its pdf, the rst application on record is actually attributed to
Abraham de Moivre (16671754) in a 1733 paper, and later to Pierre-Simon
de Laplace (17491827) who used the distribution for a systematic analysis
of measurement errors. The term normal distribution became a widely accepted synonym in the late 1800s because of the popular misconception that
the pdf represents a law of nature which most, if not all, random phenomena
follow. The name, and to some extent, the misconception, survive to this day.
As we now show through a series of derivations, the Gaussian random variable is in fact primarily characterized by an accumulation of a large number
of small, additive, independent eects. We present three approaches to the
development of the probability model for this random variable: (i) as a limit
of the binomial random variable; (ii) from a rst principles analysis of random motion on a line (also known as the Laplace model of errors); and (iii)
the Herschel/Maxwell model of random deviations of hits from a target.


I: Limiting Case of the Binomial Random Variable


Let us begin by considering a binomial random variable, X ∼ Bi(n, p), in the
limit as the number of trials becomes very large, i.e., n → ∞. Unlike the Poisson
random variable case, we will not require that p, the probability of success,
shrink commensurately; rather, p is to remain fixed, with the implication that
the mean number of successes, μ_x = np, will also continue to increase with n,
as will the observed number of successes, X, itself.
This motivates us to define a random variable:

Y = X − μ_x        (9.77)

the total number of successes in excess of the theoretical mean (i.e., Y
represents the deviation of X from the theoretical mean value). Observe that Y
is positive when the observed number of successes exceeds the "mean number
of successes," in which case X will lie to the right of μ_x on the real
line. Conversely, a negative Y implies that there are fewer successes than the
mean number of successes (or equivalently, that there are more failures than
the mean number of failures), so that X will lie to the left of μ_x. When
the number of successes matches the mean value precisely, Y = 0. If this
deviation variable is scaled by the standard deviation σ_x, we obtain the
standardized deviation variable

Z = (X − μ_x)/σ_x        (9.78)

It is important to note that even though X, μ_x, and σ_x can each potentially
increase indefinitely as n → ∞, Z, on the other hand, is well-behaved: regardless
of n, its expected value E(Z) = μ_z = 0 and its variance σ_z² = 1 (see
Exercise 9.15).
We are now interested first in obtaining a pdf, f(z), for Z, the standardized
deviation of the binomial random variable from its theoretical mean value, in
the limit of a large number of trials; the desired f(x) for X will then be
recovered from f(z) and the transformation in Eq (9.78).
Out of a variety of ways of obtaining the pdf, f(z), we opt for the method
of characteristic functions. Recall that the characteristic function for the
binomial random variable, X, is

φ_x(t) = (pe^{jt} + q)^n        (9.79)

and from the properties of the MGF and CF given in Chapter 4, we obtain
from Eq (9.78) that

φ_z(t) = e^{−jtnp/σ_x} (pe^{jt/σ_x} + q)^n        (9.80)

whereupon taking natural logarithms yields

ln φ_z(t) = −(jtnp/σ_x) + n ln[1 + p(e^{jt/σ_x} − 1)]        (9.81)


having introduced (1 − p) for q, the complementary probability of "failure." A
Taylor series expansion of the exponential term yields:

ln φ_z(t) = −(jtnp/σ_x) + n ln{1 + p[(jt/σ_x) − (1/2)(t/σ_x)² + O(σ_x^{−3})]}        (9.82)

where O(σ_x^{−3}) is a term that goes to zero faster than σ_x^{−3} as n → ∞ (recall that
for the binomial random variable, σ_x = √(npq), so that σ_x^{−3} → 0 as n → ∞).
If we now invoke the result that

ln(1 + ξ) = ξ − ξ²/2 + ξ³/3 − · · ·        (9.83)

with

ξ = p[(jt/σ_x) − (1/2)(t/σ_x)²] + O(σ_x^{−3})        (9.84)

so that,

ξ² = −(p²t²/σ_x²) + O(σ_x^{−3})        (9.85)

then the second term on the RHS in Eq (9.82) reduces to:

n[(pjt/σ_x) − (1/2)(t²/σ_x²)(p − p²)] + O(σ_x^{−3})

From here, the entire Eq (9.82) becomes:

ln φ_z(t) = −(jtnp/σ_x) + (jtnp/σ_x) − (1/2)t²[np(1 − p)/σ_x²] + O(σ_x^{−3})        (9.86)

which, in the limit as n → ∞, simplifies to

ln φ_z(t) = −t²/2        (9.87)

so that:

φ_z(t) = e^{−t²/2}        (9.88)

To obtain the corresponding f(z), we may now simply consult compilations of
characteristic functions, or else, from the definition of characteristic function
and pdf pairs given in Chapter 4, obtain f(z) from the integral:

f(z) = (1/2π) ∫_{−∞}^{∞} e^{−jtz} e^{−t²/2} dt        (9.89)

and upon carrying out the indicated integration, we obtain the required pdf:

f(z) = (1/√(2π)) e^{−z²/2}        (9.90)


as the pdf of the standardized deviation variable, Z, defined in Eq (9.78).
From the relationship between Z and X, it is now a straightforward matter
to recover the pdf for X as (see Exercise 9.15):

f(x) = (1/(σ_x√(2π))) e^{−(x−μ_x)²/(2σ_x²)}        (9.91)

This is the pdf of a Gaussian random variable with mean value μ_x and
standard deviation σ_x inherited from the original binomial random variable.
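The quality of this limiting result is easy to see numerically; the Python sketch below (assuming numpy and scipy are available; names are illustrative) compares a Bi(n, p) pmf with the Gaussian pdf having μ = np and σ² = np(1 − p):

    import numpy as np
    from scipy.stats import binom, norm

    n, p = 100, 0.3
    mu, sigma = n * p, np.sqrt(n * p * (1 - p))

    # Binomial probabilities vs. the matching Gaussian density
    for x in range(25, 36):
        print(x, binom.pmf(x, n, p), norm.pdf(x, loc=mu, scale=sigma))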
II: First-Principles (Random Motion in a Line)
Consider a particle moving randomly along the x-axis, with motion governed
by the following rules:
1. Each move involves taking a single step of fixed length, Δx, once every
time interval Δt;
2. The step can be to the right (with probability p), or to the left (with
probability q = 1 − p); for simplicity, this presentation will consider
equal probabilities, p = q = 1/2; the more general derivation is a bit
more complicated, but the final result is the same.
We are interested in the probability of finding a particle m steps to the right of
the starting point after making n independent moves, in the limit as n → ∞.
Before engaging in the model derivation, the following are some important
points about the integer m that will be useful later:
1. m can be negative or positive, and is restricted to lie between −n and n;
2. If k > 0 is the total number of steps taken to the right (so that the total
number of steps taken to the left is (n − k)), for the particle to reach a
point m steps to the right implies that:

m = k − (n − k) = (2k − n)

so that:

k = (1/2)(n + m)

3. m = 2k − n must be even when n is even, and odd when n is odd;
therefore
4. m ranges from −n to n in steps of 2. For example, if n = 3 (3 total steps
taken), then m can take only the values −3, −1, 1, or 3; and for n = 4,
m can only be −4, −2, 0, 2 or 4.
Now, define as P(x, t) the probability that a particle starting at the origin
at time t = 0 arrives at a point x at time t, where,

x = mΔx;  t = nΔt        (9.92)


It is then true that:

P(x, t + Δt) = (1/2)P(x − Δx, t) + (1/2)P(x + Δx, t)        (9.93)

To reach point x at time (t + Δt), then at time t, one of two events must
happen: (i) the particle must reach point (x − Δx) (with probability P(x − Δx, t))
and then take a step to the right (with probability 1/2); or (ii) the
particle must reach point (x + Δx) (with probability P(x + Δx, t)), and then
take a step to the left (with probability 1/2). This is what is represented by
Eq (9.93); in the limit as n → ∞, it provides an expression for the pdf we
wish to derive. The associated initial conditions are:

P(0, 0) = 1
P(x, 0) = 0;  x ≠ 0        (9.94)

indicating that, since we began at the origin, the probability of finding this
particle at the origin, at t = 0, is 1; and also that the probability of finding
the particle, at this same time t = 0, at any other point, x, that is not the
origin, is 0.
Now, as n → ∞, for t to remain fixed, Δt must tend to zero; similarly, as
n → ∞, so must m → ∞, and for x to remain fixed, Δx must tend to zero
as well. However, by definition, m < n, so that as both become large, n → ∞
"faster" than m, so that m/n → 0; i.e.,

m/n = (x/Δx)(Δt/t) = (Δt/Δx)(x/t) → 0        (9.95)

implying that Δx/Δt → ∞, but in such a way that

Δx (Δx/Δt) = fixed        (9.96)

The importance of this point will soon become clear.


Now, subtracting P(x, t) from both sides of Eq (9.93) yields:

P(x, t + Δt) − P(x, t) = (1/2)P(x − Δx, t) − (1/2)P(x, t) + (1/2)P(x + Δx, t) − (1/2)P(x, t)        (9.97)

Upon multiplying the left hand side by (Δt/Δt) and the right hand side by
(Δx²/Δx²) and rearranging appropriately, we obtain:

LHS = [(P(x, t + Δt) − P(x, t))/Δt] Δt        (9.98)

RHS = (1/2)[(P(x + Δx, t) − P(x, t))/Δx − (P(x, t) − P(x − Δx, t))/Δx] (1/Δx) (Δx)²

And now, recalling Eq (9.96) and defining

lim_{Δx→0, Δt→0} (Δx²)/(2Δt) = D ≠ 0        (9.99)


(where D is a xed constant), then upon taking limits as x 0 and t 0


above, we obtain:
P (x, t)
2 P (x, t)
=D
(9.100)
t
x2
an equation that may be recognizable to readers familiar with the physical
phenomenon of diusion.
Before proceeding to solve this equation for P(x, t), however, we need to recognize that, strictly speaking, as the number of points becomes infinite (as n, m → ∞ and Δx → 0 and Δt → 0), the probability of finding a particle at any arbitrary point x tends to zero. To regularize this function, let us recall the nature of m (which indicates that the particle can occupy only every other point on the x-axis), and introduce the function

f(x, t) = \frac{P(x, t)}{2\Delta x} \quad (9.101)

As a result, the probability of finding a particle, at time t, on or between two points a₁ = iΔx and a₂ = kΔx is properly given by:

P(a_1 < x < a_2; t) = \sum_{m=i}^{k} f(m\Delta x, t)\,2\Delta x \quad (9.102)

(keeping in mind that the sum is for every other value of m from i to k, with both indices even if n is even, and odd if n is odd). Eq (9.100) now becomes

\frac{\partial f(x,t)}{\partial t} = D\,\frac{\partial^2 f(x,t)}{\partial x^2} \quad (9.103)

which is more mathematically precise for determining probabilities in the limit as the original function becomes continuous, even though it looks almost exactly like Eq (9.100). Observe that in the limit as Δx → 0, Eq (9.102) becomes:

P(a_1 < x < a_2; t) = \int_{a_1}^{a_2} f(x, t)\,dx \quad (9.104)

so that, as the reader may perhaps have suspected all along, f(x, t) is the required (time dependent) probability density function for the random phenomenon in question: random motion on a line.
We are now in a position to solve Eq (9.103), but the initial conditions for P(x, t) in Eq (9.94) must now be modified appropriately for f(x, t) as follows:

\int_{-\infty}^{\infty} f(x, t)\,dx = 1
\lim_{t \to 0} f(x, t) = 0; \quad x \neq 0 \quad (9.105)

The first implies that, at any point in time, the particle will, with certainty, be located somewhere on the x-axis; the second is the continuous equivalent of the condition on P(x, 0) in Eq (9.94), a requirement for the particle starting at the origin, at t = 0.
Readers familiar with partial differential equations in general, or the one shown in Eq (9.103) in particular, will be able to confirm that the solution, subject to the conditions in Eq (9.105), is:

f(x, t) = \frac{1}{\sqrt{4\pi Dt}}\,e^{-x^2/4Dt} \quad (9.106)

If we now return to the definition of the parameter D in Eq (9.99), we see that:

2Dt = \Delta x^2\left(\frac{t}{\Delta t}\right) = n\,\Delta x^2 \quad (9.107)

which is the total sum of squared displacements (fluctuations) to the right and to the left in time t. If we represent this measure of the vigor of the dispersion as b², we obtain the expression,

f(x) = \frac{1}{b\sqrt{2\pi}}\,e^{-x^2/2b^2} \quad (9.108)

where the time argument t has been suppressed because it is no longer explicit (having been subsumed into the dispersion parameter b). Finally, for an arbitrary starting point a ≠ 0, the required pdf becomes:

f(x) = \frac{1}{b\sqrt{2\pi}}\,e^{-(x-a)^2/2b^2} \quad (9.109)

which is to be compared with the expression obtained earlier in Eq (9.91).
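Although the text itself contains no computer code, the limiting result is easy to probe numerically. The following is a minimal simulation sketch (the choice of Python/NumPy/SciPy and all variable names are illustrative assumptions, not part of the text): it generates many n-step symmetric random walks and checks that the final positions exhibit the moments of the limiting Gaussian form, with dispersion parameter b² = nΔx² as in Eq (9.107).

```python
# Illustrative sketch only; the simulation design and names are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_walks, dx = 1000, 20_000, 1.0    # steps per walk, replicate walks, step size

# Each move is +dx or -dx with probability 1/2 (p = q = 1/2)
steps = rng.choice([-dx, dx], size=(n_walks, n))
x_final = steps.sum(axis=1)           # x = m*dx after n moves

b = np.sqrt(n) * dx                   # Eq (9.107): b^2 = n*(dx)^2
print("mean    %7.3f (Gaussian limit: 0)" % x_final.mean())
print("std     %7.3f (Gaussian limit: b = %.3f)" % (x_final.std(), b))
print("skew    %7.3f (Gaussian limit: 0)" % stats.skew(x_final))
print("xs kurt %7.3f (Gaussian limit: 0)" % stats.kurtosis(x_final))
```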


III: Herschel/Maxwell Model
Consider an experiment in which small pellets (or grains of sand) are dropped onto a plane from above the point labeled O, as illustrated in Fig 9.5. We are interested in the probability of finding one of the pellets on the shaded element of area ΔA under the following conditions:
1. Symmetry: There is no systematic deviation of pellets from the center point, O; i.e. deviations are purely random, so that for a given r, all angular positions, θ, are possible. As a result, the probability of finding a pellet in a small area at a distance r from the center is the same for all such areas at the same distance.
2. Independence: The probability of finding a pellet between the points x and x + Δx in the horizontal coordinate is completely independent of the pellet position in the vertical, y, coordinate.
From point 1 above, the pdf we seek must be a function of r alone, say g(r),

where, of course, from standard analytical geometry, the following relationships hold:

x = r\cos\theta; \quad y = r\sin\theta; \quad r^2 = x^2 + y^2 \quad (9.110)

FIGURE 9.5: The Herschel-Maxwell 2-dimensional plane

From point 2, the joint pdf satisfies the condition that

f(x, y) = f(x)f(y) \quad (9.111)

As a result, the probability of finding a pellet in an area of size ΔA is

g(r)\Delta A = f(x)f(y)\Delta A \quad (9.112)

which, upon taking logs and using Eqs (9.110), yields

\ln g(r) = \ln f(r\cos\theta) + \ln f(r\sin\theta) \quad (9.113)

Differentiating with respect to θ results in:

0 = -r\sin\theta\,\frac{f'(r\cos\theta)}{f(r\cos\theta)} + r\cos\theta\,\frac{f'(r\sin\theta)}{f(r\sin\theta)} \quad (9.114)

where the 0 on the LHS arises because r is independent of θ. If we now return to cartesian coordinates, we obtain

\frac{f'(x)}{x f(x)} = \frac{f'(y)}{y f(y)} \quad (9.115)

It is now clear that the variables are entirely separated, since the LHS is a function of x alone and the RHS is a function of y alone; the two sides will then be equal only if each is equal to a constant, say c₁, i.e.

\frac{f'(x)}{f(x)} = c_1 x \quad (9.116)

Integrating and rearranging leads to

\ln f(x) = \frac{1}{2}c_1 x^2 + c_2; \quad \text{or}
f(x) = k e^{\frac{1}{2}c_1 x^2} \quad (9.117)

where the new constant k = exp(c₂).


We now need to determine the integration constants. Because f(x) must be a valid pdf, it must remain finite as x → ±∞; this implies immediately that c₁ must be negative, say c₁ = −1/b², with the result that:

f(x) = k e^{-\frac{1}{2}\frac{x^2}{b^2}} \quad (9.118)

Simultaneously, because of Eq (9.115),

f(y) = k e^{-\frac{1}{2}\frac{y^2}{b^2}} \quad (9.119)

so that

g(r) = f(x)f(y) = k^2 e^{-\frac{1}{2}\frac{r^2}{b^2}} \quad (9.120)

In general, if the point O is not the origin but some other arbitrary point (a_x, a_y) in the plane, then Eq (9.118) becomes:

f(x) = k e^{-\frac{(x - a_x)^2}{2b^2}} \quad (9.121)

And now, since \int_{-\infty}^{\infty} f(x)\,dx = 1, and it can be shown that

\int_{-\infty}^{\infty} e^{-\frac{(x - a_x)^2}{2b^2}}\,dx = b\sqrt{2\pi} \quad (9.122)

then it follows that k = 1/(b\sqrt{2\pi}), so that the required pdf is given by:

f(x) = \frac{1}{b\sqrt{2\pi}}\,e^{-\frac{(x - a_x)^2}{2b^2}} \quad (9.123)

exactly as we had obtained previously.
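As a quick aside, the normalization result in Eq (9.122) is easy to confirm numerically; the short sketch below is an illustrative assumption (using scipy's quadrature routine, with arbitrary test values of b and a_x), not anything prescribed by the text.

```python
# Illustrative numerical check of Eq (9.122); test values are arbitrary.
import numpy as np
from scipy.integrate import quad

b, ax = 1.7, 0.4
integral, _ = quad(lambda x: np.exp(-(x - ax)**2 / (2 * b**2)), -np.inf, np.inf)
print(integral, b * np.sqrt(2 * np.pi))   # both ~4.261
```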

The Model and Some Remarks

The pdf for a Gaussian random variable is

f(x) = \frac{1}{b\sqrt{2\pi}}\exp\left[\frac{-(x - a)^2}{2b^2}\right]; \quad -\infty < x < \infty; \; b > 0 \quad (9.124)

a model characterized by two parameters: a, the location parameter, and b, the scale parameter. Now, because E(X) = μ = a and Var(X) = σ² = b², the more widely encountered form of the pdf is

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[\frac{-(x - \mu)^2}{2\sigma^2}\right] \quad (9.125)

which co-opts the universal symbols for mean and standard deviation for this particular random variable's mean and standard deviation. A variable with this pdf is said to possess a N(μ, σ²) distribution: normal, with mean μ and variance σ².
Some remarks are now in order. The derivations shown above indicate that the random phenomenon underlying the Gaussian random variable is characterized by:
1. Observations composed of the net effect of many small, additive perturbations, some negative, some positive; or
2. Random, symmetric, independent deviations from a target or "true" value.
Thus, such diverse phenomena as measurement errors, heights and weights in human populations, velocities of gas molecules in a container, and even test scores all tend to follow the Gaussian distribution. However, it is important to note that the Gaussian distribution is not a law of nature, contrary to a popular misconception. The synonym "Normal" used to describe this pdf tends to predispose many to assume that just about any random variable has a Gaussian distribution. Of course, this is clearly not true. (Recall that we have already discussed many random variables that do not follow the Gaussian distribution.)
This misconception of near-universality of the Gaussian random variable is also fueled by a property of averages of random variables discussed later in Part IV when we examine the Central Limit Theorem. For now, we caution the reader to be careful how the N(μ, σ²) distribution is assumed for a random variable: if the underlying characteristics are not reasonably close to the ones discussed above, it is unlikely that the Gaussian distribution is appropriate.
Important Mathematical Characteristics
The following are some key mathematical characteristics of the N(μ, σ²) random variable and its pdf:

FIGURE 9.6: Gaussian pdfs for various values of parameters μ and σ: Note the symmetric shapes, how the center of the distribution is determined by μ, and how the shape becomes broader with increasing values of σ
1. Characteristic parameters: μ, the location parameter, is also the mean value; σ, the scale parameter, is also the standard deviation;
2. Mean: E(X) = μ; Mode = μ = Median
3. Variance: Var(X) = σ²
4. Higher Moments: Coefficient of Skewness: γ₃ = 0;
Coefficient of Kurtosis: γ₄ = 3. This value of 3 is the standard reference for kurtosis alluded to in Chapter 4. Recall that distributions for which γ₄ < 3 are said to be platykurtic (mildly peaked) while those for which γ₄ > 3 are said to be leptokurtic (sharply peaked).
5. Moment generating and Characteristic functions:

M(t) = \exp\left\{\mu t + \frac{1}{2}\sigma^2 t^2\right\} \quad (9.126)
\varphi(t) = \exp\left\{j\mu t - \frac{1}{2}\sigma^2 t^2\right\} \quad (9.127)

6. The function is perfectly symmetric about μ; also, at x = μ, f'(x) = 0 and f''(x) < 0, establishing x = μ also as the mode; finally, f''(x) = 0 at x = μ ± σ. See Fig 9.6.
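These characteristics can also be confirmed numerically; the brief sketch below is an illustrative assumption (using scipy.stats, not part of the text), where 'mvsk' returns the mean, variance, skewness, and excess kurtosis (excess 0 corresponds to γ₄ = 3).

```python
# Illustrative check of the listed N(mu, sigma^2) characteristics.
from scipy import stats

mu, sigma = 10.0, 5.0
m, v, s, k = stats.norm(loc=mu, scale=sigma).stats(moments='mvsk')
print(m, v, s, k)   # 10.0, 25.0, 0.0, 0.0 (excess kurtosis 0 <=> gamma_4 = 3)
```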

Applications
The Gaussian random variable plays an important role in statistical inference, where its applications are many and varied. While these applications are discussed more fully in Part IV, we note here that they all involve computing probabilities using f(x), or else, given specified tail area probabilities, using f(x) in reverse to find the corresponding x values. Nowadays, the tasks of carrying out such computations have almost completely been delegated to computer programs; traditionally, practical applications required the use of pre-computed tables of normal cumulative probability values. Because it is impossible (and unrealistic) to generate tables for all conceivable values of μ and σ, the traditional normal probability tables are based on the standard normal random variable, Z, which, as we now discuss, makes it possible to apply these tables for all possible values of μ and σ.

9.2.2 The Standard Normal Random Variable

If the random variable X possesses a N(μ, σ²) distribution, then the random variable defined as:

Z = \frac{X - \mu}{\sigma} \quad (9.128)

has the pdf

f(z) = \frac{1}{\sqrt{2\pi}}\exp\left(\frac{-z^2}{2}\right) \quad (9.129)

Z is called a standard normal random variable; its mean value is 0, its standard deviation is 1, i.e. it has a N(0, 1) distribution. A special case of the general Gaussian random variable, its traditional utility derives from the fact that for any general Gaussian random variable X ~ N(μ, σ²),

P(a_1 < X < a_2) = P\left(\frac{a_1 - \mu}{\sigma} < Z < \frac{a_2 - \mu}{\sigma}\right) \quad (9.130)

so that tables of N(0, 1) probability values for various values of z can be used to compute probabilities for any and all general N(μ, σ²) random variables.
The z-score of any particular value xᵢ of the general Gaussian random variable X ~ N(μ, σ²) is defined as

z_i = \frac{x_i - \mu}{\sigma} \quad (9.131)

Probabilities such as those shown in Eq (9.130) are therefore determined on the basis of the z-scores of the indicated values a₁ and a₂.
Furthermore, because the distribution is symmetric about 0, it is true that:

F_Z(-a) = 1 - F_Z(a) \quad (9.132)

where F_Z(a) is the cumulative probability defined in the usual manner as:

F_Z(a) = \int_{-\infty}^{a} f(z)\,dz \quad (9.133)

Figure 9.7 shows this for the specific case where a = 1.96, for which the tail areas are each 0.025.

FIGURE 9.7: Symmetric tail area probabilities for the standard normal random variable with z = ±1.96 and F_Z(−1.96) = 0.025 = 1 − F_Z(1.96)
This result has the implication that tables of tail area probabilities need
only be made available for positive values of Z. The following example illustrates this point.
Example 9.3 POST-SECONDARY EXAMINATION TEST SCORES
A collection of all the test scores for a standardized, post-secondary examination administered in the 1970s across countries along the West African coast is well-modeled as a random variable X ~ N(μ, σ²) with μ = 270 and σ = 26. If a score of 300 or higher is required for a pass-with-distinction grade, and a score between 260 and 300 is required for a merit-pass grade, what percentage of students will receive the distinction grade and what percentage will receive the merit grade?
Solution:
The problem requires computing the following probabilities: P(X ≥ 300) and P(260 < X < 300).

P(X \geq 300) = 1 - P(X < 300) = 1 - P\left(Z < \frac{300 - 270}{26}\right) = 1 - F_Z(1.154) \quad (9.134)

indicating that the z-score for x = 300 is 1.154. From tables of cumulative probabilities for the standard normal random variable, we obtain F_Z(1.154) = 0.875, so that the required probability is given by

P(X \geq 300) = 0.125 \quad (9.135)

implying that 12.5% of the students will receive the distinction grade.
The second probability is obtained as:

P(260 < X < 300) = P\left(\frac{260 - 270}{26} < Z < \frac{300 - 270}{26}\right) = F_Z\left(\frac{30}{26}\right) - F_Z\left(-\frac{10}{26}\right) \quad (9.136)

And now, by symmetry, F_Z(−10/26) = 1 − F_Z(10/26), so that, from the cumulative probability tables, we now obtain:

P(260 < X < 300) = 0.875 - (1 - 0.650) = 0.525 \quad (9.137)

with the implication that 52.5% of the students will receive the merit grade.
Of course, with the availability of such computer programs as MINITAB, it is possible to obtain the required probabilities directly without recourse to the Z probability tables. In this case, one simply obtains F_X(300) = 0.875 and F_X(260) = 0.35, from which the required probabilities and percentages are obtained straightforwardly.
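For readers working in Python rather than MINITAB, a sketch of the same computation is shown below (the use of scipy.stats is an assumption; any package exposing the normal cdf works the same way).

```python
# Illustrative reproduction of Example 9.3.
from scipy.stats import norm

mu, sigma = 270, 26
print(norm.sf(300, mu, sigma))                              # P(X >= 300) ~ 0.124
print(norm.cdf(300, mu, sigma) - norm.cdf(260, mu, sigma))  # P(260 < X < 300) ~ 0.525
```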

Important Mathematical Characteristics

In addition to inheriting all the mathematical characteristics of the Gaussian random variable, the standard normal random variable in its own right has the following relationship to the Chi-square random variable: If X ~ N(0, 1) then X² ~ χ²(1) (a result that was actually established in Chapter 6, Example 6.3). In general, if X ~ N(μ, σ²), then [(X − μ)/σ]² ~ χ²(1). Finally, by virtue of the reproductive property of the Chi-square random variable, if Xᵢ; i = 1, 2, . . . , n are n independent Gaussian random variables with identical distributions N(μ, σ²), then the random variable defined as:

Y = \sum_{i=1}^{n}\left(\frac{X_i - \mu}{\sigma}\right)^2 \quad (9.138)

possesses a χ²(n) distribution (See Exercise 9.20). These results find extensive application in statistical inference.
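A quick Monte Carlo sketch of Eq (9.138) is given below (an illustrative assumption, with arbitrary test values for μ, σ, and n): the sum of n squared standardized Gaussians should behave like a χ²(n) variable, whose mean is n and variance 2n.

```python
# Illustrative Monte Carlo check of Eq (9.138).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n = 4.0, 2.0, 8
X = rng.normal(mu, sigma, size=(20_000, n))
Y = (((X - mu) / sigma) ** 2).sum(axis=1)
print(Y.mean(), Y.var())                    # chi-square(8): mean 8, variance 16
print(stats.kstest(Y, stats.chi2(n).cdf))   # KS statistic should be small
```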

9.2.3 The Lognormal Random Variable

Basic Characteristics and Model Development

By analogy with the Gaussian random variable, consider a random variable whose observed value is composed of a product of many small independent random quantities, i.e.

X = \prod_{i=1}^{n} X_i \quad (9.139)

Taking natural logarithms of this expression yields:

\ln X = \sum_{i=1}^{n} \ln X_i = \sum_{i=1}^{n} Y_i \quad (9.140)

Following the discussion in the previous sections, we now know that in the limit as n becomes very large, Y = Σᵢ₌₁ⁿ Yᵢ tends to behave like a Gaussian random variable, with the implication that ln X is a random variable with a Gaussian (Normal) pdf. If the mean is designated as α and the variance as β², then the pdf for the random variable Y = ln X is:

g(y) = \frac{1}{\beta\sqrt{2\pi}}\exp\left[\frac{-(y - \alpha)^2}{2\beta^2}\right] \quad (9.141)

Using techniques discussed in Chapter 6, it can be shown (see Exercise 9.21) that from the variable transformation and its inverse,

Y = \ln X; \quad X = e^Y \quad (9.142)

one obtains from Eq (9.141) the required pdf for X as:

f(x) = \frac{1}{x\beta\sqrt{2\pi}}\exp\left[\frac{-(\ln x - \alpha)^2}{2\beta^2}\right]; \quad 0 < x < \infty \quad (9.143)

The random variable X whose pdf is shown above is referred to as a lognormal random variable, for the obvious reason that the (natural) logarithm of X has a normal distribution. Eqn (9.143) is therefore an expression of the pdf for the lognormal random variable with parameters α and β, i.e. X ~ L(α, β²). An alternative form of this pdf is obtained by defining

\alpha = \ln m; \quad m = e^{\alpha} \quad (9.144)

so that Eq (9.143) becomes

f(x) = \frac{1}{x\beta\sqrt{2\pi}}\exp\left[\frac{-(\ln(x/m))^2}{2\beta^2}\right]; \quad 0 < x < \infty \quad (9.145)

Important Mathematical Characteristics

The following are some key mathematical characteristics of the L(α, β²) random variable and its pdf:
1. Characteristic parameters: α (or m = e^α > 0) is the scale parameter; β > 0 (or w = e^{β²}) is the shape parameter. Because of the structural similarity between this pdf and the Gaussian pdf, it is easy to misconstrue α as the mean of the random variable X, and β as the standard deviation. The reader must be careful not to fall into this easy trap: α is the expected value not of X, but of ln X; and β² is the variance of ln X; i.e.

E(\ln X) = \alpha; \quad Var(\ln X) = \beta^2 \quad (9.146)

It is precisely for this reason that in this textbook we have deliberately opted not to use μ and σ in the lognormal pdf: using these symbols leads to too much confusion.
2. Mean: E(X) = exp(α + β²/2) = m e^{β²/2} = m√w;
Mode = m/w = e^{(α − β²)};
Median = m = e^α
3. Variance: Var(X) = e^{(2α + β²)}(e^{β²} − 1) = m²w(w − 1)
Note that Var(X) = [E(X)]²(w − 1)
4. Higher Moments: Coefficient of Skewness: γ₃ = (w + 2)√(w − 1);
Coefficient of Kurtosis: γ₄ = w⁴ + 2w³ + 3w² − 3.
5. Moment generating and Characteristic functions: Even though all moments exist for the lognormal distribution, the MGF does not exist. The characteristic function exists but is quite complicated.
A plot of the lognormal distribution for various values of the shape parameter β, with the scale parameter fixed at α = 0, is shown in Fig 9.8. On the other hand, a plot for various values of α, with the shape parameter fixed at β = 1, is shown in Fig 9.9.
An important point that must not be missed here is as follows: whereas for the Gaussian distribution μ is a location parameter responsible for shifting the distribution, the corresponding parameter for the lognormal distribution, α, does not shift the distribution's location but rather scales its magnitude. This is consistent with the fact that the additive characteristics underlying the Gaussian distribution correspond to multiplicative characteristics in the lognormal distribution. Thus, while a change in μ shifts the location of the Gaussian distribution, a change in α scales (by multiplication) the lognormal distribution. The parameter α is a location parameter only for the distribution of ln X (which is Gaussian), not for the distribution of X.
A final point to note: while the most popular measure of central location, the mean E(X), depends on both α and β for the lognormal distribution, the median, on the other hand, m = e^α, depends only on α. This suggests that the median is a more natural indicator of central location for the lognormal

random variable. By the same token, a more natural measure of dispersion for this random variable is Var(X)/[E(X)]², the variance scaled by the square of the mean value, or, equivalently, the square of C_v, the coefficient of variation: from the expression given above for the variance, this quantity is (w − 1), depending only on w = e^{β²}.

FIGURE 9.8: Lognormal pdfs for scale parameter α = 0 and various values of the shape parameter β. Note how the shape changes, becoming less skewed as β becomes smaller.

FIGURE 9.9: Lognormal pdfs for shape parameter β = 1 and various values of the scale parameter α. Note how the shape remains unchanged while the entire distribution is scaled appropriately depending on the value of α.
Applications
From the preceding considerations regarding the genesis of the lognormal distribution, it is not surprising that the following phenomena generate random variables that are well-modeled with the lognormal pdf:
1. Size of particles obtained by breakage (grinding) or granulation of fine particles;
2. Size of living organisms, especially where growth depends on numerous factors proportional to instantaneous organism size;
3. Personal income, or net worth, or other such quantities for which the current observation is a random proportion of the previous value (e.g., closing price of stocks, or index options).
The lognormal distribution is therefore used for describing particle size distributions in mining and granulation processes as well as in atmospheric studies; for molecular weight distributions of polymers arising from complex reaction kinetics; and for distributions of incomes in a free market economy.
Because it is a skewed distribution just like the gamma distribution, the lognormal distribution is sometimes used to describe such phenomena as latent periods of infectious diseases, or age at the onset of such diseases as Alzheimer's or arthritis: phenomena that are more naturally described by the gamma density, since they involve the time to the occurrence of events driven by multiple cumulative effectors.
Finally, in statistical inference applications, probabilities are traditionally computed for lognormal random variables using Normal probability tables. Observe that if X ~ L(α, β²), then:

P(a_1 < X < a_2) = P(\ln a_1 < Y < \ln a_2) \quad (9.147)

where Y ~ N(α, β²). Thus, using tables of standard normal cumulative probabilities, one is able to obtain from Eq (9.147) that:

P(a_1 < X < a_2) = P\left(\frac{\ln a_1 - \alpha}{\beta} < Z < \frac{\ln a_2 - \alpha}{\beta}\right) = F_Z\left(\frac{\ln a_2 - \alpha}{\beta}\right) - F_Z\left(\frac{\ln a_1 - \alpha}{\beta}\right) \quad (9.148)

However, statistical software packages such as MINITAB provide more efficient means of computing desired probabilities directly, without resorting to tables based on such transformations.
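In Python, for instance, both routes of Eqs (9.147)/(9.148) take only a few lines; the sketch below is an illustrative assumption (scipy's lognorm takes the shape s = β and scale = e^α), using the parameter values that appear in the granulation example that follows.

```python
# Illustrative computation of Eqs (9.147)/(9.148).
import numpy as np
from scipy.stats import lognorm, norm

alpha, beta = 6.8, 0.5         # values used in Example 9.4 below
a1, a2 = 350.0, 1650.0
p_direct = (lognorm.cdf(a2, s=beta, scale=np.exp(alpha))
            - lognorm.cdf(a1, s=beta, scale=np.exp(alpha)))
p_via_z = (norm.cdf((np.log(a2) - alpha) / beta)
           - norm.cdf((np.log(a1) - alpha) / beta))
print(p_direct, p_via_z)       # identical routes; both ~0.858
```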


Example 9.4 PRODUCT QUALITY ATTAINMENT IN GRANULATION PROCESS
Granular products made from pan granulation processes are typically characterized by their particle size distributions (and bulk densities). Material produced in a particular month at a manufacturing site has a particle size distribution that is well-characterized by a lognormal distribution with scale parameter α = 6.8 and shape parameter β = 0.5, i.e. X ~ L(6.8, 0.5). The product quality requirement (in microns) is specified as 350 μm < X < 1650 μm, and the yield of the process is the percentage of the product meeting this requirement. To be profitable, the manufacturing process yield must be at least 85% month-to-month. (i) What is this month's process yield? (ii) If particles for which X < 350 μm are classified as "fines" and can be recycled, how much of the material made during this month falls into this category?
Solution:
The problem requires computing the following probabilities: (i) P(350 < X < 1650) and (ii) P(X < 350) for the L(6.8, 0.5) distribution. These values are obtained directly from MINITAB (or other such software), without using transformations and tables of standard normal cumulative probabilities. First,

P(350 < X < 1650) = 0.858 \quad (9.149)

as shown in Fig (9.10). This indicates a yield of 85.8%, the percentage of material produced meeting the quality requirement; the manufacturing plant is therefore profitable this particular month.
(ii) The second required probability is obtained directly from MINITAB as P(X < 350) = 0.030. Thus, 3% of the material will be classified as "fines" that can be recycled.
These two probabilities imply that there is a residual 11.2% of the product with size in excess of 1650 microns, falling into the oversized category, since by definition:

1 = P(X < 350) + P(350 < X < 1650) + P(X > 1650) \quad (9.150)

FIGURE 9.10: Particle size distribution for the granulation process product: a lognormal distribution with α = 6.8, β = 0.5. The shaded area corresponds to product meeting quality specifications, 350 < X < 1650 microns.

9.2.4 The Rayleigh Random Variable

Let us consider a 2-dimensional vector (X₁, X₂) where the components are mutually independent random variables representing random deviations of hits from a target on a plane whose coordinates are X₁, X₂. The magnitude of this random vector is the random variable X,

X = \sqrt{X_1^2 + X_2^2} \quad (9.151)

In such a case, X is known as a Rayleigh random variable. To obtain the probability model for this random variable, we merely need to recall that this
description is exactly as in the Herschel/Maxwell model presented earlier, except that this time

x^2 = r^2 \quad (9.152)

Thus, from Eq (9.120) (noting that the probability of a hit landing at a radial distance between x and x + dx is g(x) · 2πx dx, and that k² = 1/(2πb²) from the earlier normalization), we obtain immediately that,

f(x) = \frac{x}{b^2}\,e^{-\frac{1}{2}\frac{x^2}{b^2}}; \quad x > 0; \; b > 0 \quad (9.153)

This is the pdf for the Rayleigh random variable, R(b). It can be shown via methods presented in Chapter 6 that if Y₁ ~ N(0, b²) and Y₂ ~ N(0, b²), then

X = \sqrt{Y_1^2 + Y_2^2} \quad (9.154)

possesses a Rayleigh R(b) distribution (See Exercise 9.25).


Important Mathematical Characteristics
The following are some key mathematical characteristics of the R(b²) random variable and its pdf:
1. Characteristic parameter: b > 0 is the scale parameter;
2. Mean: E(X) = b√(π/2); Mode = b; Median = b√(ln 4)
3. Variance: Var(X) = b²(2 − π/2)
4. Higher Moments:
Coefficient of Skewness: γ₃ = 2√π (π − 3)/(4 − π)^{3/2};
Coefficient of Kurtosis: γ₄ = 3 − [(6π² − 24π + 16)/(π² − 8π + 16)].
5. Moment generating and Characteristic functions: Both the MGF and CF exist but are quite complicated.
As a final point of interest, we note that the Weibull pdf with ζ = 2 is identical to a Rayleigh pdf with the parameter b = β/√2. At first blush, this appears to be no more than an odd coincidence since, from the preceding discussions, there is no obvious physical reason for the Weibull distribution, which is most appropriate for reliability and life-testing problems, to encompass the Rayleigh distribution as a special case. In other words, there is no apparent structural reason for a member of the Gaussian family (the Rayleigh random variable) to be a special case of the Weibull distribution, a member of the totally different Gamma family. However, there are two rational justifications for this surprising connection: the first purely empirical, the second more structural.
1. Recall Eq (9.52), where we noted that the Weibull pdf arose from a specification of a generic cumulative hazard function (chf) made deliberately more complex than that of the exponential by simply introducing a power: any other conceivable (and differentiable) chf, H(t), will give rise to a pdf given by Eq (9.52). In this case, the Rayleigh random variable arises by specifying a specific chf (in terms of x, rather than time, t), H(x) = (ηx)². However, this perspective merely connects the Rayleigh to the Weibull distribution through the chf; it gives no insight into why the specific choice of ζ = 2 is a special case of such interest as to represent an entirely unique random variable.
2. For structural insight, we return to the exponential pdf and view it (as we are perfectly at liberty to do) as a distribution of distances of particles from a fixed point in 1-dimension (for example, the length to the discovery of the first flaw on a length of fiber-optic cable), where the particles (e.g. flaws) occur with uniform Poisson intensity η. From this perspective, the hazard function h(x) = η (with the corresponding linear H(x) = ηx) represents the uniform exponential random variable "failure rate," if "failure" is understood as not finding a flaw at a location between x and x + Δx. The 2-dimensional version of this problem, the distribution of the radial distances of flaws from a fixed center point in a plane (where X² = X₁² + X₂²), has a square form for the chf,

H(x) = (\eta x)^2 \quad (9.155)

with a corresponding hazard function h(x) = 2η(ηx), indicating that the "failure rate" (the rate of not finding a flaw at a radial distance x from the center in a 2-dimensional plane) increases with x. This, of course, is precisely the conceptual model used to derive the Rayleigh pdf; it shows how ζ = 2 in the Weibull distribution is structurally compatible with the phenomenon underlying the Rayleigh random variable.

Applications
The Rayleigh distribution finds application in military studies of battle damage assessment, especially in analyzing the distance of bomb hits from desired targets (not surprising, given the discussion above). The distribution is also used in communication theory for modeling communication channels, and for characterizing satellite Synthetic Aperture Radar (SAR) data.
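The Weibull-Rayleigh identity noted above is easy to verify numerically; the sketch below is an illustrative assumption (scipy's weibull_min uses c for the shape exponent and scale for β, and its rayleigh uses scale for b).

```python
# Illustrative check: W(zeta = 2, beta) equals R(b) with b = beta/sqrt(2).
import numpy as np
from scipy.stats import weibull_min, rayleigh

beta = 2.0
b = beta / np.sqrt(2.0)
x = np.linspace(0.1, 5.0, 6)
print(weibull_min.pdf(x, c=2, scale=beta))
print(rayleigh.pdf(x, scale=b))     # identical values
```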

9.2.5 The Generalized Gaussian Model

To conclude, we note that all three random variables discussed above can be represented as special cases of the random variable X with the following pdf:

f(x) = C_1(x)\exp\left\{\frac{-(C_2(x) - C_3)^2}{2C_4}\right\} \quad (9.156)

1. Gaussian (Normal): C₁(x) = 1/(σ√(2π)); C₂(x) = x; C₃ = μ; C₄ = σ²;
2. Lognormal: C₁(x) = 1/(xβ√(2π)); C₂(x) = ln x; C₃ = α; C₄ = β²;
3. Rayleigh: C₁(x) = x/b²; C₂(x) = x; C₃ = 0; C₄ = b².

9.3 Ratio Family Random Variables

The final family of continuous random variables to be discussed in this chapter consists of the following 5 members:

The Beta random variable,
The (Continuous) Uniform random variable,
Fisher's F random variable,
Student's t random variable, and
The Cauchy random variable

This grouping of random variables is far more diverse than either of the earlier two groupings. From the first two, which are defined only on bounded regions of finite size on the real line, to the last two, which are always symmetric and are defined on the entire real line, and the third one, which is defined only on the semi-infinite positive real line, these random variables appear to have nothing in common. Nevertheless, all members of this group do in fact share a very important common characteristic: as the family name implies, they all arise as "ratios" composed of other (previously encountered) random variables.
Some of the most important random variables in statistical inference belong to this family; and one of the benefits of the upcoming discussion is that the ratios from which they arise provide immediate indications of (and justification for) the role each random variable plays in statistical inference. In the interest of limiting the length of a discussion that is already quite long, we will simply state the results, suppressing the derivation details entirely, or else referring the reader to appropriate places in Chapter 6 where we had earlier provided such derivations in anticipation of these current discussions.

9.3.1 The Beta Random Variable

Basic Characteristics and Model Development

Consider two mutually independent gamma random variables Y₁ and Y₂ possessing γ(α, 1) and γ(β, 1) pdfs respectively. Now define a new random variable X as follows:

X = \frac{Y_1}{Y_1 + Y_2} \quad (9.157)

i.e. the proportion contributed by Y₁ to the ensemble sum. Note that 0 < x < 1, so that the values taken by this random variable will be fractional. A random variable thus defined is a beta random variable, and we are now interested in obtaining its pdf.
From the pdfs of the gamma random variables, i.e.

f(y_1) = \frac{1}{\Gamma(\alpha)}\,y_1^{\alpha - 1} e^{-y_1}; \quad 0 < y_1 < \infty \quad (9.158)
f(y_2) = \frac{1}{\Gamma(\beta)}\,y_2^{\beta - 1} e^{-y_2}; \quad 0 < y_2 < \infty \quad (9.159)

we may use methods discussed in Chapter 6 (specifically, see Example 6.3) to obtain that the required pdf for X defined in Eq (9.157) is:

f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha - 1}(1 - x)^{\beta - 1}; \quad 0 < x < 1; \; \alpha > 0; \; \beta > 0 \quad (9.160)

The Model and Some Remarks

The pdf for X, the Beta random variable, B(α, β), is:

f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha - 1}(1 - x)^{\beta - 1}; \quad 0 < x < 1; \; \alpha > 0; \; \beta > 0 \quad (9.161)

The name arises from the relationship between the pdf above and the Beta function defined by:

Beta(\alpha, \beta) = \int_0^1 u^{\alpha - 1}(1 - u)^{\beta - 1}\,du = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)} \quad (9.162)

The pdf in Eq (9.161) is the first continuous model to be restricted to a finite interval, in this case x ∈ [0, 1]. This pdf is defined on this unit interval, but as we show later, it is possible to generalize it to a pdf on an arbitrary finite interval [θ₀, θ₁].
Important Mathematical Characteristics
The following are some key mathematical characteristics of the B(α, β) random variable and its pdf:
1. Characteristic parameters: α > 0 and β > 0 are both shape parameters. The pdf takes on a wide variety of shapes depending on the values of these two parameters. (See below.)
2. Mean: E(X) = α/(α + β);
Mode = (α − 1)/(α + β − 2) for α > 1, β > 1; otherwise no mode exists.
There is no closed form expression for the Median.
3. Variance: Var(X) = αβ/[(α + β)²(α + β + 1)]
4. Higher Moments:
Coefficient of Skewness:

\gamma_3 = \frac{2(\beta - \alpha)}{(\alpha + \beta + 2)}\sqrt{\frac{1}{\alpha} + \frac{1}{\beta} + \frac{1}{\alpha\beta}}

Coefficient of Kurtosis:

\gamma_4 = \frac{3(\alpha + \beta + 1)\left[2(\alpha + \beta)^2 + \alpha\beta(\alpha + \beta - 6)\right]}{\alpha\beta(\alpha + \beta + 2)(\alpha + \beta + 3)}

5. Moment generating and Characteristic functions: Both the MGF and CF exist but are quite complicated.

The Many Shapes of the Beta Distribution

First, here are a few characteristics that can be deduced directly from an examination of the functional form of f(x):
1. For α > 1 and β > 1, the powers to which x and (1 − x) are raised will be positive; as a result, f(x) = 0 at both boundaries x = 0 and x = 1, and the pdf will therefore have to pass through a maximum somewhere in the interval (0,1), at a location determined by the value of α relative to β.
2. Conversely, for 0 < α, β < 1, the powers to which x and (1 − x) are raised will be negative, so that f(x) has asymptotes at both boundaries x = 0 and x = 1.
3. For 0 < α < 1 and β > 1, f(x) has an asymptote at x = 0 but is zero at x = 1; complementarily, for 0 < β < 1 and α > 1, f(x) has an asymptote at x = 1 and is zero at x = 0.
4. For α = β, f(x) is symmetric, with α = β = 1 being a special case in which f(x) is flat, a case of special significance warranting a separate discussion.
From these considerations, the following observations of the shapes of the Beta distribution follow:
1. When α, β > 1, the Beta distribution is unimodal, skewed left when α > β, skewed right when α < β, and symmetric when α = β, as shown in Fig 9.11.
2. When α, β < 1, the Beta distribution is "U-shaped," with a sharper approach to the left asymptote at x = 0 when β > α, a sharper approach to the right asymptote at x = 1 when β < α, and a symmetric U-shape when α = β, as shown in Fig 9.12.
3. When (α − 1)(β − 1) ≤ 0, the Beta distribution is "J-shaped": a left-handed J when α < β, ending at zero at x = 1 when β > 1 and at a non-zero value when β = 1 (as shown in Fig 9.13); the Beta distribution is a right-handed J when α > β (not shown).
4. In the special case when α = 1 and β = 2, the Beta distribution is a right-sloping straight line, as shown in Fig 9.13; when α = 2 and β = 1 we obtain a left-sloping straight line (not shown).

FIGURE 9.11: Unimodal Beta pdfs when α > 1; β > 1: Note the symmetric shape when α = β, and the skewness determined by the value of α relative to β

FIGURE 9.12: U-shaped Beta pdfs when α < 1; β < 1

FIGURE 9.13: Other shapes of the Beta pdfs: It is J-shaped when (α − 1)(β − 1) < 0 and a straight line when α = 1; β = 2
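These parameter regimes are easy to explore numerically; the sketch below is an illustrative assumption (scipy's beta.pdf, with one representative (α, β) pair per regime), not the code used to produce Figs 9.11-9.13.

```python
# Illustrative evaluation of the Beta pdf in the four shape regimes.
import numpy as np
from scipy.stats import beta

x = np.linspace(0.01, 0.99, 5)
cases = {"unimodal": (1.5, 5.0),    # alpha, beta > 1
         "U-shaped": (0.5, 0.5),    # alpha, beta < 1
         "J-shaped": (0.5, 2.0),    # (alpha - 1)(beta - 1) < 0
         "line":     (1.0, 2.0)}    # f(x) = 2(1 - x)
for name, (a, b) in cases.items():
    print(name, np.round(beta.pdf(x, a, b), 3))
```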
Applications
The Beta distribution naturally provides a good model for many random phenomena involving proportions. For example, it is used in Bayesian analysis (see later) for describing a priori knowledge about p, the Binomial probability of success. Another practical application (mostly in quality assurance) arises from the following result:
Given n independent random observations, ξ₁, ξ₂, . . . , ξₙ, from a phenomenon possessing an arbitrary pdf, rank the observations from the smallest to the largest as y₁, y₂, . . . , yₙ (i.e. with y₁ as the smallest and yₙ as the largest). If y_r is the rth-smallest and y_{n−s+1} is the sth-largest, then regardless of the underlying pdf of the variable, X, the proportion of the population between y_r and y_{n−s+1} possesses a B(α, β) distribution with α = (n − s + 1) − r, and β = r + s, i.e.

f(x; n, r, s) = \frac{\Gamma(n + 1)}{\Gamma(n - r - s + 1)\Gamma(r + s)}\,x^{n - r - s}(1 - x)^{r + s - 1} \quad (9.163)

This important result frees one from making any assumptions about the underlying pdf of the population from which the original quality data ξ₁, ξ₂, . . . , ξₙ came.
The example below illustrates yet another application of the Beta distribution, in functional genomics studies.
Example 9.5 DIFFERENTIAL GENE EXPRESSION STATUS FROM MICROARRAYS
In functional genomics, one of the objectives is to provide a quantitative (as opposed to qualitative) understanding of the functions of genes and how they regulate the function of complex living organisms. The advent of the high-throughput microarray technology has made it possible to collect expression data on every gene in a cell simultaneously. Such microarray data, usually presented in the form of fluorescence signal intensities measured from each spot i on a microarray, yield an ordered pair (y_{i1}, y_{i0}), with y_{i1} coming from the gene in question under "test" conditions (e.g. from a cancer cell), and y_{i0} under "control" conditions (e.g. normal cell). In theory, if the gene is up-regulated under test conditions, FC = y_{i1}/y_{i0} > 1, FC < 1 if down-regulated, and FC = 1 if non-differentially regulated. This quantity, FC, is the so-called "fold-change" associated with this gene under the test conditions. However, because of measurement noise and other myriad technical considerations, this ratio is difficult to characterize statistically and not terribly reliable by itself.
It has been suggested in Gelmi⁶ to use the fractional intensity xᵢ defined by:

x_i = \frac{y_{i1}}{y_{i1} + y_{i0}} \quad (9.164)

because this fraction can be shown to possess a Beta distribution. In this case, theoretically, xᵢ > 0.5 for up-regulated genes, xᵢ < 0.5 for down-regulated genes, and xᵢ = 0.5 for non-differentially regulated ones, with noise introducing uncertainties into the computed values. Microarray data in the form of xᵢ may thus be analyzed using the Beta distribution to provide probabilities that the gene in question is up-, down-, or non-differentially-regulated.
Replicate data on a particular gene of interest on a microarray yielded an average fractional intensity value x̄ = 0.185 and standard deviation s = 0.147. Using estimation techniques discussed in Part IV, it has been determined that this result indicates that the fractional intensity data can be considered as a sample from a theoretical Beta distribution with α = 1.2; β = 5.0. Determine the probability that the gene is up-regulated.
Solution:
Recalling that if a gene in question is truly up-regulated, xᵢ > 0.5, the problem requires computing P(Xᵢ > 0.5). This is obtained directly from MINITAB as 0.0431 (See Fig 9.14), indicating that it is highly unlikely that this particular gene is in fact up-regulated.

⁶Gelmi, C. A. (2006). A novel probabilistic framework for microarray data analysis: From fundamental probability models to experimental validation, PhD Thesis, University of Delaware.

FIGURE 9.14: Theoretical distribution, B(1.2, 5.0), for characterizing fractional microarray intensities for the example gene: The shaded area corresponds to the probability that the gene in question is upregulated.
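The same probability can be obtained outside MINITAB; the one-line sketch below (an illustrative assumption using scipy's survival function) reproduces the computation.

```python
# Illustrative reproduction of the Example 9.5 computation.
from scipy.stats import beta

print(beta.sf(0.5, 1.2, 5.0))   # P(X > 0.5) ~ 0.043, matching the quoted 0.0431
```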

9.3.2 Extensions and Special Cases of the Beta Random Variable

Generalized Beta Random Variable

It is relatively easy to generalize the Beta pdf to cover the interval (θ₀, θ₁), where θ₀ and θ₁ are not necessarily 0 and 1, respectively. The general pdf is:

f(x) = \frac{1}{(\theta_1 - \theta_0)}\,\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\left(\frac{x - \theta_0}{\theta_1 - \theta_0}\right)^{\alpha - 1}\left(1 - \frac{x - \theta_0}{\theta_1 - \theta_0}\right)^{\beta - 1} \quad (9.165)

\theta_0 < x < \theta_1; \; \alpha > 0; \; \beta > 0

Inverted Beta Random Variable

Let the random variable X₁ have a B(α, β) pdf and define a new random variable as

X = \frac{X_1}{1 - X_1} \quad (9.166)

It is easy to show from the preceding discussion (on the genesis of the Beta random variable from contributing gamma random variables Y₁ and Y₂) that X defined as in Eq (9.166) above may also be represented as:

X = \frac{Y_1}{Y_2} \quad (9.167)

i.e. as a ratio of gamma random variables. The random variable X defined in this manner has the Inverted Beta pdf:

f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\frac{x^{\alpha - 1}}{(1 + x)^{\alpha + \beta}}; \quad x > 0; \; \alpha > 0; \; \beta > 0 \quad (9.168)

This is sometimes also referred to by the somewhat more cumbersome name "Beta distribution of the second kind." This random variable is related to the F distribution to be discussed shortly.

9.3.3 The (Continuous) Uniform Random Variable

Basic Characteristics, Model and Remarks

When α = β = 1 in the Beta pdf, the result is the special function:

f(x) = 1; \quad 0 < x < 1 \quad (9.169)

a pdf of a random variable that is uniform on the interval (0,1). The general uniform random variable, X, defined on the interval (a, b), has the pdf:

f(x) = \frac{1}{b - a}; \quad a < x < b \quad (9.170)

It is called a uniform random variable, U(a, b), for the obvious reason that, unlike all the other distributions discussed thus far, the probability description for this random variable is completely uniform, favoring no particular value in the specified range. The uniform random variable on the unit interval (0,1) is known as a standard uniform random variable.
Important Mathematical Characteristics
The following are some key mathematical characteristics of the U(a, b) random variable and its pdf:
1. Characteristic parameters: a, b jointly form the range, with a as the location parameter (see Fig 9.15). Narrower distributions are taller than wider ones because the total probability area must equal 1 in every case.
2. Mean: E(X) = (a + b)/2;
Median = Mean;
Mode: non-unique; all values in the interval (a, b);
3. Variance: Var(X) = (b − a)²/12
4. Moment generating and Characteristic functions:

M(t) = \frac{e^{bt} - e^{at}}{t(b - a)}; \quad \varphi(t) = \frac{e^{jbt} - e^{jat}}{jt(b - a)}
FIGURE 9.15: Two uniform distributions over different ranges (0,1) and (2,10). Since the total area under the pdf must be 1, the narrower pdf is proportionately taller than the wider one.

Of the direct relationships between the U(a, b) random variable and other random variables, the most important are: (i) If X is a U(0, 1) random variable, then Y = −β ln X is an exponential random variable E(β). This result was established in Example 6.2 in Chapter 6. (ii) If X ~ U(0, 1) then Y = 1 − X^{1/β} is a Beta random variable B(1, β).
Applications
The uniform pdf is the obvious choice for describing equiprobable events on bounded regions of the real line. (The discrete version is used for equiprobable discrete events in a sample space.) But perhaps its most significant application is for generating random numbers for other distributions.
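The sketch below illustrates this last point under stated assumptions (Python/NumPy, using the two U(0,1) relationships noted above): a single stream of standard uniform draws is transformed into exponential and B(1, β) samples.

```python
# Illustrative inverse-transform sampling from U(0,1).
import numpy as np

rng = np.random.default_rng(2)
u = rng.uniform(size=100_000)

beta_e = 3.0                       # exponential mean
x_exp = -beta_e * np.log(u)        # Y = -beta*ln(U) ~ E(beta)
print(x_exp.mean())                # ~3.0

b = 4.0
x_beta = 1.0 - u ** (1.0 / b)      # Y = 1 - U^(1/beta) ~ B(1, beta)
print(x_beta.mean(), 1.0 / (1.0 + b))   # both ~0.2
```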

9.3.4 Fisher's F Random Variable

Basic Characteristics, Model and Remarks

Consider two mutually independent random variables Y₁ and Y₂ respectively possessing χ²(ν₁) and χ²(ν₂) distributions; a random variable X defined as:

X = \frac{Y_1/\nu_1}{Y_2/\nu_2} \quad (9.171)

is a Fisher F random variable, F(ν₁, ν₂), named in honor of R. A. Fisher (1890-1962), the father of the field of Statistical Design of Experiments. It is possible to show that its pdf is:

f(x) = \frac{\Gamma\left[\frac{1}{2}(\nu_1 + \nu_2)\right]\nu_1^{\nu_1/2}\nu_2^{\nu_2/2}}{\Gamma(\nu_1/2)\Gamma(\nu_2/2)}\,\frac{x^{(\nu_1/2 - 1)}}{(\nu_1 x + \nu_2)^{(\nu_1 + \nu_2)/2}}; \quad 0 < x < \infty \quad (9.172)

Important Mathematical Characteristics

The following are some key mathematical characteristics of the F(ν₁, ν₂) random variable and its pdf:
1. Characteristic parameters: ν₁, ν₂, parameters inherited directly from the contributing Chi-square random variables, retain their Chi-square distribution characteristics as degrees of freedom.
2. Mean:

E(X) = \mu = \frac{\nu_2}{\nu_2 - 2}; \quad \nu_2 > 2

Mode:

x^* = \frac{\nu_2(\nu_1 - 2)}{\nu_1(\nu_2 + 2)}; \quad \nu_1 > 2

3. Variance:

Var(X) = \sigma^2 = \frac{2\nu_2^2(\nu_1 + \nu_2 - 2)}{\nu_1(\nu_2 - 2)^2(\nu_2 - 4)}; \quad \nu_2 > 4 \quad (9.173)

Expressions for Skewness and Kurtosis are a bit complicated; the MGF does not exist and the expression for the CF is complicated.
Figure 9.16 shows two F distributions for the same value of ν₂ = 15 but different values of ν₁.
The F distribution is related to two additional pdfs as follows: If X has an F distribution with ν₁, ν₂ degrees of freedom, then

Y = \frac{(\nu_1/\nu_2)X}{1 + (\nu_1/\nu_2)X} \quad (9.174)

has a Beta B(ν₁/2, ν₂/2) distribution.
Alternatively, if Y has an Inverted Beta distribution with α = ν₁/2; β = ν₂/2, then

X = \frac{\nu_2}{\nu_1}\,Y \quad (9.175)

has an F(ν₁, ν₂) distribution.

FIGURE 9.16: Two F distribution plots for different values of ν₁, the first degree of freedom, but the same value of ν₂. Note how the mode shifts to the right as ν₁ increases

Applications
The F distribution is used extensively in statistical inference to make probability statements about the ratio of variances of random variables, providing the basis for the "F-test." It is the theoretical probability tool for ANOVA (Analysis of Variance). Values of P(X ≥ x) are traditionally tabulated for various values of ν₁, ν₂ and selected values of x, and referred to as "F-tables." Once again, computer programs have made such tables somewhat obsolete.
As shown in Part IV, the F distribution is one of the central "quartet" of pdfs at the core of statistical inference, the other three being the Gaussian (Normal) N(μ, σ²), the Chi-square χ²(r), and the Student t-distribution discussed next.
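A Monte Carlo sketch of Eq (9.171) is shown below (an illustrative assumption with arbitrary degrees of freedom): the ratio of two independent chi-squares, each scaled by its degrees of freedom, should match the F(ν₁, ν₂) distribution, including the mean ν₂/(ν₂ − 2).

```python
# Illustrative Monte Carlo check of Eq (9.171).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
nu1, nu2, N = 5, 15, 50_000
x = (rng.chisquare(nu1, N) / nu1) / (rng.chisquare(nu2, N) / nu2)
print(x.mean(), nu2 / (nu2 - 2))               # both ~1.154
print(stats.kstest(x, stats.f(nu1, nu2).cdf))  # KS statistic should be small
```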

9.3.5 Student's t Random Variable

Basic Characteristics, Model and Remarks

Let Z be a standard normal random variable (i.e. Z ~ N(0, 1)), and let Y, a random variable independent of Z, possess a χ²(ν) distribution. Then the random variable defined as the following ratio:

X = \frac{Z}{\sqrt{Y/\nu}} \quad (9.176)

is a Student's t random variable, t(ν).

FIGURE 9.17: Three t-distribution plots for degrees of freedom values ν = 5, 10, 100. Note the symmetrical shape and the heavier tail for smaller values of ν.

It can be shown that its pdf is:

f(x) = \frac{\Gamma\left[\frac{1}{2}(\nu + 1)\right]}{\sqrt{\pi\nu}\,\Gamma(\nu/2)}\,\frac{1}{\left[1 + \frac{x^2}{\nu}\right]^{(\nu + 1)/2}}; \quad -\infty < x < \infty \quad (9.177)

First derived in 1908 by W. S. Gosset (1876-1937), a chemist at the Guinness Brewing Company who published under the pseudonym "Student," this f(x) is known as Student's t-distribution, or simply the t-distribution.
Even though it may not be immediately obvious from the somewhat awkward-looking form of f(x) shown above, the pdf for a t(ν) random variable is symmetrical about x = 0 and appears like a heavier-tailed version of the standard normal N(0, 1) distribution. In fact, in the limit as ν → ∞, the t-distribution tends to N(0, 1). In practice, for ν ≥ 50, the t-distribution is virtually indistinguishable from the standard normal distribution. These facts are illustrated in Figure 9.17, which shows t-distributions for three different degrees of freedom, ν = 5, 10, 100; Figure 9.18, which compares the t(5) and standard normal distributions; and finally Figure 9.19, which shows that the t-distribution with 50 degrees of freedom is practically identical to the N(0, 1) distribution.
FIGURE 9.18: A comparison of the t-distribution with ν = 5 with the standard normal N(0,1) distribution. Note the similarity as well as the t-distribution's comparatively heavier tail.

FIGURE 9.19: A comparison of the t-distribution with ν = 50 with the standard normal N(0,1) distribution. The two distributions are practically indistinguishable.

Important Mathematical Characteristics
The following are some key mathematical characteristics of the t(ν) random variable and its pdf:
1. Characteristic parameter: ν, the degrees of freedom;
2. Mean: E(X) = μ = 0; Mode = 0 = Median


3. Variance: Var(X) = σ² = ν/(ν − 2); ν > 2
4. Higher Moments: Coefficient of Skewness: γ₃ = 0 for ν > 3 (indicating that the distribution is always symmetric);
Coefficient of Kurtosis:

\gamma_4 = 3\left(\frac{\nu - 2}{\nu - 4}\right); \quad \nu > 4

This shows that for smaller values of ν, the kurtosis exceeds the standard reference value of 3 for the normal random variable (implying a heavier, leptokurtic tail); as ν → ∞, however, γ₄ → 3.
Of the relationships between the t(ν) random variable and other random variables, the most important are: (i) lim_{ν→∞} t(ν) → N(0, 1), as noted earlier; and (ii) If X ~ t(ν), then X² ~ F(1, ν).
Applications
The t-distribution is used extensively in statistical inference, especially for comparing the means of two populations given only finite sample data. It is a key theoretical probability tool used in problems requiring tests of hypotheses involving means, providing the basis of the familiar "t-test."
As with all the other distributions employed in data analysis and statistical inference, tables of tail area probabilities P(X ≥ x) are available for various degrees of freedom values. Once again, the ability to compute these probabilities directly using computer programs is making such tables obsolete.
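The sketch below (an illustrative assumption using scipy.stats) makes the heavier-tail point of Figs 9.17-9.19 concrete: the upper-tail probability of t(ν) at a fixed point approaches the N(0,1) value as ν grows.

```python
# Illustrative t(nu) vs N(0,1) upper-tail comparison.
from scipy.stats import norm, t

for nu in (5, 10, 50):
    print(nu, t.sf(2.5, nu))    # e.g. nu = 5: P(T > 2.5) ~ 0.027
print("N(0,1):", norm.sf(2.5))  # P(Z > 2.5) ~ 0.006
```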

9.3.6 The Cauchy Random Variable

Basic Characteristics, Model and Remarks

If Y₁ is a Normal random variable with a N(0, σ₁²) distribution, and Y₂, independent of Y₁, is a Normal random variable with a N(0, σ₂²) distribution, then the random variable, X, defined as the following ratio of the two random variables,

X = \frac{Y_1}{Y_2} \quad (9.178)

is a Cauchy random variable. If σ₁/σ₂, the ratio of the contributing standard deviations, is designated as γ, then it can be shown that the pdf for the Cauchy random variable is:

f(x) = \frac{1}{\pi\gamma}\,\frac{1}{\left(1 + \frac{x^2}{\gamma^2}\right)}; \quad -\infty < x < \infty; \; \gamma > 0 \quad (9.179)

In particular, when σ₁ = σ₂ = 1 (so that Y₁ and Y₂ are independent standard normal random variables), γ = 1 above and the pdf becomes

f(x) = \frac{1}{\pi}\,\frac{1}{(1 + x^2)}; \quad -\infty < x < \infty \quad (9.180)

the expression for the standard Cauchy distribution, an expression that was derived in Example 6.9 in Chapter 6.

FIGURE 9.20: A comparison of the standard Cauchy distribution with the standard normal N(0,1) distribution. Note the general similarities as well as the Cauchy distribution's substantially heavier tail.
The pdf for the general Cauchy random variable, C(α, γ²), is:

f(x) = \frac{1}{\pi\gamma}\,\frac{1}{\left[1 + \frac{(x - \alpha)^2}{\gamma^2}\right]}; \quad -\infty < x < \infty; \; \gamma > 0 \quad (9.181)

In a manner somewhat reminiscent of the t-distribution, the Cauchy distribution is also symmetric (about α in the general case, about 0 in the standard case), but with much heavier tails than the normal distribution. (See Fig 9.20.)

Important Mathematical Characteristics

The Cauchy random variable is quite unusual in that its pdf has no finite moments, i.e. E(X^k) does not exist for any k > 0. The characteristic parameters α and γ are therefore not what one would expect: they do not represent the mean and standard deviation.
1. Characteristic parameters: α is the location parameter; γ is the scale parameter.
2. Mean: E(X) does not exist; Mode = α = Median
3. Variance: Var(X) does not exist; neither does any other moment.

Of the relationships between the Cauchy random variable and other random variables, the most notable are: (i) by its very composition as a ratio of zero-mean Gaussian random variables, if X is a Cauchy random variable, its reciprocal 1/X is also a Cauchy random variable; (ii) the standard Cauchy random variable C(0, 1) is a special (pathological) case of the t-distribution with ν = 1 degree of freedom. When we discuss the statistical implication of degrees of freedom in Part IV, it will become clearer why ν = 1 is a pathological case.
Applications
The Cauchy distribution is used mostly to represent otherwise symmetric phenomena where the occurrences of extreme values that are significantly far from the central values are not so rare. The most common application is in modeling high-resolution rates of price fluctuations in financial markets. Such data tend to have heavier tails and are hence poorly modeled by Gaussian distributions.
It is not difficult to see, from the genesis of the Cauchy random variable as a ratio of two Gaussians, why such applications are structurally reasonable. Price fluctuation rates are approximated as a ratio of ΔP, the change in the price of a unit of goods, and Δt, the time interval over which the price change has been computed. Both are independent random variables (prices may remain steady for a long time and then change rapidly over short periods of time) that tend, under normal "elastic" market conditions, to fluctuate around some mean value. Hence, ΔP/Δt will appear as a ratio of two Gaussian random variables and will naturally follow a Cauchy distribution.
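A brief simulation sketch (an illustrative assumption, Python/NumPy/SciPy) closes the loop on Eq (9.178): a ratio of two independent standard normals is consistent with C(0, 1), its median is well-behaved, but its running sample mean never settles down because E(X) does not exist.

```python
# Illustrative Monte Carlo sketch of Eq (9.178).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.standard_normal(100_000) / rng.standard_normal(100_000)
print(stats.kstest(x, stats.cauchy.cdf))   # consistent with C(0, 1)
print(np.median(x))                        # ~0: the median is well-behaved
# Running sample means do not converge (no E(X)):
print([x[:n].mean() for n in (10**3, 10**4, 10**5)])
```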

9.4 Summary and Conclusions

Using precisely the same techniques and principles employed in the previous chapter, we have turned our attention in this chapter to the complementary task of model development for continuous random variables, with the discrete Poisson random variable serving as the connecting bridge between the two chapters. Our emphasis has been on the fundamentals of how each probability model arises for these continuous random variables, with the derivation details presented explicitly in many cases, or else left as exercises.
By design, we have encountered these continuous random variables and their models in family groups whose members share certain common structural traits: first, the Gamma family of strictly positive random variables, typically used to represent phenomena involving intervals of time and space (length, area, volume), or, as with the Chi-square random variable, squared and other variance-related phenomena; next the Gaussian family, with functional forms indicative of squared deviations from a target; and finally the Ratio family of random variables strategically composed from ratios of other random variables. From the description of the phenomenon in question, the ensuing model derivation, the resulting model and its mathematical characterization, and from the illustrative examples given for each random variable, it should be clear for which practical phenomena these models are most applicable. And for some of these random variables, what we have seen in this chapter is no more than just a brief cameo appearance; we will most definitely see them and their models again, in their more natural application settings. In particular, when the topic of statistics and statistical inference is studied in detail in Part IV, we will be reacquainted with a quartet of random variables and pdfs whose role in such studies is dominant and central: the Gaussian distribution, for representing, precisely or approximately, an assortment of random variations related to experimental data and functions of such data; and the Chi-square, Student's t, and Fisher's F distributions, for testing various hypotheses. Also, it should not be surprising when members of the Gamma family, especially the exponential and Weibull random variables, reappear to play central roles in the discussion of reliability and life testing in Chapter 21.
This is an admittedly long chapter; and yet, for all its length and breadth of coverage, it is by no means exhaustive. Some continuous random variables were not discussed at all; and the introduction to mixture distributions (where a parameter of a probability model is itself a random variable, leading to a compounding of models) was all-too-brief, restricted only to the Poisson-gamma model, which happens to lead to the negative binomial distribution. Some of the omitted pdfs have been introduced as exercises at the end of the chapter, e.g., the double exponential (Laplace), inverse gamma, logistic, and Pareto distributions, in the category of continuous pdfs, and the Beta-Binomial mixture model. Nevertheless, in terms of intended scope of coverage (the variety of continuous, and some mixed, probability models that have been discussed), one can consider the task begun in Chapter 8 as having now been concluded in this chapter. However, in terms of model development techniques (the "how" of probability model development, as opposed to the models themselves), there is one more topic to consider: what to do when available information on the random variable of interest is incomplete. This is the next chapter's focus.
Table 9.1, similar to Table 8.2 in Chapter 8, is a summary of the main characteristics, models, and other important features of the continuous random variables discussed in this chapter. Fig 9.21 provides a schematic consolidation of all the random variables we have encountered, discrete and continuous, and the connections among them.

TABLE 9.1: Summary of probability models for continuous random variables

Random Variable | Probability Model f(x) | Range | Mean E(X) | Variance Var(X) | Relation to Other Variables
Exponential E(β) | (1/β) e^{−x/β} | (0, ∞) | β | β² | Inter-Poisson intervals
Gamma γ(α, β) | [1/(β^α Γ(α))] x^{α−1} e^{−x/β} | (0, ∞) | αβ | αβ² | X_i ~ E(β): Σ_{i=1}^{n} X_i ~ γ(n, β)
Chi-Square χ²(r) | [1/(2^{r/2} Γ(r/2))] x^{r/2−1} e^{−x/2} | (0, ∞) | r | 2r | χ²(r) = γ(r/2, 2); X_i ~ N(0, 1): Σ_{i=1}^{n} X_i² ~ χ²(n)
Weibull W(ζ, β) | (ζ/β)(x/β)^{ζ−1} e^{−(x/β)^ζ} | (0, ∞) | βΓ(1 + 1/ζ) | β²Γ(1 + 2/ζ) − μ² | Y ~ E(β): Y^{1/ζ} ~ W(ζ, β^{1/ζ})
Gaussian (Normal) N(μ, σ²) | [1/(σ√(2π))] exp[−(x − μ)²/(2σ²)] | (−∞, ∞) | μ | σ² | lim_{n→∞} Bi(n, p)
Lognormal L(α, β) | [1/(xβ√(2π))] exp[−(ln x − α)²/(2β²)] | (0, ∞) | exp(α + β²/2) | e^{2α+β²}(e^{β²} − 1) | Y ~ N(α, β²): X = e^Y ~ L(α, β)
Rayleigh R(b²) | (x/b²) e^{−x²/(2b²)} | (0, ∞) | b√(π/2) | b²(2 − π/2) | Y₁, Y₂ ~ N(0, b²): √(Y₁² + Y₂²) ~ R(b²)
Beta B(α, β) | [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1}(1 − x)^{β−1} | (0, 1) | α/(α + β) | αβ/[(α + β)²(α + β + 1)] | X₁ ~ γ(α, 1), X₂ ~ γ(β, 1): X₁/(X₁ + X₂) ~ B(α, β)
Uniform U(a, b) | 1/(b − a) | (a, b) | (a + b)/2 | (b − a)²/12 | B(α = β = 1) = U(0, 1)
Fisher F(ν₁, ν₂) | See Eq (9.172) | (0, ∞) | ν₂/(ν₂ − 2) | See Eq (9.173) | Y₁ ~ χ²(ν₁), Y₂ ~ χ²(ν₂): (Y₁/ν₁)/(Y₂/ν₂) ~ F(ν₁, ν₂)
Student's t(ν) | See Eq (9.177) | (−∞, ∞) | 0 (Median = 0) | ν/(ν − 2) | Z ~ N(0, 1), Y ~ χ²(ν): Z/√(Y/ν) ~ t(ν); lim_{ν→∞} t(ν) = N(0, 1)
Cauchy C(0, 1) | (1/π)[1/(1 + x²)] | (−∞, ∞) | N/A (Median = 0) | N/A | Y₁, Y₂ ~ N(0, 1): Y₁/Y₂ ~ C(0, 1)

Ideal Models of Continuous Random Variables

Uniform
(Discrete)

319

DISCRETE

Hypergeometric

Bernoulli

Binomial
[Multinomial]

Negative
Binomial
(Pascal)
Geometric

Poisson

CONTINOUS
GAUSSIAN
FAMILY

Weibull

Exponential

CONTINOUS
GAMMA
FAMILY
Gamma
(Erlang)

Gaussian
(Normal)

Lognormal

Rayleigh

Cauchy

CONTINOUSRATIOFAMILY

Chi-Sq (1)

Students t

Chi-Sq (r)

Fishers F

Beta

Uniform
(Continuous)

FIGURE 9.21: Common probability distributions and connections among them

320

Random Phenomena

Gamma family of random variables?


3. Which member of the Gamma family of distributions is structurally different from the other three, and in what way is it different?
4. What is the relationship between the exponential and the Poisson random variables?
5. What are the basic characteristics of the exponential random variable?
6. What is the probability model for the exponential random variable?
7. How are the geometric and exponential random variables related?
8. The exponential pdf finds application in what class of problems?
9. Why is the exponential pdf known as a memoryless distribution? Are there
other distributions with this characteristic?
10. What are the basic characteristics of the gamma random variable?
11. How is the gamma random variable related to the Poisson random variable?
12. How are the gamma and exponential random variables related?
13. What is the probability model for the gamma random variable and what do the
parameters represent?
14. Under what condition is the gamma distribution known as the Erlang distribution?
15. What does it mean that the gamma random variable possesses a reproductive
property?
16. The gamma pdf finds application in what class of problems?
17. What is the relationship between the Chi-square and gamma random variables?
18. What are the basic characteristics of the Chi-square random variable?
19. What is the probability model for the Chi-square random variable and what is
the single parameter called?
20. In what broad area does the Chi-square pdf find application?
21. What differentiates the Weibull random variable from the exponential random variable?


22. What are the basic characteristics of the Weibull random variable?
23. What is the probability model for the Weibull random variable?
24. Why is the Weibull pdf parameter β known as the "characteristic life"?
25. The Weibull pdf finds application in what class of problems?
26. What mixture pdf arises from a Poisson pdf whose parameter is gamma distributed?
27. What are the three members of the Gaussian family of random variables discussed in this chapter?
28. What are the common structural characteristics shared by the members of the
Gaussian family of random variables?
29. What are the three approaches used in this chapter to derive the probability
model for the Gaussian random variable?
30. What are the basic characteristics of the Gaussian random variable?
31. What is the probability model for the Gaussian random variable and what do
the parameters represent?
32. In what broad area does the Gaussian pdf play an important role?
33. What is the probability model for the standard normal random variable? How
is the standard normal random variable related to the Gaussian random variable?
34. What is the z-score of any particular value xi of the general Gaussian random variable with mean μ and variance σ²? How is it useful for computing probabilities for general Gaussian distributions?
35. What are the basic characteristics of the lognormal random variable?
36. How is the lognormal random variable related to the Gaussian (normal) random
variable?
37. What is the probability model for the lognormal random variable?
38. What trap is to be avoided in interpreting what the parameters of the lognormal
distribution represent?
39. What is the difference between the σ parameter for the normal distribution and the corresponding β parameter for the lognormal distribution in terms of how changes in each parameter modify the distribution it characterizes?


40. Which phenomena are well-modeled by the lognormal pdf?


41. What are the basic characteristics of the Rayleigh random variable?
42. What is the probability model for the Rayleigh random variable?
43. What is the relationship between the Weibull distribution and the Rayleigh distribution and why does this appear to be an odd coincidence?
44. What are some rational justifications for why the Rayleigh random variable is
related to the Weibull random variable?
45. In what class of problems does the Rayleigh pdf find application?
46. What are the four members of the Ratio family of random variables discussed
in this chapter?
47. What are the common structural characteristics shared by the members of the
Ratio family of random variables?
48. The Beta random variable is composed as a ratio of which random variables?
49. What is the probability model for the Beta random variable?
50. What are the various shapes possible with a Beta pdf, and what specific combinations of distribution parameters result in which shape?
51. The Beta pdf provides a good model for what types of random phenomena?
52. What is an inverted Beta random variable and how is it related to the Beta
random variable?
53. What are the basic characteristics of the (continuous) uniform random variable?
54. What is the probability model for the (continuous) uniform random variable?
55. How is the (continuous) uniform random variable related to the Beta random
variable?
56. What is the (continuous) uniform pdf used for mostly?
57. Fisher's F random variable is composed as a ratio of which random variables?
58. What is the relationship between Fisher's F distribution and the Beta distribution?
59. What is the F distribution used for most extensively?


60. What are the four central pdfs used most extensively in statistical inference?
61. Student's t random variable is composed as a ratio of which random variables?
62. What is the relationship between Student's t distribution and the standard normal distribution?
63. What is the t distribution used for most extensively?
64. The Cauchy random variable is composed as a ratio of which random variables?
65. What is the probability model for the Cauchy random variable?
66. How many moments exist for the Cauchy distribution?
67. The Cauchy distribution is used mostly for what?

EXERCISES
Section 9.1
9.1 (i) On the same graph, plot the pdf for the discrete geometric G(0.25) and the continuous exponential E(4) distributions. Repeat this for the following additional pairs of distributions: G(0.8) and E(1.25); and G(0.5) and E(2).
(ii) These plots show specific cases in which the pdf of the geometric random variable G(p) is seen to be a discretized version of the continuous pdf for the exponential random variable E(1/p), and vice-versa: that the E(β) distribution is a continuous version of the discrete G(1/β) distribution. First, show that for the geometric random variable, the following relationship holds:

[f(x + 1) − f(x)] / f(x) = −p

which is a finite difference discretization of the expression,

(1/f(x)) (df(x)/dx) = −p.

From here, establish the general result that the pdf of a geometric random variable G(p) is the discretized version of the pdf of the exponential random variable, E(1/p).
9.2 Establish that the median of the exponential random variable, E(β), is β ln 2, and that its hazard function is

h(t) = 1/β = η
9.3 Given two independent random variables, X₁ and X₂, with identical exponential E(β) distributions, show that the pdf of their difference,

Y = X₁ − X₂    (9.182)

is the double exponential (or Laplace) distribution defined as:

f(y) = (λ/2) e^{−λ|y|};  −∞ < y < ∞    (9.183)

where λ = 1/β.
9.4 Revisit Exercise 9.3. Directly from the pdf in Eq (9.183), and the formal definitions of moments of a random variable, obtain the mean and variance of Y. Next, obtain the mean and variance from Eq (9.182) by using the result that, because the two random variables are independent,

E(Y) = E(X₁) − E(X₂)
Var(Y) = Var(X₁) + Var(X₂)

9.5 Given that X ~ E(1), i.e., an exponentially distributed random variable with parameter β = 1, determine the following probabilities:
(i) P(X − μ_X ≥ 3σ_X), where μ_X is the mean value, and σ_X is the standard deviation, the positive square root of the variance, σ².
(ii) P(μ_X − 2σ_X < X < μ_X + 2σ_X)
9.6 (i) Establish the result in Eq (9.17).
(ii) From the definition of the gamma function:

Γ(α) = ∫₀^∞ e^{−y} y^{α−1} dy    (9.184)

establish the following properties:
(a) Γ(1) = 1;
(b) Γ(α) = (α − 1)Γ(α − 1); α > 1;
(c) Γ(α) = (α − 1)! for integer α.
9.7 For the random variable X with the γ(α, β) pdf:

f(x) = [1/(β^α Γ(α))] e^{−x/β} x^{α−1};  x > 0;  α, β > 0

establish the following about the moments of this pdf:
(i) μ = αβ;
(ii) σ² = αβ²;
(iii) M(t) = (1 − βt)^{−α}
9.8 Show that if Yᵢ is an exponential random variable with parameter β, i.e., Yᵢ ~ E(β), then the random variable X defined as in Eq (9.37), i.e.,

X = Σ_{i=1}^{α} Yᵢ

is the gamma random variable γ(α, β).


9.9 Establish the following results for the gamma random variable:
(i) If Xᵢ, i = 1, 2, . . . , n, are n independent gamma random variables, each with different shape parameters αᵢ but a common scale parameter β, i.e., Xᵢ ~ γ(αᵢ, β), then the random variable Y defined as:

Y = Σ_{i=1}^{n} Xᵢ

is also a gamma random variable, with shape parameter α* = Σ_{i=1}^{n} αᵢ and scale parameter β, i.e., Y ~ γ(α*, β).
(ii) Show that the random variable Z defined as

Z = c Σ_{i=1}^{n} Xᵢ

where c is a constant, is also a gamma random variable with shape parameter α* = Σ_{i=1}^{n} αᵢ but with scale parameter cβ, i.e., Z ~ γ(α*, cβ).
9.10 The distribution of residence times in a standard size continuous stirred tank reactor (CSTR) is known to be exponential with β = 1, i.e., E(1). If X is the residence time for a reactor that is five times the standard size, then its distribution is also known as E(0.2). On the other hand, Y, the residence time in an ensemble of five identical, standard size CSTRs in series, is known to be gamma distributed with α = 5; β = 1.
(i) Plot the pdf f(x) for the single large CSTR's residence time distribution and the pdf f(y) for the ensemble of five identical small reactors in series. Determine the mean residence time in each case.
(ii) Compute P(Y ≤ 5) and compare with P(X ≤ 5).
9.11 Given that X ~ γ(α, β), show that the pdf for Y, the Inverse Gamma (IG) random variable defined by Y = 1/X, is given by:

f(y) = [(1/β)^α / Γ(α)] e^{−(1/β)/y} y^{−α−1};  0 < y < ∞    (9.185)

Determine the mean, mode and variance for this random variable.
9.12 Establish the following results: (i) if Y ~ E(β), then

X = Y^{1/ζ}

is a W(ζ, β) random variable; and (ii) conversely, that if X ~ W(ζ, β), then

Y = X^ζ

is an E(β) random variable.
9.13 A fluidized bed reactor through which chlorine gas flows has a temperature probe that fails periodically due to corrosion. The length of time (in days) during which the temperature probe functions is known to be a Weibull distributed random variable X, with parameters β = 10; ζ = 2.
(i) Determine the number of days each probe is expected to last.


(ii) If the reactor operator wishes to run a product campaign that lasts continuously
for 20 days, determine the probability of running this campaign without having to
replace the temperature probe.
(iii) What is the probability that the probe will function for anywhere between 10
and 15 days?
(iv) What is the probability that the probe will fail on or before the 10th day?
9.14 Suppose that the time-to-failure (in minutes) of certain electronic device components, when subjected to continuous vibrations, may be considered as a random variable having a Weibull W(ζ, β) distribution with ζ = 1/2 and β = 1/10: first find how long we may expect such a component to last, and then find the probability that such a component will fail in less than 5 hours.
Section 9.2
9.15 Given two random variables X and Z related according to Eq (9.78), i.e.,

Z = (X − μₓ)/σₓ

where, as defined in the text, E(X) = μₓ and Var(X) = σₓ²,
(i) Show that E(Z) = 0 and Var(Z) = 1.
(ii) If the pdf of Z is as given in Eq (9.90), i.e.,

f(z) = [1/√(2π)] e^{−z²/2}

determine the pdf for the random variable X and hence confirm Eq (9.91).
9.16 Given a Gaussian distributed random variable, X, with μ = 100; σ = 10, determine the following probabilities:
(i) P(μ − 1.96σ < X < μ + 1.96σ) and P(μ − 3σ < X < μ + 3σ)
(ii) P(X > 123) and P(74.2 < X < 126)
9.17 Given Z, a standard normal random variable, determine the specific variate z₀ that satisfies each of the following probability statements:
(i) P(Z ≥ z₀) = 0.05; P(Z ≥ z₀) = 0.025
(ii) P(Z ≤ z₀) = 0.025; P(Z ≥ z₀) = 0.10; P(Z ≤ z₀) = 0.10
(iii) P(|Z| ≥ z₀) = 0.00135
9.18 Given Z, a standard normal random variable, determine the following probabilities:
(i) P(−1.96 < Z < 1.96) and P(−1.645 < Z < 1.645)
(ii) P(−2 < Z < 2) and P(−3 < Z < 3)
(iii) P(|Z| ≥ 1)
9.19 Consider the random variable X with the following pdf:

f(x) = e^{−x} / (1 + e^{−x})²    (9.186)

This is the pdf of the (standard) logistic distribution.


(i) Show that E(X) = 0 for this random variable.


(ii) On the same graph, plot this pdf and the standard normal pdf. Compare and
contrast the two pdfs. Discuss under which condition you would recommend using
the logistic distribution instead of the standard normal distribution.
9.20 Let Xᵢ; i = 1, 2, . . . , n, be n independent Gaussian random variables with identical distributions N(μ, σ²); show that the random variable defined as in Eq (9.138), i.e.,

Y = Σ_{i=1}^{n} [(Xᵢ − μ)/σ]²

possesses a χ²(n) distribution.
9.21 Show that if the random variable Y has a normal N(μ, σ²) distribution, then the random variable X defined as

X = e^Y

has a lognormal distribution, with parameters α and β, as shown in Eq (9.143). Obtain an explicit relationship between (α, β) and (μ, σ²).
9.22 Revisit Exercise 9.21 and establish the reciprocal result that if the random variable X has a lognormal L(α, β) distribution, then the random variable Y defined as

Y = ln X

has a normal N(μ, σ²) distribution.
9.23 Given a lognormal distributed random variable X with parameters α = 0; β = 0.2, determine its mean, μ_X, and variance, σ_X². On the same graph, plot the pdf, f(x), and that for the Gaussian random variable with the same mean and variance as X. Compare the two plots.
9.24 Revisit Exercise 9.23. Compute P(μ_X − 1.96σ_X < X < μ_X + 1.96σ_X) from the lognormal distribution. Had this random variable been mistakenly assumed to be Gaussian with the same mean and variance, compute the same probability and compare the results.
9.25 Show that if Y₁ ~ N(0, b²) and Y₂ ~ N(0, b²), then

X = √(Y₁² + Y₂²)    (9.187)

possesses a Rayleigh R(b²) distribution, with pdf given by Eq (9.153).


9.26 Given a random variable X with a Rayleigh R(b²) distribution, obtain the pdf of the random variable Y defined as

Y = X²
Section 9.3


9.27 Confirm that if a random variable X has a Beta B(α, β) distribution, the mode of the pdf occurs at:

x* = (α − 1)/(α + β − 2)    (9.188)

and hence deduce (a) that no mode exists when 0 < α < 1 and α + β > 2; and (b) that when a mode exists, this mode and the mean will coincide if, and only if, α = β. (You may simply recall the expression for the mean given in the text; you need not rederive it.)
9.28 The Beta-Binomial mixture distribution arises from a Binomial Bi(n, p) random variable, X, whose parameter p, rather than being constant, has a Beta distribution, i.e., it consists of a conditional distribution,

f(x|p) = [n!/(x!(n − x)!)] p^x (1 − p)^{n−x}

in conjunction with the marginal distribution for p,

f(p) = [Γ(α + β)/(Γ(α)Γ(β))] p^{α−1} (1 − p)^{β−1};  0 < p < 1;  α > 0;  β > 0

Obtain the expression for f(x), the resulting Beta-Binomial pdf.


9.29 Given a random variable X with a standard uniform U(0, 1) distribution, i.e.,

f(x) = 1;  0 < x < 1

and any two points a₁, a₂ in the interval (0, 1), such that a₁ < a₂ and a₁ + a₂ ≤ 1, show that

P[a₁ < X < (a₁ + a₂)] = a₂.

In general, if f(x) is uniform in the interval (a, b), and if a ≤ a₁, a₁ < a₂ and a₁ + a₂ ≤ b, show that:

P[a₁ < X < (a₁ + a₂)] = a₂/(b − a)

9.30 For a random variable X that is uniformly distributed over the interval (a, b):
(i) Determine P(X > [αa + (1 − α)b]); 0 < α < 1;
(ii) For the specific case where a = 1, b = 3, determine P(μ_X − 2σ_X < X < μ_X + 2σ_X), where μ_X is the mean of the random variable, and σ_X is the positive square root of its variance.
(iii) Again for the specific case where a = 1, b = 3, find the symmetric interval (x₁, x₂) around the mean, μ_X, such that P(x₁ < X < x₂) = 0.95.
9.31 Consider the random variable, X, with pdf:

f(x) = (α − 1) x^{−α};  x ≥ 1;  α > 2

known as a Pareto random variable.
(i) Show that for this random variable,

E(X) = (α − 1)/(α − 2);  α > 2    (9.189)

(ii) Determine the median and the variance of X.


9.32 Given a random variable X that has an F(49, 49) distribution, determine the following probabilities:
(i) P(X ≥ 1)
(ii) P(X ≥ 2); P(X ≤ 0.5)
(iii) P(X ≥ 1.76); P(X ≤ 0.568)
9.33 Given a random variable X that has an F(ν₁, ν₂) distribution, determine the value x₀ such that (a) P(X ≥ x₀) = 0.025; (b) P(X ≤ x₀) = 0.025, for the following specific cases:
(i) ν₁ = ν₂ = 49
(ii) ν₁ = ν₂ = 39
(iii) ν₁ = ν₂ = 29
(iv) Comment on the effect that reducing the degrees of freedom has on the various values of x₀.
9.34 Given a random variable X that has a t(ν) distribution, determine the value of x₀ such that P(|X| < x₀) = 0.025 for (i) ν = 5; (ii) ν = 25; (iii) ν = 50; (iv) ν = 100. Compare your results with the single value of x₀ such that P(|X| < x₀) = 0.025 for a standard normal random variable X.
9.35 Plot on the same graph the pdfs for a Cauchy C(5, 4) random variable and for a Gaussian N(5, 4) random variable. Compute the probability P(X ≤ μ + 1.96σ) for each random variable, where, for the Gaussian random variable, μ and σ are the mean and standard deviation (positive square root of the variance), respectively; for the Cauchy distribution, μ and σ are the location and scale parameters, respectively.
9.36 Plot on the same graph the pdf for the logistic distribution given in Eq (9.186)
of Exercise 9.19 and that of the standard Cauchy random variable. Which pdf has
the heavier tails?

APPLICATION PROBLEMS
9.37 The waiting time in days between the arrival of tornadoes in a county in south
central United States is known to be an exponentially distributed random variable
whose mean value remains constant throughout the year. Given that the probability
is 0.6 that more than 30 days will elapse between tornadoes, determine the expected
number of tornadoes in the next 90 days.
9.38 The time-to-failure, T, of an electronic component is known to be an exponentially distributed random variable with pdf

f(t) = η e^{−ηt};  0 < t < ∞;  and 0 elsewhere    (9.190)

where the failure rate, η = 0.075 per 100 hours of operation.
(i) If the component reliability function Rᵢ(t) is defined as

Rᵢ(t) = P(T > t)    (9.191)

i.e., the probability that the component functions at least up until time t, obtain an explicit expression for Rᵢ(t) for this electronic component.
(ii) A system consisting of two such components in parallel functions if at least one of them functions. Again assuming that both components are identical, find the system reliability Rₚ(t) and compute Rₚ(1000), the probability that the system survives at least 1,000 hours of operation.
9.39 Life-testing results on a first generation microprocessor-based (computer-controlled) toaster indicate that X, the life-span (in years) of the central control chip, is a random variable that is reasonably well-modeled by the exponential pdf:

f(x) = η e^{−ηx};  x > 0    (9.192)

with η = 0.16. A malfunctioning chip will have to be replaced to restore proper toaster operation.
(i) The warranty for the chip is to be set at x_w years (in integers) such that no more than 15% would have to be replaced before the warranty period expires. Find x_w.
(ii) In planning for the second generation toaster, design engineers wish to set a target value η = η₂ to aim for, such that 85% of the second generation chips survive beyond 3 years. Determine η₂ and interpret your results in terms of the implied fold increase in mean life-span from the first to the second generation of chips.
9.40 The table below shows frequency data on distances between DNA replication origins (inter-origin distances), measured in vivo in Chinese Hamster Ovary (CHO) cells by Li et al., (2003)7, as reported in Chapter 7 of Birtwistle (2008)8. The data is similar to that in Fig 9.3 in the text.

Inter-Origin Distance (kb), x | Relative Frequency, fr(x)
  0  | 0.00
 15  | 0.02
 30  | 0.20
 45  | 0.32
 60  | 0.16
 75  | 0.08
 90  | 0.11
105  | 0.03
120  | 0.02
135  | 0.01
150  | 0.00
165  | 0.01

(i) Determine the mean (average) and variance of the CHO cells' inter-origin distance.
(ii) If this is a gamma distributed random variable, use the results in (i) to provide reasonable values for the gamma distribution parameters. On the same graph, plot the frequency data and the gamma model fit. Comment on the model fit to the data.

7 Li, F., Chen, J., Solessio, E. and Gilbert, D. M. (2003). Spatial distribution and specification of mammalian replication origins during G1 phase. J Cell Biol 161, 257-66.
8 M. R. Birtwistle, (2008). Modeling and Analysis of the ErbB Signaling Network: From Single Cells to Tumorigenesis, PhD Dissertation, University of Delaware.


(iii) It is known that DNA synthesis is initiated at replication origins, which are distributed non-uniformly throughout the genome, at an average rate of r origins per kb. However, in some mammalian cells, because there is a non-zero probability that any particular replication origin will not fire, some potential origins are skipped over, so that, in effect, k of such skips must take place (on average) before DNA synthesis can begin. What do the values estimated for the gamma distribution imply about the physiological parameters r and k?
9.41 The storage time (in months) until a collection of long-life Li/SO₄ batteries become unusable was modeled in Morris (1987)9 as a Weibull distributed random variable with ζ = 2 and β = 10. Let us refer to this variable as the battery's maximum storage life, MSL.
(i) What is the most likely value of the MSL? (By definition, the "most likely" value is that value of the random variable for which the pdf attains a maximum.)
(ii) What is the median MSL? By how much does it differ from the expected MSL?
(iii) What is the probability that a battery has an MSL value exceeding 18 months?
9.42 A brilliant paediatrician has such excellent diagnostic skills that, without resorting to expensive and sometimes painful tests, she rarely misdiagnoses what ails her patients. Her overall average misdiagnosis rate of 1 per 1,000 consultations is all the more remarkable given that many of her patients are often too young to describe their symptoms adequately, when they can describe them at all; the paediatrician must therefore often depend on indirect information extracted from the parents and guardians during clinic visits. Because of her other responsibilities in the clinic, she must limit her patient load to precisely 10 per day for 325 days a year. While the total number of her misdiagnoses is clearly a Poisson random variable, the Poisson parameter λ = ηt is not constant because of the variability in her patient population, both in age and in the ability of parents and guardians to communicate effectively on behalf of their non-verbal charges. If λ has a gamma distribution with α = 13, β = 0.25,
(i) Determine the probability that the paediatrician records exactly 3 misdiagnoses in a year; determine also the probability of recording 3 or fewer misdiagnoses.
(ii) Compare these probabilities with the corresponding ones computed using a standard Poisson model with a constant parameter.
9.43 The year-end bonuses of cash, stock and stock options (in thousands of US dollars) given to senior technical researchers in a leading chemical manufacturing company, each year over a five-year period from 1990-1994, has a lognormal distribution with scale parameter α = 3 and shape parameter β = 0.5.
(i) Determine the probability that someone selected at random from this group received a bonus of $20,000 or higher.
(ii) If a bonus of $100,000 or higher is considered a "Jumbo" bonus, what percentage of senior technical researchers received such bonuses during this period?
(iii) If a bonus in the range $10,000-$30,000 is considered more "typical," what percentage received this typical bonus?
9 Morris, M.D. (1987). A sequential experimental design for estimating a scale parameter
from quantal life testing data. Technometrics, 29, 173-181


9.44 The home prices (in thousands of dollars) in a county located in the upper mid-Atlantic region of the United States is a lognormal random variable with a median of 403 and a mode of 245.
(i) What percentage of the homes in this region cost more than $500,000?
(ii) If a home is considered "affordable" in this region if it costs between $150,000 and $300,000, what percentage of homes fall into this category?
(iii) Plot the pdf for this random variable. Compute its mean and indicate its value on the plot along with the value given for the median. Which seems to be more representative of the central location of the distribution, the mean or the median?
9.45 If the proportion of students who obtain failing grades on a foreign University's highly competitive annual entrance examination can be considered as a Beta B(2, 3) random variable,
(i) What is the mode of this distribution, and what percentage of students can be expected to fail annually?
(ii) Determine the probability that over 90% of the students will pass this examination in any given year.
(iii) The proportion of students from an elite college preparatory school (located in this same foreign country) who fail this same entrance examination has a Beta B(1, 7) distribution. Determine the percentage of this select group of students who can be expected to fail; also, determine the probability that over 90% of these elite students will pass this examination in any given year.
(iv) Do these results mean that the elite college preparatory school does better in getting its students admitted into this highly selective foreign University?
9.46 The place kicker on a team in the American National Football League (NFL) has an all-time success rate (total number of field goals made divided by total number of field goals attempted) of 0.82 on field goal attempts of 55 yards or shorter. An attempt to quantify his performance with a probability model resulted in a Beta B(4.5, 1) distribution.
(i) Is this model consistent with the computed all-time success rate?
(ii) To be considered an elite place kicker, the success rate from this distance (D ≤ 55) needs to improve to at least 0.9. Determine the probability that this particular place kicker achieves elite status in any season, assuming that he maintains his current performance level.
(iii) It is known that the computed probability of attaining elite status is sensitive to the model parameters, especially α. For the same fixed value β = 1, compute the probability of attaining elite status for the values α = 3.5, 4.0, 4.5, 5.0, 5.5. Plot these probabilities as a function of α.
9.47 If the fluorescence signals obtained from a test spot and the reference spot on a microarray (a device used to quantify changes in gene expression) are represented as random variables X₁ and X₂ respectively, it is possible to show that if these variables can be assumed to be independent, then they are reasonably represented by gamma distributions. In this case, the fold change ratio

Y = X₁/X₂    (9.193)

indicative of the fold increase (or decrease) in the signal intensity between test and reference conditions, has the inverted Beta distribution, with the pdf

f(y) = [Γ(α + β)/(Γ(α)Γ(β))] y^{α−1}/(1 + y)^{α+β};  y > 0;  α > 0;  β > 0    (9.194)

The theoretical distribution with parameters α = 4.8; β = 2.1 fit the fold change ratio for a particular set of data. Because of the detection threshold of the measurement technology, the genes in question are declared to be overexpressed only if Y ≥ 2; if 0.5 ≤ Y < 2, the conclusion is that there is insufficient evidence of differential expression.
(i) Determine the expected fold change ratio.
(ii) Determine the probability that a gene selected at random from this population will be identified as overexpressed.
(iii) Determine the probability that there will be insufficient evidence to conclude that there is differential expression. (Hint: it may be easier to invoke the result that the variable Z, defined as Z = X₁/(X₁ + X₂), has a Beta distribution with the same values of α and β given for the inverted Beta distribution. In this case, the probabilities can be computed in terms of Z rather than Y.)
9.48 The sample variance for the yield data presented in Chapter 1 may be determined as s²_A = 2.05 for process A, and s²_B = 7.62 for process B. If, but for random variability in the data, these variances are the same, then the ratio

x_AB = s²_A / s²_B

should be approximately equal to 1, but will not be exactly so (because of random variability). Given that if these two variances are theoretically the same, then this ratio is known to be a random variable X with an F(49, 49) distribution,
(i) Determine the probability P(X ≥ x_AB), that this random variable takes a value as large as, or larger than, the computed x_AB by pure chance alone, when the two variances are in fact the same.
(ii) Determine the values f₁ and f₂ such that

P(X ≤ f₁) = 0.025
P(X ≥ f₂) = 0.025

(iii) What do these results imply about the plausibility of the conjecture that, but for random variability, the variances of the data obtained from the two processes are in fact the same?


Chapter 10

Information, Entropy and Probability Models

10.1 Uncertainty and Information .......................................... 336
  10.1.1 Basic Concepts .................................................. 336
  10.1.2 Quantifying Information ......................................... 337
10.2 Entropy .............................................................. 338
  10.2.1 Discrete Random Variables ....................................... 338
  10.2.2 Continuous Random Variables ..................................... 340
10.3 Maximum Entropy Principles for Probability Modeling .................. 344
10.4 Some Maximum Entropy Models .......................................... 344
  10.4.1 Discrete Random Variable; Known Range ........................... 345
  10.4.2 Discrete Random Variable; Known Mean ............................ 346
  10.4.3 Continuous Random Variable; Known Range ......................... 347
  10.4.4 Continuous Random Variable; Known Mean .......................... 348
  10.4.5 Continuous Random Variable; Known Mean and Variance ............. 349
  10.4.6 Continuous Random Variable; Known Range, Mean and Variance ...... 350
10.5 Maximum Entropy Models from General Expectations ..................... 351
  10.5.1 Single Expectations ............................................. 351
    Discrete Case ........................................................ 351
    Continuous Case ...................................................... 352
  10.5.2 Multiple Expectations ........................................... 352
10.6 Summary and Conclusions .............................................. 354
REVIEW QUESTIONS .......................................................... 355
EXERCISES ................................................................. 357
APPLICATION PROBLEMS ...................................................... 360

For since the fabric of the universe is most perfect
and the work of a most wise Creator,
nothing at all takes place in the universe in which
some rule of maximum or minimum does not appear.
Leonhard Euler (1707-1783)

The defining characteristic of the random variable is that uncertainty in individual outcomes co-exists with regularity in the aggregate ensemble. This aggregate ensemble is conveniently characterized by the pdf, f(x); and, as shown in the preceding chapters, given all there is to know about the phenomenon behind the random variable, X, one can derive expressions for f(x) from first principles. There are many practical cases, however, where the available information is insufficient to specify the full pdf. One can still obtain reasonable probability models from such incomplete information, but this will require an alternate view of the outcomes of random experiments in terms of the information each conveys, from which derives the concept of the "entropy" of a random variable. This chapter is concerned first with introducing the concept of entropy as a means of quantifying uncertainty in terms of the amount of information conveyed by the observation of a random variable's outcome. We then subsequently present a procedure that utilizes entropy to specify full pdfs in the face of incomplete information. The chapter casts most of the results of the two previous chapters in a different context that will be of interest to engineers and scientists.

10.1 Uncertainty and Information

10.1.1 Basic Concepts

Consider that after performing an experiment, a discrete random variable X is observed to take on the specific value xi; for example, after running an experimental fiber spinning machine for 24 hours straight, X, the total number of line breaks during this period, is found to be 3. It is of interest to ask:
How much information about X is conveyed in the result X = xi? i.e., how much does the observation that X = xi add to our knowledge about the random variable X?
Equivalently, we could ask: Prior to making the observation, how much uncertainty is associated with predicting the outcome X = xi ? thereby recasting
the original question conversely as:
How much uncertainty is resolved by the specic observation that
X = xi ?
Clearly the answer to either version of this question depends on how much
inherent variability is associated with the random variable X. If P (X = xi )
is relatively high (making it more likely than not that the event X = xi will
occur), the actual observation will not be very informative in the complementary sense that (i) we could have deduced this outcome with a high degree
of certainty before performing the experiment; and (ii) there was not much
uncertainty to be resolved by this observation. In the extreme, the occurrence
of a certain event is thus entirely uninformative since there was no uncertainty
to begin with, and therefore the observation adds nothing to what we already
knew before the experiment.
Conversely, if P (X = xi ) is relatively low, it is less likely that X will take
the value xi , and the actual observation X = xi will be quite informative.
The high degree of uncertainty associated with predicting a priori the occurrence of a rare event makes the actual observation of such an event (upon experimentation) very informative.

Information, Entropy and Probability Models

337

Let us illustrate with the following example. Case 1 involves a bag containing exactly two balls, 1 red, 1 blue. For Case 2, we add to this bag 10 green,
10 black, 6 white, 6 yellow, 3 purple and 3 orange balls to bring the total to
40 balls. The experiment is to draw a ball from this bag and to consider for
each case, the event that the drawn ball is red. For Case 1, the probability
of drawing a red ball, P1 (Red), is 1/2; for Case 2 the probability P2 (Red) is
1/40. Drawing a red ball is therefore considered more informative in Case 2
than in Case 1.
Another perspective of what makes the drawing of a red ball more informative in Case 2 is that P2 (Red)=1/40 indicates that the presence of a red
ball in the Case 2 bag is a fact that will take a lot of trials on average to
ascertain. On the other hand, it requires two trials, on average, to ascertain
this fact in the Case 1 bag.
To summarize:
1. The information content of (or uncertainty associated with) the statement X = xi increases as P (X = xi ) decreases;
2. The greater the dispersion of the distribution of possible values of X, the
greater the uncertainty associated with the specic result that X = xi
and the lower the P (X = xi ).
We now formalize this qualitative conceptual discussion.

10.1.2 Quantifying Information

For the discrete random variable X, let P(X = xi) = f(xi) = pi; define


I(X = xi ) (or simply I(xi )) as the information content in the statement that
the event X = xi has occurred. From the discussion above, we know that I(xi )
should increase as pi decreases, and vice versa. Formally, akin to the axioms
of probability, I(xi ) must satisfy the following conditions:
1. I(xi) ≥ 0; i.e., it must be non-negative;
2. For a certain event, I(xi ) = 0;
3. For two stochastically independent random variables X, Y , let P (X =
xi ) = pi and P (Y = yi ) = qi be the probabilities of the outcome of each
indicated event; and let I(xi ) and I(yi ) be the respective information
contents.
The total information content in the statement X = xi and Y = yi is
the sum:
I(xi , yi ) = I(xi ) + I(yi )
(10.1)
This latter condition must hold because by their independence, the occurrence
of one event has no eect on the occurrence of the other; as such one piece


of information is additional to the other, even though, also by independence,


the probability of joint occurrence is the product:
P (X = xi , Y = yi ) = pi qi

(10.2)

Claude Shannon in 1948 established that, up to a multiplicative constant, the desired unique measure of information content is defined by:1

I(X = xi) = −log₂ P(X = xi) = −log₂ f(xi)    (10.3)

Note that this function satisfies all three conditions stated above.

1 Shannon, C.E. (1948). A mathematical theory of communication. Bell System Tech. J., 27, 379-423 and 623-656.
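The two-bag example of the preceding subsection is easily quantified with Eq (10.3). As an illustrative aside (plain Python, no library assumptions beyond the standard math module):

```python
import math

def information_bits(p):
    """Information content I = -log2(p) of an event with probability p, Eq (10.3)."""
    return -math.log2(p)

# Drawing the red ball:
print(information_bits(1 / 2))    # Case 1 (2 balls):  1.0 bit
print(information_bits(1 / 40))   # Case 2 (40 balls): ~5.32 bits
```

The rarer event carries more than five times the information of the even-odds event, exactly as the qualitative discussion suggested.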

10.2 Entropy

10.2.1 Discrete Random Variables

For the discrete random variable, X, and not just for a specific outcome X = xi, Shannon suggests E[−log₂ f(x)], the average or mean information content, as a suitable measure of the information content in the pdf f(x). This quantity, known as the entropy of the random variable, is defined by:

H(X) = E[−log₂ f(x)] = Σ_{i=1}^{n} [−log₂ f(xi)] f(xi)    (10.4)

Expressed in this form, H(X) has units of "bits" (for binary digits), a term that harks back to the original application for which the concepts were developed: a problem involving the characterization of the average minimum binary codeword length required to encode the output of an information source.
The expression for entropy is also sometimes written in terms of natural logarithms (with the matching, if whimsical, unit of "nats") as:

H(X) = E[−ln f(x)] = Σ_{i=1}^{n} [−ln f(xi)] f(xi)    (10.5)

One form differs from the other by only a multiplicative constant (specifically ln 2).
Example 10.1 ENTROPY OF A DETERMINISTIC VARIABLE
Compute the entropy of a variable X that takes on the value x₀ with probability 1, i.e., a deterministic variable that is always equal to x₀.
Solution:
Since for this variable, f(x₀) = 1 and 0 otherwise, by definition,

H(X) = −log₂(1) × 1 = 0    (10.6)

so that the entropy of a deterministic variable is zero.

This example illustrates that when there is no uncertainty associated with a


random variable, its entropy is zero.
Example 10.2 ENTROPY OF A DISCRETE UNIFORM RANDOM VARIABLE
Compute the entropy of the discrete uniform random variable, X ~ U_D(k), whose pdf is given by:

f(xi) = 1/k;  i = 1, 2, . . . , k.    (10.7)

Solution:
In this case,

H(X) = Σ_{i=1}^{k} [−log₂(1/k)] (1/k)    (10.8)
     = −k (1/k) log₂(1/k) = log₂ k    (10.9)

In this last example, note that:
1. In the limit as k → ∞ (i.e., the random variable can take any of an infinite number of possible discrete values), H(X) → ∞. Thus, as k → ∞ the uncertainty in X increases to the worst possible limit of complete uncertainty; the entropy also increases to match.
2. As k becomes smaller, uncertainty is reduced and H(X) also becomes smaller; and when k = 1, the random variable becomes deterministic and H(X) = 0, as obtained earlier in Example 10.1.
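Both Examples 10.1 and 10.2 are easy to confirm numerically. The short Python sketch below (an illustrative aside) implements Eq (10.4) directly for any finite pmf:

```python
import math

def entropy_bits(pmf):
    """H(X) = sum of [-log2 f(xi)] f(xi) over i, per Eq (10.4), in bits."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy_bits([1.0]))        # deterministic variable (Example 10.1): 0.0
print(entropy_bits([1/8] * 8))    # discrete uniform, k = 8 (Example 10.2): 3.0
```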
Example 10.3 ENTROPY OF A BERNOULLI (BINARY) RANDOM VARIABLE
Compute the entropy of the Bernoulli random variable whose pdf is given by:

f(x) = 1 − p,  x = 0;
       p,      x = 1;
       0,      elsewhere.    (10.10)

Solution:
By definition, the entropy for this random variable is

H(X) = −(1 − p) log₂(1 − p) − p log₂ p    (10.11)

FIGURE 10.1: The entropy function of a Bernoulli random variable, H(x) vs p

This is a symmetric function of p which, as shown in Fig 10.1, attains a maximum H*(x) = 1 when p = 0.5. Thus, the maximum entropy for a binary random variable is 1 bit, attained when the outcomes are equiprobable.
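A direct numerical scan of Eq (10.11) confirms this; the following Python sketch (illustrative only) locates the maximum:

```python
import math

def bernoulli_entropy(p):
    """H(p) = -(1-p)log2(1-p) - p*log2(p), Eq (10.11); H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(1 - p) * math.log2(1 - p) - p * math.log2(p)

grid = [i / 100 for i in range(101)]
p_star = max(grid, key=bernoulli_entropy)
print(p_star, bernoulli_entropy(p_star))   # 0.5  1.0
```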

10.2.2 Continuous Random Variables

For continuous variables, the statement X = xi is somewhat meaningless, since it indicates an impossible event with a zero probability of occurrence. As defined in Eq (10.4), therefore, the entropy of all continuous random variables is infinite, which is not very useful. To extend the concept of entropy in a more useful manner to fundamentally continuous random variables thus requires additional considerations. Specifically, we must now introduce the concepts of quantization (or discretization) and differential entropy, as follows.
Quantization
For a continuous random variable X with a pdf f(x), we know that if the set A = {x : a < x < b}, then P(A) = P(X ∈ A) is defined by:

P(A) = P(a < X < b) = ∫_a^b f(x) dx    (10.12)

Let us now consider the case in which the interval [a, b] is divided into n subintervals of equal length Δx, so that the i-th subinterval Ai = {x : xi < x < xi + Δx}. Then, we recall that, for sufficiently small Δx, the probability that X takes a value in this i-th subinterval is given by:

P(Ai) = P(xi < X < (xi + Δx)) ≈ f(xi)Δx    (10.13)

And since A is the union of n disjoint sets Ai, it follows from Eq (10.13) that:

P(A) = Σ_{i=1}^{n} P(Ai) ≈ Σ_{i=1}^{n} f(xi)Δx    (10.14)

from where we obtain the familiar result that in the limit as the quantization interval length Δx → 0, the sum in (10.14) approaches the Riemann integral in (10.12), and in addition, the approximation error vanishes. But as it stands, the expression in (10.13) is a statement of the "differential probability" that X takes on a value in the differential interval between xi and xi + Δx for small, but non-zero, Δx.
Now, let Q(X) be a quantization function that places the continuous random variable anywhere in any one of the n quantized subintervals, such that by Q = xi we mean that xi < X < xi + Δx, or X ∈ Ai. Then P(Q = xi) is given by:

P(Q = xi) = P(xi < X < xi + Δx) ≈ f(xi)Δx    (10.15)

as in Eq (10.13). Since Q is discrete, we may compute its entropy as:

H(Q) = Σ_{i=1}^{n} [−log₂ P(Q = xi)] P(Q = xi)    (10.16)

so that, from Eq (10.15),

H(Q) ≈ −Σ_{i=1}^{n} [log₂ (f(xi)Δx)] f(xi)Δx
     = −log₂(Δx) − Σ_{i=1}^{n} [log₂ f(xi)] f(xi)Δx    (10.17)

Some remarks are in order here:
1. In the limit as Δx → 0, the sum in Eq (10.17) becomes (in many cases of practical interest) the integral

−∫_{−∞}^{∞} [log₂ f(x)] f(x) dx    (10.18)

but this is not enough to prevent the entropy H(Q) from increasing without limit, because of the −log₂(Δx) term. Thus, not surprisingly, the exact (not approximate) entropy of a continuous random variable again turns out to be infinite, as noted earlier.
2. For non-zero Δx, the entropy of the quantized version of the continuous random variable X is finite, but there is residual quantization error; in the limit as Δx → 0, the quantization error vanishes but then the entropy becomes infinite.
3. This implied trade-off between quantization accuracy and entropy of a continuous random variable is a well-known issue in information theory: an infinite number of bits is required to specify a continuous random variable exactly; any finite-bit representation is achieved only at the expense of quantization error.
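This trade-off is easy to exhibit numerically. In the Python sketch below (an illustration, assuming scipy is available), a standard normal random variable is quantized with progressively finer Δx: H(Q) grows without bound, while H(Q) + log₂(Δx) settles at a finite value (the differential entropy introduced next):

```python
import math
from scipy import stats

for dx in (1.0, 0.1, 0.01):
    # quantize N(0,1) into bins of width dx over +/-10 (essentially all the mass)
    n = int(10 / dx)
    probs = [stats.norm.cdf((i + 1) * dx) - stats.norm.cdf(i * dx)
             for i in range(-n, n)]
    HQ = -sum(p * math.log2(p) for p in probs if p > 0)
    print(dx, round(HQ, 4), round(HQ + math.log2(dx), 4))
# H(Q) diverges as dx -> 0, but H(Q) + log2(dx) -> 0.5*log2(2*pi*e) = 2.0471 bits
```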
Differential Entropy
Let us now define a function h(X) such that

H(X) = log₂ h(X)    (10.19)

Then it follows from Eq (10.17) that, for the quantized random variable,

log₂ h(Q) ≈ −log₂(Δx) − Σ_{i=1}^{n} [log₂ f(xi)] f(xi)Δx    (10.20)

so that:

log₂ [h(Q)Δx] ≈ −Σ_{i=1}^{n} [log₂ f(xi)] f(xi)Δx    (10.21)

In the limit as Δx → 0, in the same spirit in which f(x)dx is considered the differential probability that X is in the interval [x, x + dx], we may similarly define the differential entropy H̃(X) of a continuous random variable, X, such that, by analogy with Eq (10.19),

H̃(X) = log₂ [h(X)dx]    (10.22)

Then we obtain the following result:

For the continuous random variable X, the differential entropy is defined as:

H̃(X) = ∫_{−∞}^{∞} [−log₂ f(x)] f(x) dx    (10.23)

an expression that is reminiscent of the definition of the entropy of a discrete random variable given in (10.4), with the sum replaced by the integral. Thus, even though the entropy of a continuous random variable is infinite, the differential entropy is finite, and it is to entropy what f(x)dx is to probability. And just as the discrete pdf f(xi) is fundamentally different from the continuous f(x), so is the entropy function fundamentally different from the differential entropy function.
Nevertheless, even though the proper continuous variable counterpart of the discrete random variable's entropy is the differential entropy defined in (10.23), we will still, for notational simplicity, use the same symbol H(X) for both, so long as we understand this to represent entropy in the discrete case and differential entropy when X is continuous. (We crave the indulgence of the reader to forgive this slight but standard abuse of notation.)
Example 10.4 (DIFFERENTIAL) ENTROPY OF A CONTINUOUS UNIFORM RANDOM VARIABLE
Compute the (differential) entropy of the continuous uniform random variable whose pdf is given by:

f(x) = 1/(b − a),  a < x < b;  0, otherwise    (10.24)

Solution:
The differential entropy is given by

H(X) = ∫_a^b [−log₂(1/c)] (1/c) dx    (10.25)

where c = (b − a). This simplifies to

H(X) = log₂ c    (10.26)

which should be compared with the entropy obtained for the discrete uniform random variable in Example 10.2.

Some remarks about this example:
1. In the limit as a → −∞ and b → ∞, Eq (10.24) becomes a model for a random variable about which nothing is known, whose range of possible values is −∞ < x < ∞. For such a variable, even the differential entropy goes to infinity, since c → ∞ above in this case.
2. If we know that X is restricted to take values in a finite interval [a, b], adding such constraining information reduces H(X) to a finite value which depends on the interval length, c, as shown above. The longer this interval (i.e., the more disperse f(x)), the higher the (differential) entropy.
3. Thus, in general, adding information about the random variable X reduces its entropy; conversely, to reduce entropy, one must add information.
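Eq (10.26) can also be checked against a library computation; scipy, for example, reports differential entropy in nats (a brief sketch, assuming scipy is available):

```python
import math
from scipy import stats

a, b = 2.0, 10.0                                       # illustrative interval
h_nats = stats.uniform(loc=a, scale=b - a).entropy()   # differential entropy, nats
print(h_nats / math.log(2), math.log2(b - a))          # both = log2(c) = 3.0 bits
```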
We may now summarize the key points regarding the entropy of a (continuous or discrete) random variable X:
1. If nothing is known about a random variable, X (making it infinitely uncertain), its entropy H = ∞; if it is known perfectly, with no uncertainty, H = 0;
2. Any information available about the behavior of the random variable in the form of some proper pdf, f(x), reduces the entropy from the absolute ignorance state with H = ∞ to the finite, but still non-zero, entropy associated with the random variable's pdf, f(x).
We now discuss how these concepts can be used to obtain useful probability models in the face of incomplete knowledge.

10.3 Maximum Entropy Principles for Probability Modeling

When only limited information (perhaps in the form of its range, mean, μ, variance, σ², or, more generally, the expectation of some function G(X)) is all that is available about a random variable X, clearly it is not possible to specify the full pdf f(x) uniquely, because many pdfs exist that have the same range, or mean or variance, or whatever partial information has been supplied. The required full but unknown pdf contains additional information over and above what is legitimately known. To postulate a full pdf from such partial information therefore requires that we incorporate extra information to fill in what is missing. The problem at hand may be stated as follows:

How should we choose an appropriate f(x) to use in representing the random variable X, given only a few of its characteristic parameters (μ, σ², range, . . .) and nothing else?

The maximum entropy principle states that the f(x) that adds the least amount of extra information (i.e., the one with maximum entropy) should be chosen. The resulting pdf, f(x), is then referred to as the maximally unpresumptive distribution because, of all the possible pdfs with the same characteristic parameters as those specified in the problem, f(x) is the least presumptive. Such a pdf is also called a maximum entropy model, and the most common ones will now be derived.

10.4 Some Maximum Entropy Models

The procedure for obtaining maximum entropy models involves posing the optimization problem:

max_{f(x)} {H(X) = E[−ln f(x)]}    (10.27)

(where we have chosen the entropy function representation in nats for convenience) and solving it subject to the known information as constraints, as we now illustrate for several cases.

10.4.1 Discrete Random Variable; Known Range

We begin with a random variable X for which the only known fact is that it can take on any of k discrete values x₁, x₂, . . . , x_k. What is an appropriate probability model for such a random variable?
Problem statement: Obtain f(xi), i = 1, 2, . . . , k, given only that

Σ_{i=1}^{k} f(xi) = 1.    (10.28)

The maximum entropy solution seeks to maximize

H(X) = Σ_{i=1}^{k} [−ln f(xi)] f(xi)    (10.29)

subject to Eq (10.28) as a constraint. This problem, and all subsequent ones, will be solved using principles of the calculus of variations (see, for example,2). The Lagrangian functional for this problem is obtained as:

Λ(f) = Σ_{i=1}^{k} [−ln f(xi)] f(xi) − λ [Σ_{i=1}^{k} f(xi) − 1]    (10.30)

where λ is a Lagrange multiplier. The optimum f and the optimum value for λ are obtained from the Euler equations:

∂Λ/∂f = 0    (10.31)
∂Λ/∂λ = 0    (10.32)

where the second equation merely recovers the constraint. In the particular case at hand, we have

∂Λ/∂f = −[f(xi) (1/f(xi)) + ln f(xi)] − λ = 0    (10.33)

which yields, upon simplification,

f(xi) = e^{−(1+λ)} = C    (10.34)

C is a constant to be determined such that Eq (10.28) is satisfied; i.e.,

Σ_{i=1}^{k} f(xi) = Σ_{i=1}^{k} C = 1    (10.35)

so that

C = 1/k    (10.36)

with the final result that the maximum entropy distribution for this discrete random variable X is:

f(xi) = 1/k;  i = 1, 2, 3, . . . , k    (10.37)

Thus: the maximum entropy principle assigns equal probabilities to each of the k outcomes of a discrete random variable when nothing else is known about the variable.
This is a result in perfect keeping with intuitive common sense; in fact, we have made use of it several times in our previous discussions. Note also that in Example 10.3, we saw that the Bernoulli random variable attains its maximum entropy for equiprobable outcomes.

2 Weinstock, R. (1974), Calculus of Variations, Dover Publications.
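The same conclusion can be reached by brute force: handing the constrained maximization in Eqs (10.28)-(10.29) to a numerical optimizer drives f toward the uniform pmf. The Python sketch below (illustrative only, assuming numpy and scipy are available) does exactly this for k = 5:

```python
import numpy as np
from scipy.optimize import minimize

k = 5

def neg_entropy(f):
    f = np.clip(f, 1e-12, 1.0)          # guard the log at the boundary
    return np.sum(f * np.log(f))        # -H(X), in nats

cons = [{"type": "eq", "fun": lambda f: np.sum(f) - 1.0}]   # Eq (10.28)
x0 = np.array([0.5, 0.2, 0.1, 0.1, 0.1])                    # any feasible start
res = minimize(neg_entropy, x0, bounds=[(0, 1)] * k, constraints=cons)
print(np.round(res.x, 4))   # ~[0.2 0.2 0.2 0.2 0.2], i.e., f(xi) = 1/k
```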

10.4.2 Discrete Random Variable; Known Mean

The random variable X in this case can take on discrete values 1, 2, 3, . . ., and we seek an appropriate f(xi) that is unknown except for

Σ_{i=1}^{∞} f(xi) = 1    (10.38)
Σ_{i=1}^{∞} xi f(xi) = μ    (10.39)

by maximizing the entropy, subject to these two equations as constraints.
In this case, the Lagrangian is:

Λ(f) = Σ_{i=1}^{∞} [−ln f(xi)] f(xi) − λ₁ [Σ_{i=1}^{∞} f(xi) − 1] − λ₂ [Σ_{i=1}^{∞} xi f(xi) − μ]    (10.40)

and the resulting Euler equations are:

∂Λ/∂f = −(ln f(xi) + 1) − λ₁ − λ₂ xi = 0    (10.41)

along with the other partial derivatives with respect to the Lagrange multipliers λ₁, λ₂ that simply recover the two constraints. From Eq (10.41), we obtain:

f(xi) = e^{−(1+λ₁)} e^{−λ₂ xi} = a b^{xi}    (10.42)

where the indicated constants a, b (functions of λ₁ and λ₂ as implied above) are determined by substituting (10.42) into the constraint equations, to obtain:

Σ_{x=1}^{∞} a b^x = 1    (10.43)

which, for |b| < 1, converges to give:

a [b/(1 − b)] = 1    (10.44)

Similarly,

Σ_{x=1}^{∞} x a b^x = μ    (10.45)

yields:

a [b/(1 − b)²] = μ    (10.46)

Solving (10.44) and (10.46) simultaneously for a and b now gives:

a = 1/(μ − 1)    (10.47)
b = 1 − 1/μ    (10.48)

To tidy things up, if we now let p = 1/μ (so that μ = 1/p), then

b = (1 − p)    (10.49)
a = p/(1 − p)    (10.50)

with the final result that the pdf we seek is given by:

f(x) = p(1 − p)^{x−1}    (10.51)

which is immediately recognizable as the pdf of the geometric random variable. Thus: the maximum entropy principle prescribes a geometric pdf for a discrete random variable with V_X = {1, 2, 3, . . .} and for which the mean μ = 1/p is known.
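Again, a numerical optimizer reproduces this prescription. In the sketch below (illustrative; the infinite support is truncated at N = 50, an assumption that carries negligible probability mass for the chosen mean), maximizing the entropy subject to Eqs (10.38)-(10.39) recovers the geometric pmf of Eq (10.51):

```python
import numpy as np
from scipy.optimize import minimize

mu, N = 4.0, 50                      # known mean; truncation point (assumption)
x = np.arange(1, N + 1)

def neg_entropy(f):
    f = np.clip(f, 1e-12, 1.0)
    return np.sum(f * np.log(f))

cons = [{"type": "eq", "fun": lambda f: f.sum() - 1.0},        # Eq (10.38)
        {"type": "eq", "fun": lambda f: (x * f).sum() - mu}]   # Eq (10.39)
res = minimize(neg_entropy, np.ones(N) / N, bounds=[(0, 1)] * N,
               constraints=cons)

p = 1 / mu
print(np.abs(res.x - p * (1 - p) ** (x - 1)).max())   # small: matches Eq (10.51)
```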

10.4.3 Continuous Random Variable; Known Range

In this case, X is a continuous random variable with V_X = {x : a < x < b}, for which we seek an appropriate f(x) that is unknown except for:

∫_a^b f(x) dx = 1    (10.52)

We will find f(x) by maximizing the (differential) entropy

H(X) = ∫_a^b [−ln f(x)] f(x) dx    (10.53)

subject to (10.52). The Lagrangian in this case takes the form:

Λ(f) = ∫_a^b [−ln f(x)] f(x) dx − λ [∫_a^b f(x) dx − 1]    (10.54)

giving rise to the Euler equations:

∂Λ/∂f = −(ln f(x) + 1) − λ = 0    (10.55)
∂Λ/∂λ = 0    (10.56)

again with the latter recovering the constraint. From (10.55) we obtain:

f(x) = e^{−(1+λ)} = c    (10.57)

a constant whose value is obtained from (10.52) as

∫_a^b c dx = c[b − a] = 1    (10.58)

or,

c = 1/(b − a)    (10.59)

Hence, the prescribed pdf is:

f(x) = 1/(b − a);  a < x < b    (10.60)

which is recognizable as the continuous uniform pdf. This, of course, is the continuous version of the result obtained earlier for the discrete random variable encountered in Section 10.4.1.

10.4.4 Continuous Random Variable; Known Mean

We seek a pdf f(x) that is unknown except for:

∫ f(x) dx = 1    (10.61)
∫ x f(x) dx = μ    (10.62)

The Lagrangian for maximizing the differential entropy in this case is:

Λ(f) = ∫ [−ln f(x)] f(x) dx − λ₁ [∫ f(x) dx − 1] − λ₂ [∫ x f(x) dx − μ]    (10.63)

The resulting Euler equations are:

∂Λ/∂f = −(ln f(x) + 1) − λ₁ − λ₂ x = 0    (10.64)

along with the constraints in (10.61) and (10.62). From (10.64) we obtain:

f(x) = e^{−(1+λ₁)} e^{−λ₂ x} = C₁ e^{−λ₂ x}    (10.65)

Substituting this back into (10.61) and (10.62) gives:

∫ C₁ e^{−λ₂ x} dx = 1
∫ C₁ x e^{−λ₂ x} dx = μ    (10.66)

which, when solved simultaneously for C₁ and λ₂, gives the result:

C₁ = 1/μ;  λ₂ = 1/μ    (10.67)

for V_X = {x : 0 ≤ x < ∞} (required for the integrals in (10.66) to be finite), so that (10.65) becomes:

f(x) = (1/μ) e^{−x/μ},  x ≥ 0;  0, otherwise    (10.68)

This is recognizable as the pdf of an exponential random variable (the continuous version of the result obtained earlier in Section 10.4.2). Thus: the maximum entropy principle prescribes an exponential pdf for the continuous random variable for which only the mean is known.
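One way to appreciate this result numerically: among positive random variables sharing the same mean, no gamma pdf beats the exponential's differential entropy. A quick scipy check (illustrative values; entropies in nats):

```python
from scipy import stats

mu = 2.0
print(stats.expon(scale=mu).entropy())     # exponential: 1 + ln(mu) = 1.6931 nats
for alpha in (2.0, 5.0):                   # gammas with the same mean mu
    print(alpha, stats.gamma(alpha, scale=mu / alpha).entropy())
# both gamma entropies fall below the exponential's value
```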

350

10.4.5

Random Phenomena

Continuous Random Variable; Known Mean and


Variance

We seek a pdf f (x) that is unknown except for:



f (x)dx = 1


xf (x)dx =


(x )2 f (x)dx = 2

(10.69)
(10.70)
(10.71)

Once more, we work with the Lagrangian, in this case:







(f ) =
[ ln f (x)]f (x)dx 1
f (x)dx 1 2
xf (x)dx



3
(x )2 f (x)dx 2
(10.72)

The Euler equations are:

= ln f (x) 1 1 2 x 3 (x )2 = 0
f

(10.73)

along with the three constraints in (10.69 10.71). Solving (10.73) gives:
f (x) = C1 e2 x e3 (x)

Substituting this back into the constraints and using the result:
)

2

eau =
a

(10.74)

(10.75)

upon some algebraic manipulations, yields:


C1

2
0;
1
2 2

(10.76)
(10.77)
(10.78)

giving as the nal result:




1
f (x) = e
2

(x)2
22

(10.79)

the familiar Gaussian pdf. Thus, when only the mean, \mu, and the variance, \sigma^2, are all that we legitimately know about a continuous random variable, the maximally unpresumptive distribution f(x) is the Gaussian pdf.
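The following sketch (an added illustration, not from the original text) checks this numerically: among several distributions sharing mean 0 and variance 1, the Gaussian has the largest differential entropy. The particular competitors chosen (uniform and Laplace) are arbitrary.

```python
import numpy as np
from scipy.stats import norm, uniform, laplace

# Three distributions, all with mean 0 and variance 1
h_norm = norm(0, 1).entropy()                      # Gaussian
w = np.sqrt(12.0)                                  # uniform width: variance = w^2/12 = 1
h_unif = uniform(loc=-w / 2, scale=w).entropy()
h_lap = laplace(scale=1 / np.sqrt(2)).entropy()    # Laplace variance = 2b^2 = 1

print(h_norm, h_unif, h_lap)   # ~1.419 > ~1.242 and ~1.347
# h_norm equals ln(sigma * sqrt(2*pi*e)), consistent with Exercise 10.6
```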

10.4.6 Continuous Random Variable; Known Range, Mean and Variance

We present, without proof or detailed derivations, that for X continuous, given the range (0,1), the mean, \mu, and variance, \sigma^2, the maximum entropy model is the beta B(\alpha, \beta) pdf:
$$f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1} \qquad (10.80)$$

10.5 Maximum Entropy Models from General Expectations

10.5.1 Single Expectations

Suppose that X is a random variable (discrete or continuous) defined on the space V_X and whose pdf f(x) is unknown, except that E[G(X)], the expected value of some (continuous) function G(X), is some known constant, \eta, i.e.

$$\sum_{i=1}^{n} G(x_i) f(x_i) = \eta \qquad (10.81)$$

for discrete X, or

$$\int G(x) f(x)\,dx = \eta \qquad (10.82)$$

for continuous X. Given this single piece of information, expressions for the resulting maximum entropy models for any G(X) will now be derived.

Discrete Case
The Lagrangian in the discrete case is given by:

$$\mathcal{J}(f) = \sum_{i=1}^{n} [-\ln f(x_i)]\,f(x_i) - \lambda \sum_{i=1}^{n} f(x_i) G(x_i) \qquad (10.83)$$

which is easily rearranged to give:

$$\mathcal{J}(f) = -\sum_{i=1}^{n} f(x_i) \ln\left[\frac{f(x_i)}{C e^{-\lambda G(x_i)}}\right] \qquad (10.84)$$

It can be shown that as presented in Eq (10.84), \mathcal{J}(f) \le 0, attaining its maximum value of 0 when

$$f(x_i) = C e^{-\lambda G(x_i)} \qquad (10.85)$$

This, then, is the desired maximum entropy model for any function G(X) of the discrete random variable, X; the constants C and \lambda are determined such that f(x_i) satisfies the constraint:

$$\sum_{i=1}^{n} f(x_i) = 1 \qquad (10.86)$$

and the supplied expectation information in Eq (10.81).


Continuous Case
Analogous to the discrete case, the Lagrangian in the continuous case is:

$$\mathcal{J}(f) = \int [-\ln f(x)]\,f(x)\,dx - \lambda \int f(x) G(x)\,dx \qquad (10.87)$$

which again rearranges to:

$$\mathcal{J}(f) = -\int f(x) \ln\left[\frac{f(x)}{C e^{-\lambda G(x)}}\right] dx \qquad (10.88)$$

It can also be shown that \mathcal{J}(f) in Eq (10.88) is maximized when:

$$f(x) = C e^{-\lambda G(x)} \qquad (10.89)$$

Again, this represents the maximum entropy model for any function G(X) of the continuous random variable X, with the indicated constants to be determined to satisfy the constraint:

$$\int f(x)\,dx = 1 \qquad (10.90)$$

and the given expectation in Eq (10.82).


It is a simple enough exercise (see Exercises 10.8 and 10.9) to establish the following results and hence confirm, from this alternative route, the maximum entropy results presented earlier in Section 10.4:

1. If G(X) = 0, indicating that nothing is known about the random variable, the pdf prescribed by Eq (10.85) is the discrete uniform distribution, and by Eq (10.89), the continuous uniform distribution.

2. If G(X) = X, so that the supplied information is the mean value, for discrete X, the pdf prescribed by Eq (10.85) is the geometric distribution; for continuous X, the pdf prescribed by Eq (10.89) is the exponential distribution.

3. If X is continuous and

$$G(X) = (X-\mu)^2 \qquad (10.91)$$

then the pdf prescribed by Eq (10.89) is the Gaussian distribution. (A numerical illustration of the general prescription in Eq (10.85) follows.)
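As an added illustration (not part of the original text), the sketch below applies Eq (10.85) numerically: for a discrete variable on {1, ..., k} with a known mean, it solves for the multiplier λ with a standard root finder, the normalizing constant C being absorbed by normalization. The support size k and the target mean are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import brentq

k = 10                       # assumed finite support {1, ..., k}
x = np.arange(1, k + 1)
target_mean = 3.0            # the supplied expectation, with G(X) = X

def mean_mismatch(lam):
    w = np.exp(-lam * x)     # unnormalized f(x_i) = C exp(-lam * G(x_i)), Eq (10.85)
    f = w / w.sum()          # normalization enforces Eq (10.86)
    return (x * f).sum() - target_mean

lam = brentq(mean_mismatch, -5.0, 5.0)   # solve the expectation constraint for lambda
w = np.exp(-lam * x)
f = w / w.sum()
print(lam, (x * f).sum())    # mean constraint satisfied; f has truncated-geometric form
```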

10.5.2 Multiple Expectations

In the event that information is available about the expectations of m functions G_j(X); j = 1, 2, \ldots, m, of the random variable, i.e.

$$\sum_{i=1}^{n} G_j(x_i) f(x_i) = \eta_j; \quad j = 1, 2, \ldots, m \qquad (10.92)$$

for discrete X, or

$$\int G_j(x) f(x)\,dx = \eta_j; \quad j = 1, 2, \ldots, m \qquad (10.93)$$


for continuous X, so that in each case, j are known constants, then the
Lagrangians are obtained as
 n

n
m



(f ) =
[ ln f (xi )]f (xi )
j
f (xi )Gj (xi ) j
(10.94)


i=1

j=1

m


(f ) =

[ ln f (x)]f (x)dx

i=1


j

j=1


f (x)Gj (x)dx j

(10.95)

It can be shown that these Lagrangians are maximized for pdfs given by:

$$f(x_i) = C e^{-\left[\sum_{j=1}^{m} \lambda_j G_j(x_i)\right]} \qquad (10.96)$$

for discrete X, and

$$f(x) = C e^{-\left[\sum_{j=1}^{m} \lambda_j G_j(x)\right]} \qquad (10.97)$$

for continuous X, generalizing the results in Eqs (10.85) and (10.89). These results are from a theorem by Boltzmann (1844-1906), the Austrian theoretical physicist credited with inventing statistical thermodynamics and statistical mechanics. The constant C in each case is the normalizing constant determined such that \int f(x)\,dx and \sum_i f(x_i) equal 1; the m Lagrange multipliers \lambda_1, \lambda_2, \ldots, \lambda_m are obtained from solving simultaneously the m equations representing the known expectations in Eqs (10.92) and (10.93).
The following are two applications of this set of results. Consider the case where, for a continuous random variable X,

$$G_1(X) = X \qquad (10.98)$$

$$G_2(X) = \ln X \qquad (10.99)$$

and

$$E[G_1(X)] = \alpha; \quad \alpha > 0 \qquad (10.100)$$

$$E[G_2(X)] = \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} \qquad (10.101)$$

then the pdf prescribed by Eq (10.97) is:

$$f(x) = C e^{-[\lambda_1 x + \lambda_2 \ln x]} = C x^{-\lambda_2} e^{-\lambda_1 x} \qquad (10.102)$$

and upon evaluating the constants, we obtain:

$$f(x) = \frac{1}{\Gamma(\alpha)}\,e^{-x} x^{\alpha-1} \qquad (10.103)$$

recognizable as the pdf for the Gamma random variable.
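The following sketch (added for illustration; the value of α is an arbitrary choice) recovers the constants numerically: it solves the two expectation constraints for λ₁ and λ₂ and confirms that they reduce Eq (10.102) to the Gamma pdf in Eq (10.103), i.e., λ₁ = 1 and λ₂ = 1 − α.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import fsolve
from scipy.special import digamma

alpha = 2.5                      # hypothetical value of the known constant

def constraints(lams):
    l1, l2 = lams
    pdf = lambda x: x ** (-l2) * np.exp(-l1 * x)       # Eq (10.102), unnormalized
    Z = quad(pdf, 0, np.inf)[0]                        # normalization: Z = 1/C
    Ex = quad(lambda x: x * pdf(x), 0, np.inf)[0] / Z
    Elnx = quad(lambda x: np.log(x) * pdf(x), 0, np.inf)[0] / Z
    return [Ex - alpha, Elnx - digamma(alpha)]         # Eqs (10.100)-(10.101)

# Numerical sketch only; a starting guess near the solution keeps quad well-behaved
l1, l2 = fsolve(constraints, x0=[1.0, -1.0])
print(l1, l2)    # -> 1.0 and 1 - alpha = -1.5, so f(x) = x^(alpha-1) e^(-x) / Gamma(alpha)
```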


For another continuous random variable X, with

$$G_1(X) = \ln X \qquad (10.104)$$

$$G_2(X) = \ln(1-X) \qquad (10.105)$$

and

$$E[G_1(X)] = \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} - \frac{\Gamma'(\alpha+\beta)}{\Gamma(\alpha+\beta)} \qquad (10.106)$$

$$E[G_2(X)] = \frac{\Gamma'(\beta)}{\Gamma(\beta)} - \frac{\Gamma'(\alpha+\beta)}{\Gamma(\alpha+\beta)} \qquad (10.107)$$

then the pdf prescribed in Eq (10.97) is:

$$f(x) = C e^{-[\lambda_1 \ln x + \lambda_2 \ln(1-x)]} = C x^{-\lambda_1} (1-x)^{-\lambda_2} \qquad (10.108)$$

and upon evaluating the constants, we obtain:

$$f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1} \qquad (10.109)$$

again recognizable as the pdf for a Beta random variable.
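A quick numerical confirmation of the expectations in Eqs (10.106) and (10.107) follows (an added check, with arbitrarily chosen α and β): the integrals of ln x and ln(1 − x) against the Beta pdf match the digamma expressions Γ'(·)/Γ(·) = ψ(·).

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import digamma
from scipy.stats import beta

a, b = 2.0, 3.0   # hypothetical alpha and beta

E_lnx = quad(lambda x: np.log(x) * beta.pdf(x, a, b), 0, 1)[0]
E_ln1mx = quad(lambda x: np.log(1 - x) * beta.pdf(x, a, b), 0, 1)[0]

print(E_lnx, digamma(a) - digamma(a + b))       # the two agree, Eq (10.106)
print(E_ln1mx, digamma(b) - digamma(a + b))     # likewise, Eq (10.107)
```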

10.6 Summary and Conclusions

This chapter has been concerned with the problem of how to determine appropriate (and complete) probability models when only partial information is available about the random variable in question. The first-principles approach to probability model development discussed in the earlier chapters (Chapter 8 for discrete random variables and Chapter 9 for the continuous type) is predicated on the availability of complete phenomenological information about the random variable of interest. When this is not the case, and only partial information is available, model development must be approached differently. This chapter has offered such an alternative approach, one based on the maximum entropy principle. The essence of this principle is that of all

the several pdfs whose characteristics are consistent with the available partial information, the one that adds the least amount of extraneous information (the least presumptive) should be chosen as an appropriate model. To realize this intuitively appealing concept fully in practice, of course, requires advanced optimization theory, much of which is not expected to be familiar to the average reader. Still, enough of the derivation details have been presented to allow the reader to appreciate how the results came to be.

It is interesting to note now in retrospect that all the results presented in this chapter have involved familiar pdfs encountered previously in earlier discussions. This should not give the impression that these are the only useful maximum entropy distributions; neither should this be construed as implying that all pdfs encountered previously have maximum entropy interpretations. The scope of coverage was designed first to demonstrate to (and inspire confidence in) the reader that this approach, even though somewhat esoteric, does in fact lead to results that make sense. Secondly, this coverage was also designed to offer a different perspective of some of these familiar models. For instance, as a model for residence time distribution in chemical reaction engineering, we have seen the exponential distribution arise from chemical engineering arguments (Chapter 2), probability arguments (Chapter 9), and now from maximum entropy considerations. The same is true for the geometric distribution as a model for polymer chain length distribution (see Application Problem 10.12). But the application of this principle in practice extends well beyond the catalog of familiar results shown here; for example, see Phillips et al. (2004)³ for an application to the problem of modeling geographic distributions of species, a critical problem in conservation biology.

With the discussion in this chapter behind us, we have now completed our study of probability models and their development. The discussion in the next chapter is a case study illustrating how probability models are developed, validated and applied in solving the complex and important practical problem of optimizing the effectiveness of in-vitro fertilization.

The main points and results of this chapter are summarized in Table 10.1.

REVIEW QUESTIONS
1. What are the three axioms employed in quantifying the information content of
the statement P (X = xi ) = pi ?
2. In what ways are the axioms of information content akin to the axioms of
probability encountered in Chapter 4?
3. What is the entropy of a discrete random variable X with a pdf f (x)?
³S. J. Phillips, M. Dudík and R. E. Schapire, (2004) "A Maximum Entropy Approach to Species Distribution Modeling," Proc. Twenty-First International Conference on Machine Learning, 655-662.


TABLE 10.1: Summary of maximum entropy probability models

Known Random Variable Characteristics | Maximum Entropy Distribution | Probability Model
--------------------------------------|------------------------------|------------------
Discrete; binary (0,1) | Bernoulli, Bn(0.5) | f(0) = 0.5; f(1) = 0.5
Discrete; range: i = 1, 2, \ldots, k | Uniform, U_D(k) | f(x_i) = 1/k
Discrete; mean, \mu | Geometric, G(p); p = 1/\mu | f(x) = p(1-p)^{x-1}
Continuous; range: a \le x \le b | Uniform, U(a,b) | f(x) = 1/(b-a)
Continuous; mean, \mu | Exponential, E(\beta); \beta = \mu | f(x) = (1/\beta)\,e^{-x/\beta}
Continuous; mean, \mu; variance, \sigma^2 | Gaussian, N(\mu, \sigma^2) | f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]
Continuous; mean, \mu; variance, \sigma^2; range: 0 \le x \le 1 | Beta, B(\alpha, \beta) | f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}
G(X) = 0; a \le x \le b; E[G(X)] = 0 | Uniform, U(a,b) | f(x) = 1/(b-a)
G_1(X) = X; G_2(X) = \ln X; E[G_1(X)] = \alpha; E[G_2(X)] = \Gamma'(\alpha)/\Gamma(\alpha) | Gamma, \gamma(\alpha, 1) | f(x) = \frac{1}{\Gamma(\alpha)}\,x^{\alpha-1} e^{-x}
G_1(X) = \ln X; G_2(X) = \ln(1-X); E[G_1(X)] = \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} - \frac{\Gamma'(\alpha+\beta)}{\Gamma(\alpha+\beta)}; E[G_2(X)] = \frac{\Gamma'(\beta)}{\Gamma(\beta)} - \frac{\Gamma'(\alpha+\beta)}{\Gamma(\alpha+\beta)} | Beta, B(\alpha, \beta) | f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}


4. Why is entropy as defined for discrete random variables not very useful for continuous random variables?
5. What is the corresponding entropy concept for continuous random variables?
6. What is "quantization" and why must there always be a trade-off between quantization accuracy and entropy of a continuous random variable?
7. What is the differential entropy of a continuous random variable?
8. What is the entropy of a random variable about which nothing is known?
9. What is the entropy of a variable with no uncertainty?
10. What effect does any additional information about a random variable have on its entropy?
11. Provide a succinct statement of the primary problem of this chapter.
12. What is the maximum entropy principle for determining full pdfs when only
partial information is available?
13. What is the maximum entropy distribution for a discrete random variable X for which the only known fact is that it can take on any of k discrete values x_1, x_2, \ldots, x_k?
14. What is the maximum entropy distribution for a discrete random variable X for
which the mean is known and nothing else?
15. What is the maximum entropy distribution for a continuous random variable X
for which the only known fact is its range, VX = {x : a < x < b}?
16. What is the maximum entropy distribution for a continuous random variable X
for which the mean is known and nothing else?
17. What is the maximum entropy distribution for a continuous random variable X
for which the mean, \mu, and variance, \sigma^2, are known and nothing else?
18. What is the maximum entropy distribution for a continuous random variable
X for which the range, (0,1), mean, \mu, and variance, \sigma^2, are known and nothing else?
19. Which two equations arise from a theorem of Boltzmann and how are they used
to obtain maximum entropy distributions?


EXERCISES
10.1 Using the principles of differential calculus, establish that the entropy of the Bernoulli random variable, shown in Eq (10.11), i.e.,

$$H(X) = -(1-p)\log_2(1-p) - p\log_2 p$$

is maximized when p = 0.5.
10.2 Determine the maximum entropy distribution for the Binomial random variable X, the total number of "successes" in n Bernoulli trials, when all that is known is that, with each trial, there are exactly two possible outcomes. (Hint: $X = \sum_{i=1}^{n} X_i$, where $X_i$ is a Bernoulli random variable.)
10.3 Determine the entropy for the geometric random variable, G(p), with the pdf

$$f(x) = pq^{x-1}; \quad x = 1, 2, \ldots$$

and compare it to the entropy obtained for the Bernoulli random variable in Example 10.3 in the text.
10.4 Show that the entropy for the exponential random variable, E(\beta), with pdf

$$f(x) = \frac{1}{\beta}\,e^{-x/\beta}; \quad 0 < x < \infty$$

is given by:

$$H(X) = 1 + \ln\beta \qquad (10.110)$$

10.5 Show that the entropy for the Gamma random variable, \gamma(\alpha, \beta), with pdf

$$f(x) = \frac{1}{\beta^{\alpha}\Gamma(\alpha)}\,e^{-x/\beta} x^{\alpha-1}; \quad 0 < x < \infty$$

is given by:

$$H(X) = \alpha + \ln\beta + \ln\Gamma(\alpha) + (1-\alpha)\frac{\Gamma'(\alpha)}{\Gamma(\alpha)} \qquad (10.111)$$

Directly from this result, write an expression for the entropy of the \chi^2(r) random variable.
10.6 Show that the entropy for the Gaussian N(\mu, \sigma^2) random variable with pdf

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

is given by:

$$H(X) = \ln\left(\sigma\sqrt{2\pi e}\right) \qquad (10.112)$$

and hence establish that the entropy of a Gaussian random variable depends only on \sigma and not \mu. Why does this observation make sense? In the limit as \sigma \to 0, what happens to the entropy, H(X)?


10.7 Show that the entropy for the Lognormal L(\alpha, \beta^2) random variable with pdf

$$f(x) = \frac{1}{x\beta\sqrt{2\pi}} \exp\left[-\frac{(\ln x - \alpha)^2}{2\beta^2}\right]; \quad 0 < x < \infty$$

is given by:

$$H(X) = \ln\left(\beta\sqrt{2\pi e}\right) + \alpha \qquad (10.113)$$

Compare this with the expression for the entropy of the Gaussian random variable in Exercise 10.6, Eq (10.112). Why does the entropy of the Lognormal random variable depend linearly on \alpha while the entropy of the Gaussian random variable does not depend on the corresponding parameter, \mu, at all?
10.8 The maximum entropy distribution for a random variable X for which G(X) and its expectation, E[G(X)], are specified, was given in Eq (10.85) for discrete X, and Eq (10.89) for continuous X, i.e.,

$$f(x) = \begin{cases} C e^{-\lambda G(x_i)}; & \text{for discrete } X \\ C e^{-\lambda G(x)}; & \text{for continuous } X \end{cases}$$

Determine f(x) completely (i.e., determine C and \lambda explicitly) under the following conditions:
(i) G(x_i) = 0; i = 1, 2, \ldots, k, a discrete random variable for which nothing is known except its range;
(ii) G(X) = 0; a < X < b;
(iii) G(X) = X; x = 1, 2, \ldots; E[G(X)] = \mu;
(iv) G(X) = X; 0 < x < \infty; E[G(X)] = \mu.
10.9 For the continuous random variable X for which

$$G(X) = (X-\mu)^2$$

is specified along with its expectation,

$$E[G(X)] = E[(X-\mu)^2] = \sigma^2,$$

the maximum entropy distribution was given in Eq (10.89) as:

$$f(x) = C e^{-\lambda G(x)}$$

Show that the constants in this pdf are given by:

$$C = \frac{1}{\sigma\sqrt{2\pi}} \qquad (10.114)$$

$$\lambda = \frac{1}{2\sigma^2} \qquad (10.115)$$

and thus establish the result that: the maximally unpresumptive distribution for a random variable X for which only the mean, \mu, and variance, \sigma^2, are legitimately known, is the Gaussian pdf:




$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \qquad (10.116)$$

You may find the following identity useful:

$$\int_{-\infty}^{\infty} e^{-au^2}\,du = \sqrt{\frac{\pi}{a}} \qquad (10.117)$$

10.10 Given the following information about a continuous random variable, X,

$$G_1(X) = X; \quad \text{and} \quad G_2(X) = \ln X$$

along with

$$E[G_1(X)] = \alpha; \ \alpha > 0; \quad \text{and} \quad E[G_2(X)] = \frac{\Gamma'(\alpha)}{\Gamma(\alpha)}$$

it was stated in the text that the maximum entropy pdf prescribed by Eq (10.97) is:

$$f(x) = C e^{-[\lambda_1 x + \lambda_2 \ln x]} = C x^{-\lambda_2} e^{-\lambda_1 x} \qquad (10.118)$$

Determine the constants C, \lambda_1 and \lambda_2 and hence establish the result given in Eq (10.103).
10.11 Revisit Exercise 10.10 for the case where the information available about the random variable X is:

$$G_1(X) = \frac{X}{\beta}; \quad \text{and} \quad G_2(X) = \ln\left(\frac{X}{\beta}\right)$$

along with

$$E[G_1(X)] = \alpha; \ \alpha > 0; \quad \text{and} \quad E[G_2(X)] = \frac{\Gamma'(\alpha)}{\Gamma(\alpha)}$$

Obtain an explicit expression for the maximum entropy pdf in this case.

APPLICATION PROBLEMS
10.12 In certain polymerization processes, the polymer product is made by the
sequential addition of monomer units to a growing chain. At each step, after the
chain has been initiated, a new monomer may be added, propagating the chain,
or a termination event can occur, stopping the growth; whether the growing chain
propagates or terminates is random. The random nature of the propagation and
termination events is responsible for polymer products having chains of variable
lengths. As such, because it is a count of the number of monomer units in the chain,
X, the length of a particular polymer chain, is a discrete random variable.
Now consider the case where the only information available about a particular process is the kinetic rate of the termination reaction, given as R_T per min, which can be interpreted as implying that an average of R_T chain terminations occur per min. By considering the reciprocal of R_T, i.e.,

$$p = \frac{1}{R_T}$$

as the probability that a termination reaction will occur, obtain f(x), a maximum


entropy distribution for the polymer chain length, in terms of p. This distribution
is often known as the most probable chain length distribution.
10.13 As introduced very briefly in Chapter 2, the continuous stirred tank reactor (CSTR), a ubiquitous piece of equipment used in the chemical industry to carry out a wide variety of chemical reactions, consists of a tank of volume V liters, through which the reactant stream flows continuously at a rate of F liters/sec; the content is vigorously stirred to ensure uniform mixing, and the product is continuously withdrawn from the outlet at the same rate, F liters/sec. Because of the vigorous mixing, the amount of time any particular fluid element spends in the reactor (the reactor residence time) varies randomly, so that there is in fact not a single value for residence time, X, but a distribution of values. Clearly, the residence time affects the productivity of the reactor, and characterizing it is a central concern in chemical reaction engineering.

Now, a stream continuously fed at a rate F liters/sec through a reactor of volume V liters implies an average residence time, in secs, given by

$$\tau = \frac{V}{F}$$

Given only this information, obtain a maximum entropy distribution for the residence time in the CSTR, and compare it with the result in Section 2.1.2 of Chapter 2.
10.14 Integrins are transmembrane receptors that link the actin cytoskeleton of a cell to the extracellular matrix (ECM). This connection, which constantly and dynamically reorganizes in response to mechanical, chemical, and other environmental cues around the cell, leads to lateral assembly of integrins into small stationary focal complexes or clusters of integrins. Integrin clustering, an extremely important process in cell attachment and migration, is a stochastic process that results in heterogeneous populations of clusters that are best characterized with distributions. One of the many characteristics of an integrin cluster is its shape. Because integrin clusters grow or shrink in different directions depending on the orientation and tension of the actin cytoskeleton, the shape of an integrin cluster provides useful information concerning the forces acting on a particular adhesive structure.

The shape of integrin clusters is often idealized as an ellipse and quantified by its eccentricity, \varepsilon, the ratio of the distance between the foci of the representative ellipse to the length of its major axis. This quantity has the following properties:

1. It is scaled between 0 and 1, i.e., 0 \le \varepsilon \le 1;
2. \varepsilon \to 1 for elongated clusters; for circular clusters, \varepsilon \to 0;
3. Physiologically, integrin clusters in adherent cells tend to be more elongated than circular; non-adherent cells tend to have more circular integrin clusters.

Given that the average and variance of cluster eccentricity are often known for a particular cell (and, in any event, these can be determined experimentally), obtain a maximum entropy distribution to use in representing this aspect of integrin clustering.

Data obtained by Welf (2009)⁴ from Chinese Hamster Ovary (CHO) cells stably expressing the integrin \alpha IIb \beta 3 indicated an average eccentricity \bar{\varepsilon} = 0.92 and
⁴Welf, E. S. (2009). Integrative Modeling of Cell Adhesion Processes, PhD Dissertation, University of Delaware.


variance, \sigma^2 = 0.003, for the particular collection of integrin clusters studied. From this information, obtain a specific theoretical pdf that characterizes the eccentricity distribution of this experimental population of integrin clusters and determine the mode of the distribution. Plot the pdf; and from the shape of this distribution, comment on whether you expect the specific clusters under study to belong to adherent or non-adherent cells.
10.15 Mee (1990)⁵ presented the following data on the wall thickness (in ins) of cast aluminum cylinder heads used in aircraft engine cooling jackets. The mean and variance of the wall thickness are therefore considered as known. If a full pdf is to be prescribed to characterize this important property of the manufactured cylinder heads, use the maximum entropy principle to postulate one. Even though there are only 18 data points, plot the theoretical pdf versus a histogram of the data and comment on the model fit.

0.223  0.228  0.214  0.193  0.223  0.213  0.218  0.215  0.233
0.201  0.223  0.224  0.231  0.237  0.217  0.204  0.226  0.219

10.16 The total number of occurrences of a rare event in an interval of time (0, T), when the event occurs at a mean rate, \lambda, per unit time, is known to be a Poisson random variable. However, given that an event has occurred in this interval, and without knowing exactly when the event occurred, determine a maximum entropy distribution for the time of occurrence of this lone event within this interval.

Now let X be the time of occurrence in the normalized unit time interval, (0,1). Using the just-obtained maximum entropy distribution, derive the distribution for the log-transformed variable,

$$Y = \ln\left(\frac{1}{X}\right) \qquad (10.119)$$

Interpret your result in terms of what you know about the Poisson random variable and the inter-arrival times of Poisson events.

⁵Mee, R. W., (1990). "An improved procedure for screening based on a correlated, normally distributed variable," Technometrics, 32, 331-337.

Chapter 11
Application Case Studies II: In-Vitro Fertilization

11.1 Introduction ........................................................ 364
11.2 In-Vitro Fertilization and Multiple Births .......................... 365
     11.2.1 Background and Problem Definition ........................... 365
     11.2.2 Clinical Studies and Recommended Guidelines ................. 367
            Factors affecting live-birth and multiple-birth rates ....... 367
            Prescriptive Studies ........................................ 368
            Determining Implantation Potential .......................... 368
            Qualitative Optimization Studies ............................ 369
11.3 Probability Modeling and Analysis .................................. 371
     11.3.1 Model Postulate ............................................. 371
     11.3.2 Prediction .................................................. 372
     11.3.3 Estimation .................................................. 373
11.4 Binomial Model Validation .......................................... 375
     11.4.1 Overview and Study Characteristics .......................... 375
     11.4.2 Binomial Model versus Clinical Data ......................... 377
11.5 Problem Solution: Model-based IVF Optimization and Analysis ........ 384
     11.5.1 Optimization ................................................ 385
     11.5.2 Model-based Analysis ........................................ 386
     11.5.3 Patient Categorization and Theoretical Analysis of Treatment Outcomes ... 387
11.6 Sensitivity Analysis ............................................... 392
     11.6.1 General Discussion .......................................... 392
     11.6.2 Theoretical Sensitivity Analysis ............................ 394
11.7 Summary and Conclusions ............................................ 395
     11.7.1 Final Wrap-up ............................................... 395
     11.7.2 Conclusions and Perspectives on Previous Studies and Guidelines ... 397
REFERENCES .............................................................. 398
PROJECT ASSIGNMENT ...................................................... 399

It is a test of true theories not only to account for but to predict phenomena.
William Whewell (1794-1866)

"The mathematician," Tobias Dantzig (1884-1956) noted with pitch-perfect precision in his captivating book, Number, "may be compared to a designer of garments, who is utterly oblivious of the creature whom his garments may fit. To be sure, his art originated in the necessity for clothing such creatures, but this was long ago; to this day a shape will occasionally appear which will fit into the garment as if the garment had been made for it. Then there is no end of surprise and of delight!"

Such a shape appears in this chapter in the form of in-vitro fertilization

(IVF), an iconic 20th century creature of whose future existence the designers of the garments of probability distributions (specifically, the binomial distribution) were utterly oblivious a few centuries ago. Our surprise and delight do not end upon discovering just how well the binomial pdf model fits, as if custom-made for IVF analysis; there is the additional and completely unexpected bonus discovery that, with no modification or additional embellishments, this model also is perfectly suited to solving the vexing problem of maximizing the chances of success while simultaneously minimizing the risk of multiple births and total failure. This chapter is the second in the series of case studies designed to illustrate how the probabilistic framework can be used effectively to solve complex, real-life problems involving randomly varying phenomena.

11.1 Introduction

When Theresa Anderson learned that she was pregnant with quintuplets, she was dumbfounded. She had signed up to be a surrogate mother for an infertile couple and during an in vitro fertilization procedure doctors introduced five embryos into her womb. "They told me there was a one in 30 chance that one would take," she says. Instead, all five took. The 26-year-old mother of two endured a difficult pregnancy and delivered the boys in April [2005] at a Phoenix hospital. The multiple births made headlines across the country as a feel-good tale. But they also underscore a reality of the fertility business: Many clinics are implanting more than the recommended number of embryos in their patients, raising the risks for women.

So began an article by Sylvia Pagán Westphal that appeared in the Wall Street Journal (WSJ) on October 7, 2005. In-vitro fertilization (IVF), the very first of a class of procedures collectively known as "Assisted Reproductive Technology" (ART), was originally developed specifically to treat infertility caused by blocked or damaged fallopian tubes; it is now used to treat a variety of infertility problems, with impressive success. With IVF, eggs and sperm are combined in a laboratory to fertilize in vitro (literally "in glass"). The fertilized eggs are later transferred into the woman's uterus, where, in the successful cases, implantation, embryo development, and ultimately, a live birth, will occur as with all other normal pregnancies. Since 1978 when the first so-called "test-tube baby" was born, IVF has enabled an increasing number of otherwise infertile couples to experience the joy of having children. So successful in fact has IVF been that its success rates now compare favorably to natural pregnancy rates in any given month, especially when the woman is under 40 years of age and there are no sperm problems.
With the advent of oocyte donation (Yaron, et al, 1997; Reynolds, et al,
2001), where the eggs used for IVF have been donated typically by younger


women, the once formidable barriers to success due to age or ovarian status are no longer as serious. Today, ART with oocyte donation is one of the most successful treatment programs for infertility. Recent studies have reported pregnancy rates as high as 48% per retrieval: out of a total of 6,936 IVF procedures using donated oocytes carried out in 1996 and 1997, 3,320 resulted in pregnancies and 2,761 live-birth deliveries, an astonishing success rate (in terms of deliveries per retrieval) of 39.8% (Reynolds, et al., 2001).
However, as indicated by the WSJ article, and a wide variety of other
clinical studies, including the Reynolds, et al., 2001, study noted above, IVF
patients are more likely to have multiple-infant births than women who conceive naturally; furthermore, these multiple pregnancies are known to increase
the risks for a broad spectrum of problems, ranging from premature delivery
and low birth weight, to such long-term disabilities as cerebral palsy, among
surviving babies. For example, Patterson et al, 1993 report that the chance of
a twin pregnancy resulting in a baby with cerebral palsy is 8 times that of a
singleton birth.
The vast majority of pregnancies with three or more babies are due to
IVF and other such assisted reproductive technologies (fewer than 20%
arise from natural conception); and such multiple births contribute disproportionately to infant and maternal morbidity and mortality rates, with corresponding increased contributions to health care costs. Consequently, many
national and professional organizations in the U.S., Canada and other western
countries have provided guidelines on the number of embryos to transfer, in
an attempt to balance the desire for success against the risk of multiple births.
The primary objective of this chapter is to examine the fundamental problem of multiple births in IVF from a probabilistic perspective. In what follows, first we review a few representative clinical studies and recommended
IVF practice guidelines, and then develop a probability model for IVF and
validate it against clinical data. Finally, with the probability model as the basis, we pose and then solve the optimization problem of maximizing the
chances for success while simultaneously reducing the risk of multiple births.
The various consensus qualitative recommendations and guidelines are then
interpreted in the context of the probability model and the optimal solution
obtained from it.

11.2 In-Vitro Fertilization and Multiple Births

11.2.1 Background and Problem Definition

The primary issue confronting fertility physicians (and patients undergoing IVF treatment) is twofold:

• Which embryos should be selected for transfer, and

• How many embryos should be transferred.


These questions remain exceedingly difficult to answer because of an intrinsic characteristic of IVF treatment: uncertainty. Whether any particular embryo will implant and develop into pregnancy and ultimately lead to the birth of a healthy child is fundamentally uncertain. The reasons for this fact are many, ranging from egg factors such as the source of the egg (donor, self, fresh, cryopreserved, etc.), embryo morphology, stimulation protocols, and laboratory specifications; to uterine factors such as patient age, medical history, and other determinants of uterus status. Thus, in addition to the uncertainties associated with the implantation potential of each individual candidate embryo, there are also uncertainties associated with gestation, fetal development and finally childbirth.

While normal fertile couples are not entirely immune to many of these uncertainties, IVF patients, by definition, are particularly prone to poor prognosis so that, from the very start, the chances of successful treatment are typically not very high. To improve the odds of success and ensure an acceptable pregnancy and live birth delivery rate, the prevalent strategy has been to implant multiple embryos. However, this increase in success rate occurs simultaneously with the undesirable consequence of increased risk of multiple pregnancies, along with the associated problems, and consequent implications for health care costs.

Clearly then, the defining problem of modern IVF practice is how to balance the risks of multiple births judiciously against the desire to increase the chances that each treatment cycle will be successful: an optimization problem involving the determination, for each patient, in any particular IVF cycle, of the optimum number of embryos to transfer to maximize the chances of a singleton live birth while simultaneously reducing the risk of multiple births.

A multitude of clinical studies have been conducted in an attempt to find a practical, implementable solution to this problem; and the essence of the current state-of-the-art is captured by the following summary statements which precede a recent set of guidelines published in 2006 in the J Obstet Gynaecol Can:
The desired outcome of infertility treatment is the birth of a
healthy child. As multifetal gestations are associated with higher
rates of morbidity and mortality, their disproportionately high
occurrence after IVF-ET [in vitro fertilization-embryo transfer]
should be minimized. The transfer of fewer embryos per attempt
should be employed as primary prevention. However, indiscriminate application of limitations upon the number of embryos transferred would be inappropriate until accurate predictors of successful implantation can be determined. Decisions on the number of
embryos to transfer should be based upon prognosis determined
by variables including the womans age, prior outcomes, and the
number and quality of embryos available for transfer, and should


be made to minimize the risk of multifetal gestation while maintaining a high probability of healthy live birth.¹
The extent of the results of these studies (reviewed shortly) is a collection of sound recommendations, based on careful analyses of specific clinical data sets to be sure, but no explicit solution to the optimization problem. To the best of our knowledge, to date, there is in fact no systematic, explicit, quantitative solution to the IVF optimization problem whereby the optimum number of embryos to transfer can be prescribed concretely for each individual patient. Such a solution is developed in this chapter.

11.2.2 Clinical Studies and Recommended Guidelines

The literature on the topic of Assisted Reproduction, even when restricted to papers that focus explicitly on the issue of IVF and multiple births, is quite extensive. This fact alone makes an exhaustive review next to impossible within the context of this chapter. Nevertheless, it is possible to discuss a few key papers that are particularly relevant to the objectives of this chapter (the application of probability models to problems of practical importance).

Factors affecting live-birth and multiple-birth rates

The first group of papers, exemplified by Schieve et al., 1999; Engmann, et al., 2001; Reynolds et al., 2001; Jansen, 2003; and Vahratian, et al., 2003, use retrospective analyses of various types of clinical IVF data to determine what factors influence live-birth rates and the risk of multiple births. The main conclusions in each of these studies were all consistent and may be summarized as follows:

1. Patient age and the number of embryos transferred independently affected the chances for live birth and multiple birth.

2. In general, live-birth rates increased if more than 2 embryos were transferred.

3. The number of embryos needed to achieve maximum live-birth rates varied with age. For younger women (age < 35 years), maximum live-birth rates were achieved with only two embryos transferred; for women of age > 35 years, live-birth rates were lower in general, increasing if more than 2 embryos were transferred.

4. Multiple-birth rates generally increased with increased number of embryos transferred, but in an age-dependent manner, with younger women (age < 35 years) generally showing higher multiple-birth risks than older women.
¹Guidelines for the Number of Embryos to Transfer Following In Vitro Fertilization, J Obstet Gynaecol Can 2006; 28(9): 799-813.


5. Special Cases: For IVF treatments using donor eggs, the age of the donor
rather than maternal age was more important as a determinant of the
risk of multiple-birth (Reynolds, et al. 2001). Also, success rates are
lower in general with thawed embryos than with fresh ones (Vahratian,
et al.,2003).
These conclusions, supported concretely by clinical data and rigorous statistical analyses, are, of course, all perfectly in line with common sense.
Prescriptive Studies

The next group, exemplified by Austin, et al., 1996; Templeton and Morris, 1998; Strandel et al., 2000; and Thurin et al., 2004, are more prescriptive in that each in its own way sought to provide explicit guidance, also from clinical data, on the number of embryos to transfer in order to limit multiple births. A systematic review in Pandian et al., 2004, specifically compares the effectiveness of elective two-embryo transfer with single-embryo transfer and transfers involving more than 2 embryos. The Thurin et al., 2004, study is unique, being representative of a handful of randomized prospective studies in which, rather than analyze clinical data "after the fact" (as with other studies), they collected their own data (in real-time) after assigning the treatment applied to each patient (single-embryo transfer versus double-embryo transfer) in randomized trials. Nevertheless, even though these studies were all based on different data sets from different clinics, utilized different designs, and employed different methods of analysis, the conclusions were all remarkably consistent:
1. The risk of multiple births increased with increasing number of (good quality) embryos transferred, with patients younger than 40 at higher risk;

2. The rate of multiple births can be reduced significantly by transferring no more than two embryos;

3. By performing single-embryo transfers (in selected cases), the rate of multiple births can be further reduced, although at the expense of a reduced rate of live births;

4. Consecutive single-embryo transfers (one fresh embryo transfer followed, in the event of a failure to achieve term pregnancy, by one additional frozen-and-thawed embryo transfer) achieve the same significant reduction possible with single-embryo transfer without lowering the rate of live births substantially below that achievable with a one-time double-embryo transfer.


Determining Implantation Potential

Central to the results of these prescriptive studies is an underlying recognition that, as an aid in rational decision-making, a reasonable estimate of the quality of each transferred embryo (or equivalently, its implantation potential) is indispensable. This is especially so for elective single-embryo transfers where, for obvious reasons, success requires using the best quality embryos with the highest implantation potential.

Quantitatively determining the implantation potential of embryos clearly remains a very difficult proposition, but Bolton, et al., 1989, and Geber and Sampaio, 1999, have proposed techniques for carrying out this task, specifically to facilitate the process of embryo selection for maximizing the success rate of IVF treatment. Thus, while non-trivial, it is indeed possible to determine the clinical chances that the transfer of a single embryo will lead to a live birth. This fact is important for the model-based analysis that follows shortly.
Qualitative Optimization Studies

Deciding on the appropriate number of embryos to transfer in each IVF cycle is, as noted earlier, really an optimization problem, because the fundamental issue boils down to maximizing IVF success rates while simultaneously minimizing the risk of multiple births, objectives that are fundamentally conflicting. Very few studies have taken such an explicitly technical "optimization" perspective of the problem. In Yaron, et al., 1997, for example, the authors present a retrospective study of 254 oocyte-donation IVF patients. Even though the word "optimal" appears in the paper's title, the text of the paper itself contains no more than a brief discussion at the tail end of a collection of suggestions arising from the data analysis; there is no optimization (formal or informal), and no concrete prescription one way or another in the paper.

In Combelles, et al., 2005, the authors specifically focus on IVF patients older than 40 years of age; and through a retrospective study of data on 863 patients covering a period of more than 5 years, the authors arrived at the conclusion that for this group of patients, the optimum number of embryos to transfer is 5. The finding was based on the following results from their data analysis:

1. Transferring 5 or more embryos resulted in significantly increased pregnancy and live birth rates compared with transferring fewer than 5;

2. Transferring more than 5 embryos did not confer any significant additional clinical outcome.

We revisit these results later.
The retrospective study reported in Elsner et al., 1997, involving a large
population of patients, 2,173 in all, from the clinic operated by the authors,
includes an extensive qualitative discussion regarding how best to maximize


IVF success rates without unduly increasing the risk of multiple births. The
key conclusions of the study are:
1. Ideally, a single embryo transfer would be optimal if the implantation
rates (per embryo) were as high as 50%;
2. Embryos should be screened, and only the few with high potential implantation rates should be selected for transfer.
3. No more than two embryos should be transferred per attempt. To oset the potential lowering of the IVF success rate, the rest should be
cryopreserved for subsequent frozen-thaw embryo transfer.
Because it is comprehensive and detailed, but especially because its presentation is particularly appropriate, the results from this study are used to
validate the model presented in the next section.
The final category to discuss are the guidelines and policy recommendations developed by professional organizations and various governmental agencies in western nations, particularly the US, Canada, the United Kingdom, and Sweden. For example, in 1991, the British Human Fertilisation and Embryology Authority (HFEA) imposed a legal restriction on the number of allowable embryos transferred to a maximum of 3; Sweden, in 1993, recommended a further (voluntary) reduction in the number of embryos transferred from 3 to 2. The American Society of Reproductive Medicine recommended in 1999² that no more than two embryos should be transferred for women under the age of 35 who produce healthy embryos; three for those producing poor embryos. A further tightening of these (voluntary) recommendations in 2004 now suggests that women younger than 35 years old with good prognoses consider single-embryo transfers, with no more than two embryos transferred only under extraordinary circumstances. For women aged 35-37 years the recommendation is two embryos for those with good prognoses and no more than 3 for those with poorer prognoses. The Canadian guidelines issued in 2006, referred to earlier in Section 11.2.1, are similar, but more detailed and specific. Because they are consistent with, and essentially capture and consolidate, all the results of the previously highlighted studies into a single set of cohesive points, the key aspects are presented below:

1. Individual IVF-ET (embryo transfer) programs should evaluate their own data to identify patient-specific, embryo-specific, and cycle-specific determinants of implantation and live birth in order to develop embryo transfer policies that minimize the occurrence of multifetal gestation while maintaining acceptable overall pregnancy and live birth rates.

2. In women under the age of 35 years, no more than two embryos should
²American Society of Reproductive Medicine. Guidelines on number of embryos transferred. Birmingham, Alabama, 1999.

be transferred in a fresh IVF-ET cycle. In women under the age of 35 years with excellent prognoses, the transfer of a single embryo should be considered.

3. In women aged 35 to 37 years, no more than three embryos should be transferred in a fresh IVF-ET cycle. In those with high-quality embryos and favorable prognoses, consideration should be given to the transfer of one or two embryos in the first or second cycle.

4. In women aged 38 to 39 years, no more than three embryos should be transferred in a fresh IVF-ET cycle. In those with high-quality embryos and favorable prognoses, consideration should be given to the transfer of two embryos in the first or second cycle.

5. In women over the age of 39 years, no more than four embryos should be transferred in a fresh IVF-ET cycle.

6. In exceptional cases, when women with poor prognoses have had multiple failed fresh IVF-ET cycles, consideration may be given to the transfer of more embryos than recommended above in subsequent fresh IVF-ET cycles.

We now develop a theoretical probability model of IVF, and validate it against the Elsner et al. clinical data. The validated model is then used to provide an explicit quantitative expression for determining the theoretical optimum number of embryos to transfer. Finally, the model results are compared to those from the just-reviewed clinical studies.

11.3 Probability Modeling and Analysis

11.3.1 Model Postulate

Consider the following central characteristics of the IVF process:


• The fertilization and transfer of each embryo ultimately either results in a live birth (considered a "success"), or not;

• Which of these two mutually exclusive outcomes is the final result of the transfer of a single embryo is uncertain, with such factors as the nature and quality of the embryo, the patient age, other indicators of uterine condition, etc., jointly affecting the outcome in ways not easily quantified explicitly (at least with currently available technology);

• The transfer of n embryos at once in one IVF treatment cycle is tantamount to n simultaneous attempts, with the primary objective of improving the chances that at least one embryo will implant and lead to a live birth;

• How many (and which ones) of the n transferred embryos will ultimately lead to live births is also uncertain.
If the transfer of n embryos can be considered as n independent (Bernoulli) trials under identical conditions; and if the overall effect of the collection of factors that influence the ultimate outcome of each single trial (the transfer of a single embryo) is captured in the parameter p representing the probability that a particular single embryo will lead to a successful pregnancy; then observe that X, the number of live births in a delivered pregnancy following an IVF treatment cycle involving the transfer of n embryos, is a binomial random variable whose pdf is as given in Chapter 8, i.e.:

$$f(x) = \binom{n}{x} p^x (1-p)^{n-x} \qquad (11.1)$$

an expression of the probability of obtaining x live-born babies from n embryos. When x = 1, the live birth is said to result in a "singleton," the most desirable outcome; a multiple-birth is said to occur when x = 2 (fraternal twins), or 3 (triplets), or 4 (quadruplets), etc., up to and including n. How this postulated model matches up with real clinical data is examined shortly.
The characteristics of the binomial random variable and its model have been discussed in Chapter 8 and the reader may wish to pause at this point to review these. Within the context of IVF, the parameter p has a very specific physiological interpretation: it is what is referred to in Jansen, 2003, as "a woman's total chance for a live birth from one retrieval." We will refer to it in the rest of this chapter as the single embryo probability of success (or SEPS) parameter. It is sometimes referred to as the "embryo implantation potential" in the ART literature, indicative of its characteristic as a composite of both embryo and uterine properties. If this parameter is known, even approximately (see the discussion to follow about the sensitivity of the model results to the degree of accuracy to which p is determined), then the mathematical model in Eqn (11.1) allows us to carry out a wide variety of theoretical analyses regarding IVF, including outcome prediction, estimation (patient characterization), and optimization.

11.3.2 Prediction

Consider, for example, a case where the combined patient/embryo conditions are characterized by the SEPS parameter p = 0.2 (indicating a 20% chance of "success" for each embryo). The binomial model allows us to say the following about the transfer of n = 5 embryos, for instance:

1. Because E(X) = np for the binomial random variable X, in this particular case, E(X) = 1, implying that the expected outcome of this IVF treatment cycle is 1 live birth;


TABLE 11.1: Theoretical distribution of probabilities of possible outcomes of an IVF treatment with 5 embryos transferred and p = 0.2

x (no. of live births in a delivered pregnancy) | f(x) (probability of occurrence) | Expected total no. of patients (out of 1000) with pregnancy outcome x
0 | 0.328 | 328
1 | 0.410 | 410
2 | 0.205 | 205
3 | 0.051 | 51
4 | 0.006 | 6
5 | 0.000 | 0

2. Since the theoretical variance, \sigma^2 = np(1-p) = 0.8 (so that the standard deviation, \sigma = 0.89), the general implication is that there is a fair amount of variability associated with the expected outcomes in this specific treatment scenario. In fact,

3. The full probability distribution can be computed as shown in Table 11.1, indicating a 32.8% chance that the IVF treatment will not succeed in producing a child, but a somewhat higher 41.0% chance of a singleton; a 20.5% chance of twins, a 5.1% chance of triplets, and less than 1% chance of quadruplets or quintuplets. A common alternative interpretation of the indicated probability distribution is shown in the last column: in a population of 1,000 "identical" patients undergoing the same treatment, under essentially identical conditions, as a result of the transfer of 5 embryos to each patient, 328 patients will produce no live births, 410 will have singletons, 205 will have twins, 51 will have triplets, 6 will have quadruplets, and none will have quintuplets.

4. From this table we see that there is more than a 99% chance that 0 \le x \le 3, with the following implication: while the expected outcome is a singleton, the actual outcome is virtually guaranteed to be anything from a complete failure to a triplet and everything in between; it is highly unlikely to observe any other outcome. (These numbers are easily reproduced computationally, as shown in the sketch below.)
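The following Python sketch (added here as an illustration) reproduces Table 11.1 directly from the binomial pdf in Eq (11.1):

```python
from scipy.stats import binom

n, p = 5, 0.2                      # embryos transferred; SEPS parameter
for x in range(n + 1):
    fx = binom.pmf(x, n, p)        # Eq (11.1)
    print(x, round(fx, 3), round(1000 * fx))   # outcome, probability, patients per 1000

print(binom.mean(n, p), binom.var(n, p))       # E(X) = np = 1; variance = np(1-p) = 0.8
```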

11.3.3 Estimation

The practical utility of the binomial model for IVF clearly depends on knowing the lone model parameter p that characterizes the probability of a single embryo transfer leading to a successful live birth. In the absence of reliable technology for determining an appropriate value directly from physiological measurements, this parameter value must then be determined from clinical data, with best results when the data sets are generated from carefully designed experiments.


Consider, for example, the following statement taken from the WSJ article mentioned earlier:

In 1999, based on results from over 35,000 IVF treatments, the Centers for Disease Control and Prevention reported that between 10% and 13% of women under 35 who had three embryos introduced got pregnant with triplets.

This statement translates as follows: for the women in this study, n = 3 and 0.1 < P(X = 3) < 0.13. From the binomial model in Eqn (11.1), with an unknown p and n = 3, we know that

$$P(X = 3) = f(3) = p^3 \qquad (11.2)$$

and upon substituting the limiting values of 0.1 and 0.13 for the probability of obtaining triplets, we immediately obtain

$$p = [0.46, 0.51] \qquad (11.3)$$

as the corresponding estimates of p. This assumes, of course, that this group of women is reasonably "homogeneous" in the sense that, while not necessarily identical, the relevant individual physiological characteristics are similar. The women participating in this study are therefore characterized (on average) by 0.46 < p < 0.51, with the implication that for this category of women (under the age of 35) there is a 46-51% chance of a single embryo leading to a live birth, a relatively high IVF success rate.
More generally, one can use clinical data records of the following type: (i) The patients: a cohort of N_n patients, each receiving the same number of transferred embryos, n; (ii) The results: after the IVF treatment, \eta_n(1) is the total number of singleton births, \eta_n(2) the total number of twins, or, in general, \eta_n(x) is the total number of "x-births" (x = 3 for triplets; x = 4 for quadruplets, etc.). Provided that all the patients in the cohort group are similarly characterized with a common SEPS parameter p, then, as discussed fully in Part IV, the maximum likelihood estimate of p for the group (say, \hat{p}_n) is given by:

$$\hat{p}_n = \frac{\text{Total number of live births}}{\text{Total number of transferred embryos}} \qquad (11.4)$$

$$= \frac{\sum_{x=1}^{n} x\,\eta_n(x)}{n N_n} \qquad (11.5)$$

Thus, for example, one of the entries in the data set found in Table I of Elsner, et al., 1997, indicates that 661 patients each received 3 embryos, resulting in 164 singletons, 74 twins, and 10 triplets, with no higher order births. In our notation, n = 3, N_3 = 661, \eta_3(1) = 164, \eta_3(2) = 74, and \eta_3(3) = 10, so that the estimate of p for this cohort is given by:

$$\hat{p} = \frac{164 + 2 \times 74 + 3 \times 10}{3 \times 661} = 0.172 \qquad (11.6)$$

Note the very important assumption that all 661 patients in this cohort group have "identical" (or at least essentially similar) characteristics. We shall have cause to revisit this data set and these implied assumptions in the next section.
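As a quick added check, Eq (11.5) applied to this cohort takes only a few lines of Python:

```python
counts = {1: 164, 2: 74, 3: 10}      # eta_3(x): singletons, twins, triplets
n, N = 3, 661                        # embryos per patient; patients in cohort

live_births = sum(x * c for x, c in counts.items())
p_hat = live_births / (n * N)        # Eq (11.5)
print(live_births, p_hat)            # 342, ~0.172
```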

11.4 Binomial Model Validation

Before proceeding to use the binomial model for IVF optimization, we wish first to validate the model against clinical data available in the literature. Primarily because of how they are reported, the data sets presented in the Elsner, et al., 1997, study (briefly referred to in the previous subsection) are structurally well-suited to the binomial model validation exercise. But this was not a study designed for model validation; otherwise the design would have required more control for extraneous sources of variability within each cohort group. Nevertheless, one can still put these otherwise rich data sets to the best use possible, as we now show.

11.4.1 Overview and Study Characteristics

The data sets in question are from a retrospective study of 2,173 patients on which fresh and frozen-thawed embryo transfers were performed in the authors' own clinic over a 42-month period from September 1991 to March 1995. A total number of 6,601 embryos were transferred, ranging from 1 to 6 embryos per transfer. Most importantly, the data are available for cohort groups of N_n patients, receiving n = 1, 2, \ldots, 6 embryos; and on \eta_n(x), the number of patients with pregnancy outcome x (x = 1 for singletons, 2 for twins, 3 for triplets, etc.), presented separately for each cohort group, making them structurally ideal for testing the validity of the binomial model. Table 11.2 shows the relevant data arranged appropriately for our purposes (by cohort groups according to embryos received, from 1 through 6).
For each cohort group n = 1, 2, 3, 4, 5, 6, the estimates of the probability of success are obtained from the data as \hat{p}_1 = 0.097; \hat{p}_2 = 0.163; \hat{p}_3 = 0.172; \hat{p}_4 = 0.149; \hat{p}_5 = 0.111; \hat{p}_6 = 0.125, for an overall probability of success for the entire study group \hat{p} = 0.154. Some important points to note:

• These values are the same as the "embryo implant" value computed by Elsner, et al.;

• Although the overall group average is 0.154, the values for each cohort group range from a low of 0.097 for those receiving a single embryo to a high of 0.172 for those receiving 3 embryos.

• As noted in the paper, the value 0.097 is significantly lower than the numbers computed for the 2-, 3-, and 4-embryo cohort groups (which also means that it is significantly lower than the overall group average of 0.154). The implication of this last point therefore is that one cannot assume a uniform value of p for the entire study involving 6,601 embryos; it also raises the question of whether even the computed \hat{p}_i for each cohort group can be assumed to be uniform for the entire group (especially groups with large numbers of embryos involved, such as the 3- and 4-embryo cohort groups). This issue is addressed directly later.

TABLE 11.2: Elsner, et al. data of outcomes of a 42-month IVF treatment study

x (delivered pregnancy outcome) | \eta_1(x) | \eta_2(x) | \eta_3(x) | \eta_4(x) | \eta_5(x) | \eta_6(x) | \eta_T(x) (total no. of patients with pregnancy outcome x)
0     | 205 | 288 | 413 | 503 | 28 | 2 | 1439
1     | 22  | 97  | 164 | 207 | 13 | 1 | 504
2     | 0   | 17  | 74  | 84  | 5  | 1 | 181
3     | 0   | 0   | 10  | 32  | 1  | 0 | 43
4     | 0   | 0   | 0   | 6   | 0  | 0 | 6
5     | 0   | 0   | 0   | 0   | 0  | 0 | 0
Total | 227 | 402 | 661 | 832 | 47 | 4 | 2173

(Here \eta_n(x) is the number of patients receiving n = 1, 2, \ldots, 6 embryos whose delivered pregnancy outcome is x.)

11.4.2 Binomial Model versus Clinical Data

On the basis of the estimated group probabilities, p̂_n, the binomial probability model for each group is obtained as in Eq (11.1):

    f_n(x) = \binom{n}{x}\,\hat{p}_n^{\,x}\,(1-\hat{p}_n)^{n-x}    (11.7)

providing the probability of obtaining pregnancy outcome x = 0, 1, 2, ..., 6 for each cohort group receiving n embryos. Now, given N_n, the number of patients in each cohort group (referred to as the number of "cycles" in the original paper), we can use the model to predict the expected number of patients receiving n embryos that eventually have x = 0, 1, 2, ..., 6 as the delivered pregnancy outcome, η̂_n(x), as follows:

    \hat{\eta}_n(x) = f_n(x)\,N_n    (11.8)

The result is shown in Table 11.3, with a graph comparing the model prediction to the data shown in Fig 11.1.
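The computation behind Table 11.3 is easy to reproduce. A minimal Python sketch (assuming the data of Table 11.2 and Eqs (11.7)-(11.8); the names are ours, not the book's):

from math import comb

def predicted_counts(n, N_n, p_hat, x_max=6):
    # Expected patient counts per Eq (11.8): eta_hat_n(x) = f_n(x) * N_n,
    # where f_n(x) is the Bi(n, p_hat) pmf of Eq (11.7)
    def f(x):
        return comb(n, x) * p_hat**x * (1 - p_hat)**(n - x) if x <= n else 0.0
    return [f(x) * N_n for x in range(x_max + 1)]

# n = 3 cohort: 661 patients, p_hat = 0.172 (from Eq (11.6))
print([round(c, 3) for c in predicted_counts(3, 661, 0.172)])
# -> roughly [375.2, 233.8, 48.6, 3.4, 0.0, 0.0, 0.0], matching Table 11.3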
While the model prediction shows reasonable agreement with the overall data, there are noticeable discrepancies, most notably the over-estimation of the number of singletons and the consistent under-estimation of the number of multiple births, especially for the two largest cohort groups, those receiving 3 and 4 embryos. The primary source of these discrepancies is the questionable assumption of uniform p within each cohort group. Is it really realistic, for example, to expect all 832 patients in the cohort group that received 4 embryos (and the 832 × 4 total embryos transferred in this group) to have similar values of p? In fact, this question was actually (if unintentionally) answered in the study (recall that the objective of the study was not really the determination of implantation potential for cohort groups).
When the data are segregated by age, coarsely, into just two sets (the "younger" set for patients ≤ 36 years old, and the "older" set for patients ≥ 37 years old, as was done in Tables II and III of the original paper and summarized here in Table 11.4 for convenience), the wide variability in the values of the single embryo probability of success parameter, p, is evident.
TABLE 11.3: Binomial model prediction of Elsner, et al. data in Table 11.2

Delivered     Predicted no. of patients receiving n = 1, 2, ..., 6 embryos
pregnancy     with pregnancy outcome x                                        Total,
outcome, x    η̂_1(x)    η̂_2(x)    η̂_3(x)    η̂_4(x)    η̂_5(x)   η̂_6(x)    T̂(x)
0             204.981   281.629   375.226   436.357   26.098   1.795   1326.09
1              22.019   109.691   233.836   305.603   16.293   1.539    688.98
2                   0    10.681    48.575    80.261    4.069   0.550    144.13
3                   0         0     3.363     9.369    0.508   0.105     13.34
4                   0         0         0     0.410    0.032       0      0.45
5                   0         0         0         0        0       0         0
Total             227       402       661       832       47       4      2173


FIGURE 11.1: Elsner data versus binomial model prediction (number of patients versus pregnancy outcome, x).

TABLE 11.4: Elsner data stratified by age, indicating variability in the probability of success estimates

Embryos      Younger (≤ 36 yrs)     Older (≥ 37 yrs)       Overall
rec'd (n)    Number   p̂ est.       Number   p̂ est.       Number   p̂ est.
1               131    0.145           96    0.031           227    0.097
2               246    0.211          156    0.087           402    0.163
3               432    0.184          229    0.150           661    0.172
4               522    0.160          310    0.128           832    0.149
5                26    0.131           21    0.086            47    0.111
6                 2    0.083            2    0.167             4    0.125


There are several important points to note here. First, observe the less obvious fact that for each cohort group, n = 1, 2, ..., 6, the overall p̂ estimate is naturally a weighted average of the values estimated for each sub-group ("younger" and "older"); as one would expect, the weight in each case is the fractional contribution of each sub-group to the total number. Second, and more obvious, is how widely variable the estimates of p are across each cohort group: for example, for the group receiving n = 2 embryos, 0.087 < p̂ < 0.211, with the combined group value of 0.163 almost twice the value estimated for the "older" sub-group. This latter observation underscores a very important point regarding the use of this particular data set for our model validation exercise: within the context of IVF, the binomial model is an individual patient model that predicts the probabilities of various pregnancy outcomes for a specific patient, given her characteristic parameter, p. However, such a parameter (at least in light of currently available technology) can only be estimated from clinical data collected from many patients. Obtaining reasonable estimates therefore requires carefully designed studies involving only patients with a reasonable expectation of having similar characteristics. Unfortunately, even though comprehensive, and with just the right kind of detail required for our purposes here, the Elsner data sets come from a retrospective study; it is therefore not surprising if many patients in the same cohort group do in fact have different implantation potential characteristics.
One way to account for such non-uniform within-group characteristics is, of course, to repeat the modeling exercise for each data set separately, using age-appropriate estimates of p for each cohort sub-group. The results of such an exercise are shown in Figs 11.2 and 11.3. While Fig 11.3 shows a marked improvement in the agreement between the model and the "older" sub-group data, the similarity of the model-data fit in Fig 11.2 to that in Fig 11.1 indicates that, even after such stratification by age, significant non-uniformities still exist.
There are many valid reasons to expect significant non-uniformities to persist in the "younger" sub-group: virtually all clinical studies on the effect of age on IVF outcomes (e.g., Schieve et al., 1999; Jansen, 2003; and Vahratian, et al., 2003) recognize the age group < 29 years to be different in characteristics from the 29-35 years age group. Even for the "older" sub-group, it is customary to treat the 40-44 years group differently. The data set could thus use a further stratification to improve sub-group uniformity. Unfortunately, only the broad binary "younger"/"older" stratification is available in the Elsner et al. data set. Nevertheless, to illustrate the effect of just one more level of stratification, consider the following postulates (a computational sketch follows the list):

1. The n = 3 cohort group of 661 patients, already stratified into the "younger" 432 (p̂ = 0.184) and "older" 229 (p̂ = 0.150), is further stratified as follows: the "younger" 432 separated into 288 with p = 0.100 and 144 with p = 0.352 (maintaining the original weighted average value of p̂ = 0.184); and the "older" 229 divided into 153 with p = 0.100 and the remaining 76 with p = 0.250 (also maintaining the same original weighted average value of p̂ = 0.150);

2. The n = 4 cohort group of 832 patients, already stratified into the "younger" 522 (p̂ = 0.160) and "older" 310 (p̂ = 0.128), is further stratified as follows: the "younger" 522 separated into 348 with p = 0.08 and 174 with p = 0.320; and the "older" 310 into 207 with p = 0.08 and the remaining 103 with p = 0.224 (in each case maintaining the original respective weighted average values);

3. The n = 5 cohort group of 47 patients, with only the "younger" 26 (p̂ = 0.131) group separated into 17 with p = 0.06 and the remaining 9 with p = 0.265 (again maintaining the original weighted average value of p̂ = 0.131).
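Computationally, each stratified prediction is just a sum of binomial predictions, one per postulated sub-group. A minimal sketch for the n = 3 cohort under postulate 1 (assuming the predicted_counts helper sketched earlier):

# Sum binomial predictions over the four postulated n = 3 sub-groups
subgroups = [(288, 0.100), (144, 0.352),   # the "younger" 432
             (153, 0.100), (76, 0.250)]    # the "older" 229
totals = [0.0] * 7
for N_sub, p_sub in subgroups:
    for x, count in enumerate(predicted_counts(3, N_sub, p_sub)):
        totals[x] += count
print([round(t, 1) for t in totals])  # stratified prediction for x = 0..6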
Upon using this simple stratification of the Elsner data, the results of the stratified model compared with the data are shown first in tabular form in Table 11.5, then in Figs 11.4 and 11.5, respectively, for the stratified "younger" and "older" data, and in Fig 11.6 for the consolidated data. The agreement between the (stratified) model and the data is quite remarkable, especially in light of all the possible sources of deviation of the clinical data from ideal binomial random variable characteristics.

The final conclusion therefore is as follows: given appropriate parameter estimates for the clinical patient population (even very approximate estimates obtained from non-homogeneous subgroups), the binomial model matched the clinical data quite well. Of course, as indicated by the parameter estimates in Table 11.4, the value of p for the patient population in the study is not constant but is itself a random variable. This introduces an additional component to the issue of model validation.


FIGURE 11.2: Elsner data ("Younger" set) versus binomial model prediction.

FIGURE 11.3: Elsner data ("Older" set) versus binomial model prediction.

TABLE 11.5: Stratified binomial model prediction of Elsner, et al. data

Delivered      Total number of patients with pregnancy outcome x
pregnancy    Younger (≤ 36 yrs)    Older (≥ 37 yrs)        Overall
outcome, x    Data   T̂_y(x)        Data   T̂_o(x)        Data    T̂(x)
0              846      816          593      566         1439    1382
1              349      399          155      199          504     598
2              130      118           51       43          181     161
3               31       24           12        6           43      30
4                3        2            3        1            6       3
5                0        0            0        0            0       0
Total         1359     1359          814      814         2173    2173

The strategy of data stratification by p that we have employed here really constitutes a manual (and ad hoc) attempt at dealing with this additional component indirectly. We have opted for this approach here primarily for the sake of simplicity. A more direct (and more advanced) approach to the data analysis will involve postulating an additional probability model for p itself, which, when combined with the individual patient binomial model, will yield a mixture distribution model (as illustrated in Section 9.1.6 of Chapter 9). In this case, the appropriate model for p is the Beta distribution, and the resulting mixture model will be the Beta-Binomial model (see Exercise 9.28 at the end of Chapter 9). Such a Beta-Binomial model analysis of the Elsner data is offered as a Project Assignment at the end of the chapter.

Finally, it is important to note that none of this invalidates the binomial model; on the contrary, it reinforces the fact that the binomial model is a single-patient model, so that for the mixed population involved in the Elsner et al. clinical study, the value of p is better modeled with a pdf of its own, to capture explicitly how p itself is distributed in the population.


FIGURE 11.4: Elsner data ("Younger" set) versus stratified binomial model prediction.

FIGURE 11.5: Elsner data ("Older" set) versus stratified binomial model prediction.

FIGURE 11.6: Complete Elsner data versus stratified binomial model prediction.
We will now proceed to use the binomial model for analysis and optimization.

11.5 Problem Solution: Model-based IVF Optimization and Analysis

For any specified p, the binomial model provides a quantitative means of analyzing how the probability of each pregnancy outcome, x, varies with n, the number of embryos transferred at each treatment cycle. In particular, we are concerned with the following three probabilities:

1. P_0 = P(X = 0), the probability of an unsuccessful treatment cycle that produces no live birth;
2. P_1 = P(X = 1), the probability of obtaining a singleton (the most desirable pregnancy outcome); and
3. P_MB = P(X > 1), the probability of obtaining multiple births (with no particular distinction in terms of the actual number, once it is greater than 1).


Our interest in these specific probabilities should be obvious: at each IVF cycle, the objective is to reduce the first and last probabilities while maximizing the second. From the binomial model, these probabilities are given explicitly as:

    P_0 = (1-p)^n    (11.9)

    P_1 = np(1-p)^{n-1}    (11.10)

    P_{MB} = P(X > 1) = 1 - (P_0 + P_1) = 1 - (1-p)^n - np(1-p)^{n-1}    (11.11)

Note that these three probabilities are constrained to satisfy the expression:

    1 = P_0 + P_1 + P_{MB}    (11.12)

with the all-important implication that any one of these probabilities increases
(or decreases) at the expense of the others. Still, each probability varies with
n in a distinctive manner that can be exploited for IVF optimization, as we
now show.
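These three probabilities are straightforward to tabulate for any (n, p) pair; a small Python sketch (the function name is ours):

def outcome_probabilities(n, p):
    # P0, P1 and PMB of Eqs (11.9)-(11.11), for n embryos and SEPS parameter p
    P0 = (1 - p)**n                  # no live birth, Eq (11.9)
    P1 = n * p * (1 - p)**(n - 1)    # a singleton, Eq (11.10)
    PMB = 1.0 - P0 - P1              # multiple births, Eq (11.11)
    return P0, P1, PMB

# For example, p = 0.5 and n = 5 (cf. Section 11.5.3):
print([round(P, 4) for P in outcome_probabilities(5, 0.5)])
# -> [0.0312, 0.1562, 0.8125]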

11.5.1 Optimization

In the most general sense, the optimum number of embryos to transfer in any IVF cycle is that number n* which simultaneously minimizes P_0, maximizes P_1, and also minimizes P_MB. From the model equations, however, observe that (a) P_0 is a monotonically decreasing function of n, with no minimum for finite n; (b) although not as obvious, P_MB has no minimum because it is a monotonically increasing function of n; but, fortunately, (c) P_1 does in fact have a maximum. However, the most important characteristic of these probabilities is the following: by virtue of the constraint in Eq (11.12), maximizing P_1 also simultaneously minimizes the combined sum of the undesirable probabilities (P_0 + P_MB)!

We are therefore faced with the fortunate circumstance that the IVF optimization problem can be stated mathematically simply as:

    n^* = \arg\max_n \left[ np(1-p)^{n-1} \right]    (11.13)

The resulting solution, the optimum number of embryos to transfer which maximizes the probability of obtaining a singleton, also simultaneously minimizes the combined probability of the undesirable side effects.
The closed-form solution to this problem is obtained via the usual methods of differential calculus as follows. Since whatever maximizes

    f_n(1) = np(1-p)^{n-1}    (11.14)

also maximizes ln f_n(1), solving

    \frac{d \ln f_n(1)}{dn} = \frac{1}{n} + \ln(1-p) = 0    (11.15)


for n immediately yields the desired solution,

    \frac{1}{n^*} = \ln\left(\frac{1}{1-p}\right)    (11.16)

with the following implications:

• Given p, the probability that a particular single embryo will lead to a successful pregnancy, the optimum number of embryos to transfer during IVF is given by the expression in Eq (11.16) rounded to the nearest integer.

• A plot of n* as a function of p is shown in Fig 11.7: both the actual continuous value and the corresponding quantized value (rounded to the nearest integer). The indicated reference lines show, as an example, that for patients for whom p = 0.3, the optimum number of embryos to transfer is 3.

FIGURE 11.7: Optimum number of embryos as a function of p (continuous value, n*, and quantized value, n*_quant).
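Eq (11.16) is trivial to evaluate numerically. The following sketch (names ours) reproduces the reference point indicated in Fig 11.7:

from math import log

def optimal_embryos(p):
    # n* per Eq (11.16): 1/n* = ln(1/(1-p)), rounded to the nearest integer
    n_star = 1.0 / log(1.0 / (1.0 - p))
    return max(1, round(n_star))

print(optimal_embryos(0.30))  # 3, as indicated by the reference lines in Fig 11.7
print(optimal_embryos(0.18))  # 5 (cf. the "poor prognosis" discussion below)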

11.5.2 Model-based Analysis

The general binomial pdf in Eq (11.1), and the more specific expressions derived from it for the probabilities of direct interest to IVF treatment, shown in Eqs (11.9), (11.10) and (11.11), provide some insight into the probabilistic characteristics of IVF treatment. For the most desirable pregnancy outcome, x = 1, the singleton live birth, Fig 11.8 shows the complete surface plot of

    f_n(1) = np(1-p)^{n-1}    (11.17)

as a function of n and p. Note the general nature of the surface, and in particular the distinctive ridge formed by the maxima of this function. The following are some important characteristics of IVF that this figure reveals.
As indicated by the lines sweeping from left to right in the figure, for any given n embryos transferred, there is a corresponding patient SEPS parameter p for which the probability of obtaining a singleton is maximized. Furthermore, as n increases (from back to front), the corresponding maximal p is seen to decrease, indicating that transferring small numbers of embryos works best only for patients with high probabilities of success, while transferring large numbers of embryos works best for patients for whom the probability of success is relatively low. It also shows that for those patients with relatively high probabilities of success (for example, young patients under 35 years of age), transferring large numbers of embryos is counterproductive: the probability of obtaining singletons in these cases is remarkably low across the board (see the flat surface in the bottom right-hand corner of the figure) because the conditions overwhelmingly favor multiple births over singletons. All this, of course, is in perfect keeping with current thought and practice; but what is provided by Eq (11.17) and Fig 11.8 is quantitative.

The complementary observation, from the lines sweeping from back to front, is that for a given single embryo probability of success, p, there is a corresponding number of embryos to transfer that will maximize the probability of obtaining a singleton. Also, as p increases (from left to right), this optimum number is seen to decrease. This, of course, is what was shown quite precisely in Fig 11.7.

Finally, when the optimum number of embryos, n*, is transferred for each appropriate value of the SEPS parameter, p (the mathematical equivalent of walking along the ridge of the mountain-like surface in the figure), the corresponding theoretical maximum probability of obtaining a singleton increases with p in the manner shown explicitly in Fig 11.9. The indicated "elbow" discontinuity occurring at p = 0.5 is due to the fact that for p < 0.5, n* > 1 and f_{n*}(1) involves integer powers of p; but for p ≥ 0.5, n* = 1, so that f_{n*}(1) = p, a straight line with slope 1.
For the sake of completeness, Figures 11.10 and 11.11 show the surface plots, corresponding to Fig 11.8, respectively for

    f_n(0) = (1-p)^n    (11.18)

and

    f_n(m) = 1 - (1-p)^n - np(1-p)^{n-1}    (11.19)

FIGURE 11.8: Surface plot of the probability of a singleton, f(1), as a function of p and the number of embryos transferred, n.

FIGURE 11.9: The (maximized) probability of a singleton, f*(1), as a function of p when the optimum integer number of embryos is transferred.


FIGURE 11.10: Surface plot of the probability of no live birth, f(0), as a function of p and the number of embryos transferred, n.

FIGURE 11.11: Surface plot of the probability of multiple births, f(m), as a function of p and the number of embryos transferred, n.


11.5.3 Patient Categorization and Theoretical Analysis of Treatment Outcomes

We now return to Fig 11.7 and note that it allows us to categorize IVF patients on the basis of p (and, by extension, the optimum prescribed number of embryos to transfer) as follows (a small categorization sketch in code follows the list).

1. "Good prognosis" patients: p ≥ 0.5.
For this category of patients, n* = 1, with the probability of obtaining a singleton f*(1) = p ≥ 0.5.

2. "Medium prognosis" patients: 0.25 ≤ p < 0.5.
For this category of patients, n* = 2 or 3, with the probability of obtaining a singleton 0.42 < f*(1) < 0.5.

3. "Poor prognosis" patients: 0.15 ≤ p < 0.25.
For this category of patients, 4 ≤ n* ≤ 6, with the probability of obtaining a singleton 0.40 < f*(1) < 0.42.

4. "Exceptionally poor prognosis" patients: p < 0.15.
For this category of patients, n* > 6; but even then, the probability of obtaining a singleton f*(1) ≤ 0.40.
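A compact way to express this categorization (a sketch only; the thresholds are those of the list above):

def prognosis_category(p):
    # Patient category based on the SEPS parameter p, per the list above
    if p >= 0.5:
        return "good prognosis"                # n* = 1
    elif p >= 0.25:
        return "medium prognosis"              # n* = 2 or 3
    elif p >= 0.15:
        return "poor prognosis"                # n* = 4 to 6
    return "exceptionally poor prognosis"      # n* > 6

print(prognosis_category(0.3))  # "medium prognosis"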
Let us now use the probability model to examine, for each patient category, how the number of embryos transferred influences the potential treatment outcomes.

Beginning with a representative value of p = 0.5 for the "good prognosis" category (e.g., women under age 35, as was the case in the study quoted in the WSJ article that we used to illustrate the estimation of p in Eqs (11.2) and (11.3)), Fig 11.12 shows a plot of the probabilities of the outcomes of interest, P_0, P_1, and P_MB, as a function of n.

A few points are worth noting in this figure. First, P_1, the probability of obtaining a singleton, is maximized for n = 1, as noted previously; however, the figure also shows that the same probability, P_1 = 0.5, is obtained for n = 2. Why then is n* = 1 recommended and not n* = 2? Because with n = 1 there is absolutely no risk whatsoever of multiple births, whereas with n = 2 the probability of multiple births (only twins, in this case) is no longer zero, but a hard-to-ignore 0.25.

Note also that transferring more than 1 or 2 embryos for this class of patients actually reduces the probability of a singleton outcome, at the expense of rapidly increasing the probability of multiple births. (Recall the constraint relationship between the probabilities of pregnancy outcomes shown in Eq (11.12).)

When 5 embryos are transferred, there is an astonishing 81% chance of multiple births and, specifically, a 1-in-32 chance of quintuplets (details not shown, but easily computed from Eq (11.1)).

FIGURE 11.12: IVF treatment outcome probabilities (P0, P1, PMB) for "good prognosis" patients with p = 0.5, as a function of n, the number of embryos transferred.

Thus, going back to the WSJ story of Theresa Anderson with which we opened this chapter, it is now clear that perhaps what her doctor meant to say was that there was one chance in 30 that all five embryos would "take", rather than that only one would. With the binomial probability model, one could have predicted the distinct possibility of this particular patient delivering quintuplets, because she belongs to the category of patients with "good prognosis", for which p ≥ 0.5.
Next, consider a representative value of p = 0.3 for "medium prognosis" patients, which yields the plots shown in Fig 11.13 for the probabilities P_0, P_1, and P_MB as a function of n. Observe that, as noted previously, the optimum n* corresponding to this specific value of p = 0.3 is clearly 3. Transferring fewer embryos is characterized by much higher values of P_0, the probability of producing no live birth; transferring 4 embryos increases the probability of multiple births by more than is offset by the simultaneous reduction in P_0; and with 5 embryos, the probability of multiple births dominates all other outcome probabilities.
Finally, when a representative value of p = 0.18 is selected for "poor prognosis" patients, the resulting outcome probability plots are shown in Fig 11.14. First note that, for this value of p, the optimum n* is 5. Recall that the Combelles et al., 2005, study concluded, from evidence in their clinical data, that n = 5 is optimum for women more than 40 years of age. In light of our theoretical analysis, the implication is that for the class of patients referred to in this study, p = 0.18 is a reasonable characteristic parameter. As an independent corroboration of the model-based analysis shown in this figure, consider the following result from the Schieve, et al., 1999, study, which states:

FIGURE 11.13: IVF treatment outcome probabilities (P0, P1, PMB) for "medium prognosis" patients with p = 0.3, as a function of n, the number of embryos transferred.
Among women 40 to 44 years of age, the multiple-birth rate was less than 25% even if 5 embryos were transferred.

If the patients in this study can reasonably be expected to have characteristics similar to those in the Combelles study, then, from our preceding theoretical analysis, p = 0.18 is a reasonable estimate for them also. From our probability model, the specific value of P_MB when n = 5 for this class of patients is therefore predicted to be 0.22 (dotted line and diamond symbol for n = 5 in Fig 11.14), which agrees precisely with the above-noted result of the Schieve et al. study.

11.6 Sensitivity Analysis

11.6.1 General Discussion

Clearly, the heart of the theoretical model-based analysis presented thus far is the parameter p. This, of course, is in agreement with clinical practice, where embryo transfer policies are based on what the Canadian guidelines refer to as patient-specific, embryo-specific, and cycle-specific determinants of implantation and live birth. From such a model-based perspective, this parameter is the single most important parameter in IVF treatment: it determines the optimum number of embryos to transfer and, in conjunction with the actual number of embryos transferred (optimum or not), determines the various possible pregnancy outcomes and the chances of each one occurring.

FIGURE 11.14: IVF treatment outcome probabilities (P0, P1, PMB) for "poor prognosis" patients with p = 0.18, as a function of n, the number of embryos transferred.

Given such importance, and since no parameter can be estimated perfectly, it is entirely reasonable to ask: what happens when the model parameter p is estimated inaccurately and/or imprecisely? Or, in more practical terms: how sensitive to inevitable parameter estimation errors are the optimization results, model predictions, and other model-based analyses?
It turns out that the binomial model-based analysis of IVF treatment presented thus far is remarkably robust, being quite insensitive to errors in the estimates of p. But before providing a general, rigorous sensitivity analysis, let us first examine the practical implications of over- or under-estimating p for some specific cases of practical importance discussed previously. Consider, for example, the "good prognosis" patients for whom the true but unknown value of p is 0.5, so that the true outcome probabilities are as shown in Fig 11.12. Overestimating p as 0.6 (or even 0.7 or higher) leads to a determination of n* still as 1 (see Fig 11.7), and the end result will be the transfer of a single embryo, precisely as would have been done in the nominal (no-error) case; there are thus no practical consequences of such an overestimation. Conversely, underestimating p as 0.4 (a 20% error) leads to an overestimation of n* as 2 (again, see Fig 11.7), and from Fig 11.12 we see the primary consequences: the probability of twins increases to 0.25 (from the nominal case of 0) at the same time that the probability of no live birth drops to 0.25 (from the nominal case of 0.5); however, the probability of a singleton remains at the optimum value of 0.5, as in the nominal case.
For "medium prognosis" patients, for whom p = 0.3 in actual fact but is overestimated as 0.4 (a 33% error), the result is that instead of n* = 3, only 2 embryos will be transferred. We see from Fig 11.13 that for the most desirable outcome probability, P_1, the consequence of transferring only 2 embryos is surprisingly minimal; the more serious consequence is the increase in P_0, the probability of no live birth, which is somewhat offset by a reduction in P_MB, the probability of multiple births. Underestimating p as 0.25 (a 16.7% error) leads to no change in n*; it will take a 33% underestimation error (p̂ = 0.2) to change the recommended n* to 4. Again, the implications for P_1 are minimal; P_0 is reduced, but at the expense of an increase in P_MB. Still, even under these conditions, P_1 remains the highest of the three probabilities.

For the "poor prognosis" patients characterized in Fig 11.14, with p = 0.18, overestimating p as 0.26 or underestimating it as 0.1 (absolute-magnitude errors of 0.08, or 44.4%) leads to the following consequences: overestimation leads to a transfer of 3 embryos instead of 5, with minimal implications for P_1 but with a substantial increase in P_0 that is partially offset by a concomitant reduction in P_MB; underestimation leads to a seriously asymmetric increase to 9 in the number of embryos transferred, with serious implications for the balance between the outcome probabilities, which shifts in favor of increasing P_MB (while decreasing P_0), but at the expense of decreasing the desirable P_1.
The point of this discussion is that, in practical terms, it will take substantial percentage estimation errors (mostly in the direction of underestimating p) to affect IVF treatment outcome probabilities to any noticeable extent. The reason for this robustness should be evident from Fig 11.7, which shows how insensitive the value of n* is to p for values in excess of 0.3. Thus, for "good prognosis" patients (p > 0.5), almost regardless of the actual specific value estimated for p, n* is always 1; for "medium prognosis" patients, with p between 0.25 and 0.5 (a range so broad it covers half the probabilities associated with all patient prognoses not considered "good"), n* is either 2 or 3. Thus the model prescription is completely insensitive for "good prognosis" patients, and its recommendation of 2 or 3 embryos over the wide range 0.25 ≤ p < 0.5 indicates remarkably mild sensitivity for "medium prognosis" patients. The steep climb in the optimum number of embryos as p decreases from 0.2 indicates increased sensitivity to errors in p for "poor prognosis" patients. Nevertheless, given that the nominal values of p are also low in this range, the relative sensitivity is in fact not as high, as we now show with the following general theoretical derivations.

11.6.2 Theoretical Sensitivity Analysis

The general question of interest here is: how sensitive are the model and its analysis results to errors in p? Or, in more practical terms: how good does the estimate of p have to be so that the prescription of n*, and the theoretical analysis following from it, can be considered reliable? Such questions are answered quantitatively with the relative sensitivity function, defined in this case as:

    S_r = \frac{\partial \ln n^*}{\partial \ln p}    (11.20)

It is perhaps best understood in terms of what it means for the transmission of errors in p into errors in the recommended n*: a relative sensitivity S_r implies that an error Δp in p translates to an error Δn* in n* according to

    \frac{\Delta n^*}{n^*} \approx S_r \frac{\Delta p}{p}    (11.21)

From the expression in Eq (11.16), we obtain, through the usual calculus techniques and after some simple algebraic manipulations, the closed-form analytical expression

    S_r = \frac{p}{(1-p)\ln(1-p)}    (11.22)

a plot of which is shown in Fig 11.15. First, note in general that S_r is always negative and greater than 1 in absolute value. This implies that (i) overestimating p always translates to underestimating n*, and vice versa (as already alluded to in the preceding general discussion); and (ii) the relative over- or under-estimation error in p is always magnified in the corresponding relative error in n*. Quantitatively, the specific information contained in this figure is best illustrated by considering what it indicates about the region 0 < p < 0.25, identified in our previous discussion as critical for "poor prognosis" patients in terms of sensitivity to errors in p. Observe that in this region, -1.16 < S_r < -1.0, with the implication that a 10% error in the estimate of p translates to no more than an 11.6% error in the recommended n*. Keep in mind that, in practice, fractional errors in n* are inconsequential until they become large enough to be rounded up (or down) to the nearest integer. Thus, an 11.6% error on a nominal value of n* = 10 (or, for that matter, any error less than 15%) translates to only 1 embryo. Thus, even though from our previous discussion we know that good estimates of p are required for "poor prognosis" patients, the relative sensitivity function (Eq (11.22) and Fig 11.15) indicates that the model can tolerate as much as a 10% error in the estimate of p with little or no consequence for the final results.
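A quick numerical check of Eq (11.22) confirms the values read off Fig 11.15 (a sketch; the function name is ours):

from math import log

def relative_sensitivity(p):
    # S_r of Eq (11.22): relative sensitivity of n* to errors in p
    return p / ((1.0 - p) * log(1.0 - p))

for p in (0.10, 0.25, 0.50):
    print(p, round(relative_sensitivity(p), 3))
# -> about -1.054 at p = 0.10, -1.159 at p = 0.25, -1.443 at p = 0.50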

FIGURE 11.15: Relative sensitivity of the binomial-model-derived n* to errors in estimates of p, as a function of p.

11.7 Summary and Conclusions

11.7.1 Final Wrap-up

The fundamental premise of this chapter is that the random phenomenon inherent to IVF treatment (and its attendant issues), fraught as it is with uncertainty, is in fact amenable to systematic analysis via probability theory. In particular, in an IVF treatment cycle involving the transfer of n embryos, the pregnancy outcome, X, is well-modeled as a binomial, Bi(n, p), random variable, where the binomial parameter p represents the single embryo probability of success. Thus, even though the actual pregnancy outcome is uncertain, the probability of obtaining any one of the possible outcomes (no live birth, a singleton, twins, triplets, quadruplets, ..., n-tuplets) is given by the pdf shown in Eq (11.1). Of particular interest are the specific and explicit expressions derived from this general pdf, shown in Eqs (11.9)-(11.11), for the probabilities P_0, P_1, and P_MB of no live birth, a singleton, and multiple births (all other live births that are not singletons), respectively.
Even though it was not necessarily designed for validating the binomial model (an exercise that would have required carefully controlling for sources of error due to excessive and undesirable parameter variation, by using only cohort groups with approximately homogeneous characteristics), we were still able to use the clinical data from the Elsner study (Elsner et al., 1997) to show that the proposed binomial model indeed provides a reasonable representation of reality.
The primary advantage of such a theoretical model is that it can be used to solve analytically, in a general fashion, the vexing problem of determining the optimum number of embryos to transfer in any particular IVF treatment cycle: an optimization problem requiring the maximization of the chances of delivering a singleton while simultaneously minimizing the chances of obtaining undesirable outcomes (no live birth, and multiple births). The binomial model provides the additional bonus that a single optimization problem simultaneously achieves both objectives. The final result, the explicit mathematical expression shown in Eq (11.16) and plotted in Fig 11.7, states this optimum number of embryos, n*, explicitly as a function of the parameter p. Also, we have shown that these results are robust to the unavoidable errors incurred in estimating this parameter in practice.

11.7.2 Conclusions and Perspectives on Previous Studies and Guidelines

Here, then, is a series of key conclusions from the discussion in this chapter. First, the following is a list of the characteristics of the binomial model of IVF treatment:

1. It provides a valid mathematical representation of reality;
2. It depends on a single key parameter, p, whose physiological meaning is clear (the single embryo probability of success, or the embryo implantation potential); and
3. It is used to obtain an explicit expression for the optimum number of embryos to transfer in an IVF cycle, a result that is robust to uncertainty in the single model parameter.
Next, we note that the binomial model-based prescription of the optimum number of embryos to transfer agrees perfectly with earlier heuristics and guidelines developed on the basis of specific empirical studies. While the precise number can be obtained analytically from Eq (11.16), the practical implications of the results may be summarized as follows:

1. For "good prognosis" patients, p ≥ 0.5, transfer only 1 embryo;
2. For "medium prognosis" patients, 0.25 ≤ p < 0.5, transfer 2 embryos for those with p ≥ 0.35 and 3 for those with p < 0.35;
3. For "poor prognosis" patients, p < 0.25, transfer n* > 3 embryos, with the specific value in each case depending on the value of p, as determined by Eq (11.16) rounded to the nearest integer: for example, n* = 4 for p = 0.2; n* = 5 for p = 0.18; n* = 6 for p = 0.15; etc.
These results agree with, but also generalize, the results of previous clinical studies, some of which were reviewed in earlier sections. Thus, for example, the primary result of the Combelles et al., 2005, study, that n* = 5 is optimum for patients older than 40 years, strictly speaking can only be considered valid for the specific patients in the study used for the analysis. The prescription of the binomial model, on the other hand, is general, and not restricted to any particular data set; it asserts that n* = 5 is optimal for all patients for whom p = 0.18, whether they are 40 years old, younger, or older. This leads to the final set of conclusions, having to do with the perspective on previous studies and IVF treatment guidelines provided by the binomial model-based analyses.


In light of the analyses and discussions in this chapter, the most important implication of the demonstrated appropriateness of the binomial model for IVF treatment is that treatment guidelines should be based on the value of the parameter p for each patient, not so much on age. (See recommendation 1 of the Canadian guidelines summarized earlier.) From this perspective, age in the previous studies is seen as a convenient, but not always perfect, surrogate for this parameter. It is possible, for example, for a younger person to have a lower SEPS parameter p, for whatever reason, uterine or embryo-related. Conversely, an older patient treated with eggs from a young donor will more than likely have a higher-than-expected SEPS parameter value. In all cases, no conflicts arise if the transfer policy is based on the best estimate of p rather than on age: p is the more direct determinant of the embryo implantation rate; age is an indirect, and not necessarily foolproof, indicator.
On the basis of this section's model-based discussion, therefore, all the previous studies and guidelines may be consolidated as follows:

1. For each patient, obtain the best estimate, p̂, of the SEPS parameter, p;
2. Use p̂ to determine n*, either from the analytical expression in Eq (11.16) rounded to the nearest integer, or else from Fig 11.7;
3. If desired, Eqs (11.9), (11.10) and (11.11) may then be used to analyze the outcome probabilities given the choice of the number of embryos to transfer (see, for example, Fig 11.13).

Finally, it should not be lost on the reader just how much the probability modeling approach discussed in this chapter has facilitated the analysis and optimization of such a complex and important problem as that posed by IVF outcome optimization. Even with the unavoidable idealization implied in the SEPS parameter, p, this binomial model parameter provides valuable insight into the fundamental characteristics of the IVF outcome problem. It also allows a consolidation of previous qualitative results and guidelines into the coherent and quantitative three-point guideline enumerated above.

References

1. Austin, C. M., S. P. Stewart, J. M. Goldfarb, et al., 1996. Limiting multiple pregnancies in in vitro fertilization/embryo transfer (IVF-ET) cycles, J. Assisted Reprod. and Genetics, 13 (7), 540-545.
2. Bolton, V. N., S. M. Hawes, C. T. Taylor, and J. H. Parsons, 1989. J. In Vitro Fert. Embryo Transf., 6 (1), 30-35.
3. Combelles, C. M. H., B. Orasanu, E. S. Ginsburg, and C. Racowsky, 2005. Optimum number of embryos to transfer in women more than 40 years of age undergoing treatment with assisted reproductive technologies, Fert. and Ster., 84 (6), 1637-1642.
4. Elsner, C. W., M. J. Tucker, C. L. Sweitzer, et al., 1997. Multiple pregnancy rate and embryo number transferred during in vitro fertilization, Am. J. Obstet. Gynecol., 177 (2), 350-357.
5. Engmann, L., N. Maconochie, S. L. Tan, and J. Bekir, 2001. Trends in the incidence of births and multiple births and the factors that determine the probability of multiple birth after IVF treatment, Hum. Reprod., 16 (12), 2598-2605.
6. Geber, S., and M. Sampaio, 1999. Blastomere development after embryo biopsy: a new model to predict embryo development and to select for transfer, Hum. Reprod., 14 (3), 782-786.
7. Jansen, R. P. S., 2003. The effect of female age on the likelihood of a live birth from one in-vitro fertilisation treatment, Med. J. Aust., 178, 258-261.
8. Pandian, Z., S. Bhattacharya, O. Ozturk, G. I. Serour, and A. Templeton, 2004. Number of embryos for transfer following in-vitro fertilisation or intra-cytoplasmic sperm injection, Cochrane Database of Systematic Reviews, 4, Art. No. CD003416. DOI:10.1002/14651858.CDC003146.pub2.
9. Patterson, B., K. B. Nelson, L. Watson, et al., 1993. Twins, triplets, and cerebral palsy in births in Western Australia in the 1980s, Brit. Med. J., 307, 1239-1243.
10. Reynolds, M. A., L. A. Schieve, G. Jeng, H. B. Peterson, and L. S. Wilcox, 2001. Risk of multiple birth associated with in vitro fertilization using donor eggs, Am. J. Epidemiology, 154 (11), 1043-1050.
11. Schieve, L. A., H. B. Peterson, S. Meikle, et al., 1999. Live birth rates and multiple-birth risk using in vitro fertilization, JAMA, 282, 1832-1838.
12. Strandell, A., C. Bergh, and K. Lundin, 2000. Selection of patients suitable for one-embryo transfer may reduce the rate of multiple births by half without impairment of overall birth rates, Hum. Reprod., 15 (12), 2520-2525.
13. Templeton, A., and J. K. Morris, 1998. Reducing the risk of multiple births by transfer of two embryos after in vitro fertilization, The New Eng. J. Med., 339 (9), 573-577.
14. Thurin, A., J. Hausken, T. Hillensjö, et al., 2004. Elective single-embryo transfer versus double-embryo transfer in in vitro fertilization, The New Eng. J. Med., 351 (23), 2392-2402.
15. Vahratian, A., L. A. Schieve, M. A. Reynolds, and G. Jeng, 2003. Live-birth rates and multiple-birth risk of assisted reproductive technology pregnancies conceived using thawed embryos, USA 1999-2000, Hum. Reprod., 18 (7), 1442-1448.
16. Yaron, Y., A. Amit, A. Kogosowski, et al., 1997. The optimal number of embryos to be transferred in shared oocyte donation: walking the thin line between low pregnancy rates and multiple pregnancies, Hum. Reprod., 12 (4), 699-702.


PROJECT ASSIGNMENT

Beta-Binomial Model for the Elsner Data

As noted at the end of Section 11.4, to deal appropriately with the mixed population involved in the Elsner clinical study, a theoretical probability model should be used for p; this must then be combined with the binomial model to yield a mixture model. When the Beta, B(α, β), model is chosen for p, the resulting mixture model is known as a Beta-Binomial model.

As a project assignment, develop a Beta-Binomial model for the Elsner data in Table 11.2 and compare the model prediction with the data and with the binomial model prediction presented in this chapter.
You may approach this assignment as follows.

The Beta-Binomial mixture distribution arises from a binomial, Bi(n, p), random variable, X, whose parameter p, rather than being constant, has a Beta distribution; i.e., it consists of the conditional distribution

    f(x|p) = \binom{n}{x} p^x (1-p)^{n-x}    (11.23)

in conjunction with the marginal distribution for p,

    f(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} p^{\alpha-1}(1-p)^{\beta-1}; \quad 0 < p < 1; \; \alpha > 0; \; \beta > 0    (11.24)

1. Obtain the expression for f(x), the Beta-Binomial pdf.

2. Show that the mean and variance of this pdf are as follows:

    E(X) = n\frac{\alpha}{\alpha+\beta} = n\pi    (11.25)

and, with π defined as in Eq (11.25),

    Var(X) = n\pi(1-\pi)\rho    (11.26)

with ρ defined as:

    \rho = \frac{\alpha+\beta+n}{\alpha+\beta+1}    (11.27)

(Caution: these results are not easy to obtain by brute force.) Compare and contrast the expressions in Eqs (11.25) and (11.26) with the corresponding expressions for the binomial random variable.
3. Compute the mean, x̄, and variance, s², of the complete Elsner data shown in Table 11.2. Determine appropriate estimates for the Beta-Binomial model parameters, α and β, in terms of the values computed for x̄ and s² from the data.


4. Plot the theoretical pdf f(p) using the values you determined for the parameters α and β. Discuss what this implies about how the SEPS parameter is distributed in the population involved in the Elsner clinical study.

5. Compute the probabilities for x = 0, 1, ..., 6 using the Beta-Binomial model, and compare with the data.
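As a numerical aid for step 5 (our suggestion, not a requirement of the assignment), the Beta-Binomial pmf is available directly in SciPy; alpha_hat and beta_hat below are placeholders for the estimates you obtain in step 3:

from scipy.stats import betabinom

alpha_hat, beta_hat = 2.0, 10.0   # hypothetical values; substitute your step-3 estimates
n = 6                             # maximum number of embryos transferred in the study
for x in range(n + 1):
    # scipy.stats.betabinom implements the mixture of Eqs (11.23)-(11.24)
    print(x, betabinom.pmf(x, n, alpha_hat, beta_hat))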
Write a report describing your model development and data analysis, including comments on the fit of this mixture model to the data versus the binomial model fit that was discussed in this chapter.

Part IV

Statistics: Quantifying Random Variability

The days of our years are threescore years and ten; and if by reason of strength they be fourscore years, yet is their strength labor and sorrow; for it is soon cut off, and we fly away. ... So teach us to number our days that we may apply our hearts unto wisdom.
Moses (c. 1450 BC), Psalms 90:10, 12, KJV

Chapter 12: Introduction to Statistics
Chapter 13: Sampling
Chapter 14: Estimation
Chapter 15: Hypothesis Testing
Chapter 16: Regression Analysis
Chapter 17: Probability Model Validation
Chapter 18: Nonparametric Methods
Chapter 19: Design of Experiments
Chapter 20: Application Case Studies III: Statistics

Chapter 12
Introduction to Statistics

12.1 From Probability to Statistics ............................... 408
     12.1.1 Random Phenomena and Finite Data Sets ................. 408
     12.1.2 Finite Data Sets and Statistical Analysis ............. 411
            Populations, Samples and Inference .................... 412
     12.1.3 Probability, Statistics and Design of Experiments ..... 414
     12.1.4 Statistical Analysis .................................. 415
            Descriptive Statistics ................................ 415
            Inductive Statistics .................................. 416
            Statistical Design of Experiments ..................... 416
12.2 Variable and Data Types ...................................... 417
12.3 Graphical Methods of Descriptive Statistics .................. 418
     12.3.1 Bar Charts and Pie Charts ............................. 419
     12.3.2 Frequency Distributions ............................... 423
     12.3.3 Box Plots ............................................. 427
     12.3.4 Scatter Plots ......................................... 430
12.4 Numerical Descriptions ....................................... 435
     12.4.1 Theoretical Measures of Central Tendency .............. 436
            The Mean .............................................. 436
            The Median ............................................ 437
            The Mode .............................................. 438
     12.4.2 Measures of Central Tendency: Sample Equivalents ...... 438
            Sample Mean ........................................... 438
            Sample Median ......................................... 439
            Sample Mode ........................................... 439
     12.4.3 Measures of Variability ............................... 439
            Range ................................................. 440
            Average Deviation ..................................... 440
            Sample Variance and Standard Deviation ................ 440
     12.4.4 Supplementing Numerics with Graphics .................. 442
12.5 Summary and Conclusions ...................................... 444
APPENDIX .......................................................... 447
REVIEW QUESTIONS .................................................. 447
EXERCISES ......................................................... 449
APPLICATION PROBLEMS .............................................. 454

To understand God's thoughts we must study Statistics, for these are the measure of His purpose.
Florence Nightingale (1820-1910)

The uncertainty intrinsic to individual observations of randomly varying phenomena, we now know, need not render mathematical analysis impossible. The discussions in Parts II and III have shown how to carry out such analysis, so long as one is willing to characterize the more stable complete ensemble, and not the capricious individual observations. The key has been identified as the ensemble characterization, via the probability distribution function (pdf), f(x), which allows one to carry out probabilistic analysis about the occurrence of various observations of the random variable X.
In practice, however, f(x), the theoretical ensemble model for the random phenomenon in question, is never fully available: we may know its form, but the parameters are unknown and must be determined for each specific problem. Only a finite collection of individual observations, {x_i}_{i=1}^{n}, is available. But any finite set of observations will, like individual observations, be subject to the vagaries of intrinsic variability. Thus, in any and all analysis carried out on this basis, uncertainty is unavoidable. Given this practical reality, how then does one carry out systematic analysis of random phenomena, which requires the full f(x), on the basis of finite data? How does one fully characterize the entire ensemble from only a finite collection of observations? Clearly, {x_i}_{i=1}^{n} is related to, and contains information about, the full f(x); but how are they related, and how can this information be extracted and exploited? What kinds of analyses are possible with finite observations, and how does one cope with the unavoidable uncertainty?

Statistics provides rigorous answers to such questions by using the very same probability machinery of Parts II and III to deal generally with the theoretical and practical issues associated with analyzing randomly varying phenomena on the basis of finite data. Our study begins in this chapter with an introduction of Statistics, first as a conceptual framework complementary to, and dependent on, Probability. We then provide a brief overview of the components of this statistical analysis framework, as a prelude to the detailed study of each of these components in the remaining chapters of Part IV.

12.1 From Probability to Statistics

12.1.1 Random Phenomena and Finite Data Sets

A systematic study of randomly varying phenomena involves three distinct but intimately related entities:

1. X: the actual variable of interest. This is the random variable discussed extensively in Chapter 4 and illustrated in Chapter 1 (prior to any formal discussions) with two examples: the yield obtainable from chemical processes (two continuous random variables), and the number of inclusions on a manufactured glass sheet of specified area (a discrete random variable). It is an abstract, conceptual entity.


2. {x_i}_{i=1}^{n}: n individual observations; one set out of many other possible realizations of the random variable. This is commonly referred to as the data; it is the only entity available in practice (from experiments). In the illustrative examples of Chapter 1, n = 50 each for process A and process B, and n = 60 for the glass sheets.

3. f(x): the aggregate (or ensemble) description of the random variable; the probability distribution function. This is the theoretical model of how the probability of obtaining various results is distributed over the entire range of all possible values observable for X. It was discussed and characterized generically in Chapter 4, before specific forms were derived for various random variables of interest in Chapters 8 and 9. There we saw that it consists of a functional form, f(x), and characteristic parameters; it is therefore more completely represented as f(x|θ), which literally reads "f(x) given θ," where θ is the vector of characteristic parameters. In the first illustrative example of Chapter 1, f(x) is the Gaussian distribution, with parameters θ = (μ, σ²); in the second, it is a Poisson distribution with one characteristic parameter, λ.
Probabilistic random phenomena analysis is based entirely on f(x). This is what allows us to abandon the impossible task of predicting an intrinsically unpredictable entity in favor of the more mathematically tractable task of computing the probabilities of observing the randomly varying outcomes. Until now, our discussion about probability and the pdf has been based on the availability of the complete f(x), i.e., f(x|θ) with θ known. This allowed us to focus on the first task: computing probabilities and carrying out analysis, given any functional form f(x) with the values of the accompanying characteristic parameters, θ, assumed known. We have not been particularly concerned with such practical issues as where either the functional form or the specific characteristic parameter values come from. With the first task complete, we are now in a position to ask important practical questions: in actual practice, what is really available about any random variable of interest? And how does one go about obtaining the complete f(x) required for random phenomena analysis?

It turns out that for any specific randomly varying phenomenon of interest, the theoretical description, f(x), is never completely available, usually because the characteristic parameters associated with the particular random variable in question are unknown; only finite data, in the form of a set of observations {x_i}_{i=1}^{n} (n < ∞), is available in practice. The immediate implication is that to apply the theory of random phenomena analysis successfully to practical problems, we must now confront the practical matter of how the complete f(x) is to be determined from finite data, for any particular random variable, X, of interest. The problem at hand is thus one of analyzing randomly varying phenomena on the basis of finite data sets; this is the domain of Statistics.
With Probability, f(x), the functional form along with the parameters, is given, and analysis involves determining the probabilities of obtaining specific observations {x_i}_{i=1}^{n} from the random variable X. The reverse is the case with Statistics: {x_i}_{i=1}^{n} is given, and analysis involves determining the appropriate (and complete) f(x), the functional form and the parameters, for the random variable X that generated these specific observations. (Readers familiar with the subject matter of Process Control may recognize that the relationship between Statistics and Probability, even in the skeleton form given above, is directly analogous to the relationship between Process Identification and Process Dynamics, with the classical transfer function model g(s), or any other dynamic model form, playing the role of the pdf f(x).) One should not be surprised, then, that the theoretical concept of the pdf, f(x), still plays a significant role in handling this new "reverse" problem. This function provides the theoretical basis for determining, from the finite data set {x_i}_{i=1}^{n}, the most likely underlying complete f(x), and for quantifying the associated uncertainty. It is in this sense that Statistics is referred to as a methodology for inferring the characteristics of the complete pdf, f(x), from finite data sets {x_i}_{i=1}^{n}, and for quantifying the associated uncertainty. In general, Statistics is typically defined as a methodology for:

1. efficiently extracting information from data; and
2. efficiently quantifying such information,

a definition broad enough to encompass all aspects of the subject matter, as we show in the remaining chapters of Part IV. Thus, while the focus of Probability is the random variable, X, the focus of Statistics is the finite data set {x_i}_{i=1}^{n} as a specific realization of X.
Example 12.1 PROBABILITY AND STATISTICAL ANALYSIS FOR COIN TOSS EXPERIMENT: A PREVIEW
Consider the illustrative experiment introduced in Example 3.1 in Chapter 3, and in Example 4.1 of Chapter 4, which involved tossing a coin 3 times and recording the number of observed tails. For a specific coin, after performing this three-coin toss experiment exactly 10 times under identical conditions, the following result set was obtained:

S1 = {0, 1, 3, 2, 2, 1, 0, 1, 2, 2}.

(1) What is the random variable X, and for a generic coin, what is the theoretical ensemble description f(x)? (2) How is the specific experimental result related to the random variable, X? (3) For a specific coin for which p = 0.5, compute the probabilities of obtaining 0, 1, 2 or 3 tails in each experiment. How consistent with the observed results are the theoretical probabilities?

Solution:
(1) Recall from these previous examples and discussion that the random variable X is the total number of tails obtained in the 3 tosses; and from the discussion in Chapter 8, this is a binomial random variable, Bi(n, p), with n = 3 in this particular case, and p as the characteristic parameter.

The ensemble description is the binomial pdf:

$$f(x|p) = \binom{3}{x} p^x (1-p)^{3-x}; \quad x = 0, 1, 2, 3 \qquad (12.1)$$
from which the probabilities of observing x = 0, 1, 2, 3 can be computed for any given value of p.
(2) The data set S1 is an experimental realization of the random variable X; the variability it displays is characteristic of randomly varying phenomena. The specific values observed are expected to change with each performance of the experiment.
(3) When p = 0.5, the required probabilities are obtained as P(X = 0) = 1/8 = 0.125; P(X = 1) = 3/8 = 0.375; P(X = 2) = 3/8 = 0.375; P(X = 3) = 1/8 = 0.125. Strictly from the limited data, observe that two of the 10 values are 0, three values are 1, four values are 2, and one value is 3; the various relative frequencies of occurrence are therefore fr(0) = 0.2 for the observation X = 0; fr(1) = 0.3; fr(2) = 0.4; fr(3) = 0.1. If one can assume that this data set is representative of the typical behavior of this random variable, and that the observed relative frequency of occurrence can be considered an approximate representation of the true probability of occurrence, then the observed relative frequency distribution, fr(x), is actually fairly close to the theoretical probabilities computed under the assumption that p = 0.5. The implication is that the data set appears to be somewhat consistent with the theoretical model when p = 0.5.

Statistical analysis is concerned with such analysis as illustrated above, but of course with more precision and rigor.
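For readers who wish to reproduce these numbers, the following is a minimal Python sketch (ours, not part of the text; it assumes Python 3.8+ for math.comb) that evaluates the binomial pmf of Eq (12.1) at p = 0.5 and compares it with the relative frequencies computed from S1:

    from math import comb

    # Observed results of 10 replicates of the three-coin toss experiment
    S1 = [0, 1, 3, 2, 2, 1, 0, 1, 2, 2]

    n_tosses, p = 3, 0.5

    for x in range(n_tosses + 1):
        # Theoretical binomial pmf, Eq (12.1): C(3, x) p^x (1 - p)^(3 - x)
        f_theo = comb(n_tosses, x) * p**x * (1 - p)**(n_tosses - x)
        # Empirical relative frequency from the finite sample
        f_rel = S1.count(x) / len(S1)
        print(f"x = {x}: theoretical = {f_theo:.3f}, observed = {f_rel:.2f}")

Running this reproduces the values quoted above: 0.125 versus 0.2, 0.375 versus 0.3, 0.375 versus 0.4, and 0.125 versus 0.1.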

12.1.2 Finite Data Sets and Statistical Analysis

Three concepts are central to statistical analysis: Population, Sample, and Statistical Inference.
1. Population: This is the complete collection of all the data obtainable from a random variable of interest. Clearly, it is impossible to realize the population in actual practice, but as a conceptual entity, it serves a critical purpose: the population is the full observational realization of the random variable, X. It is to statistics what the sample space (or the random variable space) is to probability theory, an important statement we shall expand on shortly.

2. Sample: A specific set of actual observations (measurements) obtained upon performance of an experiment. By definition, this is a (finite) subset of data selected from the population of interest. It is the only information about the random variable that is actually available in practice.


3. Statistical Inference: Any statement made about the population on the basis of information extracted from a sample. Because the sample will never encompass the entire population, such statements must include a measure of the unavoidable associated uncertainty, a measure of how reliable such statements are.

These concepts (especially the first one) require further clarification beyond these brief descriptions.
Populations, Samples and Inference
Probability theory, with its focus on the random variable, X (and its uncertain outcomes), utilizes the sample space, Ω, and the random variable space, VX, as the basis for developing the theoretical ensemble description, f(x). In practice, for any specific problem, the focus shifts to the actual observed data, $\{x_i\}_{i=1}^n$; and the equivalent conceptual entity in statistics becomes the population, the observational ensemble to be described and characterized. Now, while the sample space or the random variable space in probability theory can be specified a-priori and generically, the population in statistics refers to observational data, making it a specific, a-posteriori entity. To illustrate, recall the three-coin toss experiment in Example 12.1. Before any actual experiments are performed, we know VX, the random variable space of all possible outcomes, to be:

VX = {0, 1, 2, 3},     (12.2)

indicating that with each performance of the experiment, the outcome will be one of the four numbers contained in VX; and, while we cannot predict with certainty what any specific outcome will be, we can compute the probabilities of observing any one of the four possible alternatives. For the generic coin for which p is the probability of obtaining a tail in a single toss, we are able to obtain (as illustrated in Example 12.1) an explicit expression for how the outcome probabilities are distributed over the values in VX for this binomial random variable. For a specific coin for which, say, p = 0.5, we can compute, from Eq (12.1), the probabilities of obtaining 0, 1, 2, or 3 tails, as we just did in Example 12.1 (also see Table 4.1). In practice, the true value of p associated with a specific coin is not known a-priori; it must be determined from experimental data such as

S1 = {0, 1, 3, 2, 2, 1, 0, 1, 2, 2},     (12.3)

as in Example 12.1. This is considered as a single, 10-observation sample for the specific coin in question, one of a theoretically infinite number of other possible samples, a sample drawn from the conceptual population of all such data obtainable from this specific coin characterized by the true, but unknown, value p = 0.5. Although finite (and hence incomplete as an observational realization of the binomial random variable in question), S1 contains information about the true value of the characteristic parameter p


associated with this particular coin. Determining appropriate estimates of this true value is a major objective of statistical analysis. Because of the finiteness of sample data, a second series of such experiments will yield a different set of results, for example,

S2 = {2, 1, 1, 0, 3, 2, 2, 1, 2, 1}.     (12.4)

This is another sample from the same population, and, as a result of inherent variability, we observe that S2 is different from S1. Nevertheless, this new sample also contains information about the unknown characteristic parameter, p.
Next consider another coin, this time one characterized by p = 0.8; and suppose that after performing the three-coin toss experiment, say, n = 12 times, we obtain:

S3 = {3, 2, 2, 3, 1, 3, 2, 3, 3, 3, 2, 2}.     (12.5)

As before, this set of results is considered to be just one of an infinite number of other samples that could potentially be drawn from the population characterized by p = 0.8; and, as before, it also contains information about the value of the unknown population parameter.
We may now note the following facts:

1. With probability, the random variable space for this example is finite, specifiable a-priori, and remains as given in Eq (12.2) whether p = 0.5, or 0.8, or any other admissible value.

2. Not so with the population: it is infinite, and its elements depend on the specific value of the characteristic population parameter. Sample S3 above, for instance, indicates that when p = 0.8, the population of all possible observations from the three-coin toss experiment will very rarely contain the number 1. (If the probability of observing a tail in a single toss is this high, the number of tails observed in three tosses will very likely consistently exceed 1.) This is very different from what is indicated by S2 and S1, being samples from a population of results observable from tossing a so-called fair coin with no particular preference for tails over heads.

3. Information about the true, but unknown, value of the characteristic parameter, p, associated with each coin's population is contained in each finite data set.
Let us now translate this illustration to a more practical problem. Consider an in-vitro fertilization (IVF) treatment cycle involving a 35-year-old patient and the transfer of 3 embryos. In this case, the random variable, X, is the resulting number of live births; and, assuming that each embryo leads either to a single live birth or not (i.e., no identical twins from the same egg), the possible outcomes are 0, 1, 2, or 3, just as in the three-coin toss illustration, with the random variable space, VX, as indicated in Eq (12.2). Recall from Chapter 11 that this X is also a binomial random variable, implying that the pdf in this case is also as given in Eq (12.1).
Now, suppose that this particular IVF treatment resulted in fraternal twins, i.e., x = 2 (two individual embryos developed and were successfully delivered). This value is considered to be a single sample from a population that can be understood in one of two ways: (i) from a so-called "frequentist" perspective, the population in this case is the set of all actual IVF treatment outcomes one would obtain were it possible to repeat this three-embryo transfer treatment an infinite number of times on the same patient; (ii) the population could also be conceived equivalently as the collection of the outcomes of the same three-embryo IVF treatment carried out on an infinite number of identically characterized patients. In this regard, the 10-observation sample S2 would result from sampling nine more such patients treated identically, whose pregnancy outcomes are 1, 1, 0, 3, 2, 2, 1, 2, 1, in addition to the already noted outcome x = 2 from the first patient.
With this more practical problem, as with the coin-toss experiments, the pdf is known, but the parameter p is unknown; data are available, but in the form of finite-sized samples, whereas the full ensemble characterization we seek is of the entire population. We are left with no choice but to use the samples, even though finite in size, to characterize the population. In these illustrations, this amounts to determining, from the sample data, a reasonable estimate of the true but unknown value of the parameter, p, for the specific problem at hand.
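To make this concrete, here is a small illustrative sketch (ours; formal estimation methods are the subject of Chapter 13 and beyond) that exploits the fact that for a Bi(3, p) random variable, E(X) = 3p, so that the sample average divided by 3 yields a simple estimate of p from each sample:

    # Samples from Example 12.1 and Eqs (12.3)-(12.5)
    S1 = [0, 1, 3, 2, 2, 1, 0, 1, 2, 2]        # coin with true p = 0.5
    S2 = [2, 1, 1, 0, 3, 2, 2, 1, 2, 1]        # same coin, second sample
    S3 = [3, 2, 2, 3, 1, 3, 2, 3, 3, 3, 2, 2]  # coin with true p = 0.8

    for name, sample in [("S1", S1), ("S2", S2), ("S3", S3)]:
        # For X ~ Bi(3, p), E(X) = 3p, so p_hat = (sample average)/3
        p_hat = sum(sample) / len(sample) / 3
        print(f"{name}: p_hat = {p_hat:.3f}")

The estimates, 0.467 and 0.500 from S1 and S2, and 0.806 from S3, lie close to the respective true values, while the difference between the two estimates obtained from the same coin illustrates the sampling variability just discussed.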
Observe therefore that in solving practical problems, the population, the full observational manifestation of the random variable, X, is the ideal, conceptual entity one wishes to characterize. The objective of random phenomena analysis is to characterize it completely with the pdf f(x|θ), where θ represents the parameters characteristic of the specific population in question. However, because it is impossible to realize the population in its entirety via experimentation, one must settle for characterizing it by drawing statistical inference from a finite sample subset. Clearly, the success of such an endeavor depends on the sample being representative of the population. Statistics therefore involves not only systematic analysis of (necessarily finite) data, but also systematic data collection in such a way that the sample truly reflects the population, thereby ensuring that the sought-after information will be contained in the sample.

12.1.3 Probability, Statistics and Design of Experiments

We are now in a position to connect all these concepts into a coherent whole, as illustrated in Fig 12.1. Associated with every random variable, X, is a pdf, f(x|θ), consisting of a functional form determined by the underlying random phenomenon and a set of characteristic parameters, θ. In general, given the complete f(x|θ), probability theory provides the tools for computing the probabilities of observing various occurrences $\{x_i\}_{i=1}^n$.

FIGURE 12.1: Relating the tools of Probability, Statistics and Design of Experiments to the concepts of Population and Sample
For any specific problem, however, only a finite set of observations $\{x_i\}_{i=1}^n$ is available; the specific characterizing parameter vector, θ, is unknown and must first be determined before probabilistic analysis can be performed. By considering the available finite set of observations as a sample from the (idealized) complete observational characterization of the random variable X, i.e., the population, the tools of Statistics make it possible to characterize the population fully (determine the functional form and estimate the unknown parameter θ in the pdf f(x|θ)) from information contained in the sample.

Finally, because the population is to be characterized on the basis of the finite sample, Design of Experiments provides the tools for ensuring that the sample is indeed representative of the population, so that the information required to characterize the population adequately is contained in the sample.

12.1.4

Statistical Analysis

It is customary to partition Statistics into two categories


1. Analyzing data (sometimes referred to as Descriptive Statistics), and,
2. Drawing inference from data (Inductive or Inferential Statistics),
although it is clear from above that basic to these partitions in a fundamental
manner is a third category:
3. Generating data (Statistical Design of Experiments).


Descriptive Statistics
In descriptive statistics, the primary concern is the presentation, organization and summarization of sample data, with emphasis on the extraction of the information contained in a specific sample. Because the issue of generalization from sample to population does not arise, no explicit consideration is given to whether or not the sample is representative of the population. There are two primary aspects:

1. Graphical: The use of various graphical means of organizing, presenting and summarizing the data;
2. Numerical: The use of numerical measures and characteristic values to summarize data.

Such an approach to data analysis is often referred to as Exploratory Data Analysis (EDA)¹.
Inductive Statistics
Also known variously as Statistical Inference or Inferential Statistics, it is primarily concerned with drawing inference about the population from sample data. As such, the focus is on generalization from the current data set (sample) to the larger population. The two primary aspects of inductive statistics are:

1. Estimation: Determining from sample data appropriate values for one or more unknown parameters of the assumed population description, the pdf;
2. Hypothesis Testing: Testing the validity of the assumed population model and the reliability of the parameter estimates.

Statistical Design of Experiments
Whether for descriptive or inductive purposes, how sample data sets are obtained will affect the information they contain, and hence will influence our ability to generalize such information to the population with confidence. Design of Experiments (DoE) involves methods for efficient experimental procedures, data collection and analysis, so that the sample may be considered adequate. The assumptions underlying statistical analysis are thus rendered reasonable, so that the extracted information may be as broadly applicable as possible.

¹ J. Tukey, (1977). Exploratory Data Analysis, Addison-Wesley.


The rest of this chapter is devoted to an overview treatment of Descriptive Statistics; a detailed discussion of Inductive Statistics follows in Chapters 13-18; and the central concepts and methodologies of Design of Experiments are discussed in Chapter 19. Our treatment of Statistics concludes with case studies in Chapter 20.

12.2 Variable and Data Types

Before embarking on a study of statistical techniques, it is important first to have a working knowledge of variable and data types.
Variables (and the data associated with them) are generally of two types:

1. Quantitative Variables are numerical in nature, generating observations in the form of numbers that can be real-valued (in which case the variable as well as the data are said to be continuous), or integers, in which case they are said to be discrete. The yield variables and data of Chapter 1 are continuous; the inclusions variable and data are discrete.

2. Qualitative Variables are non-numeric and hence non-quantitative; the associated data are non-numeric observations. These variables tend to be descriptive, employing text, words, or symbols to convey a sense of general meaning rather than a precise measure of a quantity. Qualitative variables are sometimes referred to as "Categorical" variables because they typically describe categories: for example, if the variable is the "type" of an object under study, then the data, {Catalyst A, Catalyst B, Catalyst C} or {Fertilizer, No-Fertilizer}, are qualitative; if the variable is the "color" of an object, then {Red, Green, Blue, Yellow} is a possible data set. The variable "opinion" regarding the quality of a product may be associated with such entities as {Poor, Fair, Good, Better, Best}.
To be sure, most scientific investigations involve quantitative variables and data almost exclusively; but by no means should this be construed as implying that qualitative data are therefore unimportant in Science and Engineering. Apart from studies involving qualitative variables directly (Drug A versus a Placebo; Process A or Process B; Machine 1, Machine 2 and Machine 3), naturally quantitative data are sometimes deliberately represented in qualitative form when this is required by the study. For example, even though one can quantify the "Ultimate Tensile Strength" (UTS) of a class of polymer resins with a precise numerical value (measured in MPa), it is actually quite common for a customer who is purchasing the product for further processing to classify the product simply as "Acceptable" if 8 MPa < UTS < 10 MPa, and "Unacceptable" otherwise. The quantitative UTS data has been converted to qualitative data consisting of two categories.

Similarly, while a quantitative observation of the scores on an examination of 3 college roommates can be recorded as the set of numbers {98%, 87%, 63%}, a qualitative record of the same information may be presented as {High, Medium, Low} for each of the scores respectively, if the study is concerned more with categorizing scores than with quantifying them numerically.
The converse is also true: intrinsically qualitative data can sometimes be represented with numerical values. For example, opinion surveys often involve presenting a statement (such as: "Chemical engineering professors have no sense of humor") to various individuals, each of whom is then asked to state his/her opinion by selecting from the following options: {1 = Strongly Agree; 2 = Agree; 3 = Don't Know; 4 = Disagree; 5 = Strongly Disagree}. The information gathered is intrinsically qualitative (a record of subjective opinion), and the assignment of the integers 1-5 is somewhat arbitrary; the same objective could have been achieved by assigning the numbers -2, -1, 0, 1, 2, or any other set of 5 distinct numbers. This brings us to another set of terms used to classify data.
1. Nominal data have no order; they merely provide names or some other such identifying labels for various categories. For example, the set of weather conditions {Drizzle, Hail, Rain, Sleet, Snow, Thunderstorm} is nominal, as is the following set of manufactured personal care products: {Soaps, Deodorant, Toothpaste, Shampoo}. There is no order implied or intended in such listings.

2. Ordinal data have order, but the interval between each entry is meaningless. For example, the familiar classification {Small, Medium, Large} is ordinal; it is understood to indicate an increase in magnitude from the first to the last, but the difference between "Small" and "Medium" or that between "Medium" and "Large" is unspecified, neither is there any intention to indicate that one difference is of the same magnitude as the other. Similarly, the set of opinion poll options given above, {1 = Strongly Agree; 2 = Agree; 3 = Don't Know; 4 = Disagree; 5 = Strongly Disagree}, is ordinal, indicating a generally declining level of "agreement" (equivalently, an increasing level of "disagreement") of the subject with the validity of the statement in question. Note that the assigned numbers are meant to indicate no more than simple order: there is no intention that the "distance" between one entry and the other means anything.
Finally, it must be noted that while such nominal/ordinal classifications are valid in general, unusual exceptions sometimes arise in special fields of study. For example, the set of colors {Red, Orange, Yellow, Green, Blue, Indigo, Violet} is, under normal circumstances, entirely nominal; in Physics (Optics, specifically), however, this set is in fact ordinal, indicating the order of the components of visible light. The context in which such categorical data is presented usually makes the appropriate classification clear.


TABLE 12.1: Number and Type of injuries incurred by welders in the USA from 1980-1989

Type of Injury    Total Number    Percent (%)
Eye               242             40.3
Hand              130             21.7
Arm               64              10.7
Back              117             19.5
Other             47              7.8

12.3 Graphical Methods of Descriptive Statistics

12.3.1 Bar Charts and Pie Charts

Consider the data shown in Table 12.1, a compilation of the number of recordable injuries incurred by members of the Welders Association of America during the decade from 1980-1989, categorized by type. The variable "Type of Injury" is categorical, and hence qualitative; in addition, it is nominal because there is no particular order to the list. The other two variables, "Total Number" and "Percent," are quantitative: "Total Number" is discrete, since it is a count; "Percent" is continuous by virtue of being a ratio of integers.

This data set can be represented graphically in several different ways. A bar chart (or bar graph) is a means by which the numbers associated with each category are represented by rectangular bars whose heights are proportional to the respective numbers. Such a chart is shown in Fig 12.2 for the total number of injuries associated with each injury type. This is a vertical bar chart because the bars are oriented vertically, as opposed to horizontal charts where the bars are oriented horizontally.
Note that because the "Type" variable is nominal, there is no particular order in which the data should be represented. Under these circumstances, to avoid the somewhat haphazard impression one gets from Fig 12.2, it is often recommended to order the bars in some meaningful manner, typically by ranking the plotted values in increasing or decreasing order. When the values in the default Fig 12.2 plot are ranked in decreasing order, one obtains Fig 12.3. One advantage of such a rearrangement is that, should there arise an interest in instituting a program to prevent injuries, and one can only focus on a few injury types at a time, the figure provides an easy visual representation that can be used objectively to convey the logic behind the prioritization. For example, the figure indicates that Eye, Hand, and Back injuries make up most of the injuries, with the majority contributed by Eye injuries, so that if there are resources for tackling only one injury type, the logical choice is obvious.
FIGURE 12.2: Bar chart of welding injuries from Table 12.1

FIGURE 12.3: Bar chart of welding injuries arranged in decreasing order of number of injuries


FIGURE 12.4: Pareto chart of welding injuries (cumulative %: Eye 40.3, Hand 62.0, Back 81.5, Arm 92.2, Other 100.0)

This sort of analysis involving the relative contribution of various categories (or "factors") to a variable of interest is known as a Pareto Analysis, named after Vilfredo Federico Pareto (1848-1923), the Italian economist and sociologist best known for his discovery that 80% of Italy's wealth was owned by 20% of the population. This observation has since been generalized to other phenomena in which 80% of some category of "effects" is due to 20% of the "causes." It has become particularly useful in industrial quality assurance, where the primary concern is to determine the "vital few" causes of poor quality upon which to concentrate quality improvement programs.
When a rank-ordered bar chart of the sort in Fig 12.3 is accompanied by a plot of the cumulative total contribution from each category, the result is known as a Pareto chart. Such a chart for the welding injury data is shown in Fig 12.4.
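For readers who wish to construct such a chart themselves, the following is a minimal sketch (ours, assuming the Matplotlib library is available; the counts are those of Table 12.1):

    import matplotlib.pyplot as plt

    # Welding injury data from Table 12.1
    counts = {"Eye": 242, "Hand": 130, "Arm": 64, "Back": 117, "Other": 47}

    # Rank the categories in decreasing order of frequency
    items = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    labels = [k for k, _ in items]
    values = [v for _, v in items]
    total = sum(values)

    # Cumulative percentage contribution, category by category
    cum_pct = [100 * sum(values[:i + 1]) / total for i in range(len(values))]

    fig, ax1 = plt.subplots()
    ax1.bar(labels, values)
    ax1.set_xlabel("Type of Injury")
    ax1.set_ylabel("Total Number")

    ax2 = ax1.twinx()  # second y-axis for the cumulative % line
    ax2.plot(labels, cum_pct, marker="o", color="red")
    ax2.set_ylabel("Cumulative %")
    ax2.set_ylim(0, 105)

    plt.show()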
An alternative to using bars to represent frequencies of categorical data is the pie chart, a graphical technique that is very popular in journalism, print and electronic, and in business. With a pie chart, the categories are represented as wedges of a "pie" or sectors of a circle, whose areas (or, equivalently, arc lengths or central angles) are proportional to the values (or relative frequencies) associated with each category. For example, the pie chart for the welding injury data is shown in Fig 12.5.
Several variations of this basic chart form exist (exploded pie charts, perspective pie charts, etc.), but we will not discuss any of these here. The pie chart is used less often in engineering and science, primarily because comparing areas visually is more difficult than comparing lengths. (The items in Table 12.1 are easier to compare visually in Fig 12.2 than in Fig 12.5.) It is also more difficult to compare information in different pie charts, as opposed to the same information presented in different bar charts. The pie chart is best used when one is primarily concerned with comparing a particular category with the entire group (for example, that eye injuries contribute the largest share to the collection of welding injuries is very clearly depicted in Fig 12.5).

FIGURE 12.5: Pie chart of welding injuries

The following example uses actual data to illustrate the strengths and weaknesses of the bar and pie charts.

TABLE 12.2: Frozen Ready meals in France, in 2002

Type of Food             Percent
French Regional Meals    29.5
Cooked Fish              25.0
Pasta Based Meals        16.6
European Meals           9.4
Side Dishes              6.8
Cooked Seafood           6.3
Exotic                   4.7
Cooked Snails            1.7
Example 12.2 COMPARING BAR CHARTS AND PIE CHARTS: FROZEN READY MEALS SOLD IN FRANCE IN 2002
The data in Table 12.2, from the Global Foods Almanac, October, 2002, summarizes 2002 sales information about frozen ready meals in France. Generate a bar chart and a pie chart of the data and briefly compare the charts.

FIGURE 12.6: Bar Chart of frozen ready meals sold in France in 2002

Solution:
The bar chart is shown in Fig 12.6, the pie chart in Fig 12.7. Keeping in mind that they are primarily used for visual communication of numerical facts, here are the most salient aspects of these charts: (i) It is not very easy to see from the pie chart wedges (without the added numbers) which category, French Regional Meals or Cooked Fish, accounts for the larger proportion of the overall sales, since both wedges look about the same in size; with the bar chart, there is no question, even if the bars in question were not situated side-by-side. (ii) However, if one is familiar with reading pie charts, it is far easier to see that the Cooked Fish category accounts for precisely a quarter of the total sales (the right angle subtending the Cooked Fish wedge is the key visual cue); this fact is not at all obvious from the bar chart, which is much less effective at conveying "relative-to-the-whole" information. (iii) French Regional Meals sold approximately 20 times more than Cooked Snails; even with the attached numbers, this fact is much easier to appreciate in the bar chart than in the pie chart.

Thus, observe that while the pie chart excels at conveying "relative-to-the-whole" information (especially if the relative proportions in question are 25%, 50%, or 75%, entities whose angular representations are easily recognizable by the unaided eye), the bar chart is weak in this regard. Conversely, the bar chart conveys "relative-to-one-another" information far better than the pie chart.

FIGURE 12.7: Pie Chart of frozen ready meals sold in France in 2002

12.3.2 Frequency Distributions

As previewed in Chapter 1, even quantitative data can be quite uninformative if presented in raw form, as numbers in a table. One of the first steps in making sense of the information contained in raw quantitative data is to rearrange the data, dividing them into smaller groups, or classes (also known as "bins"), and attaching to each group a number representing how many of the raw data belong to that group (i.e., how frequently members of that group occur in the raw data set). The result is a frequency distribution representation of the data. When the frequency associated with each group i is normalized by the total number of data points, n, in the sample set, we obtain the relative frequency, fi, for each group. For example, a specific reorganization of the yield data, YA, presented in Chapter 1 gives rise to the frequency distribution shown in Table 1.3 of Chapter 1 and reproduced here (in Table 12.3) for easy reference. Compared to the raw data in Table 1.1, this is a more compact and more informative representation of the original data. Of course, such compactness is achieved at the expense of some details, but this loss is more than compensated for by a certain enhanced clarity with which the true character of the random variation begins to emerge from this frequency distribution. For example, the frequency distribution shows clearly how much of the data clusters around the group [74.51-75.50], an important characteristic that is not readily evident from the raw data table.

A plot of this frequency distribution using vertical bars whose heights are proportional to the frequencies (or, equivalently, the relative frequencies) is known as a histogram, with the one corresponding to the YA frequency distribution shown in Fig 1.1 and repeated here (in Fig 12.8) for ease of reference.

TABLE 12.3: Group classification and frequencies for YA data

YA group      Frequency    Relative Frequency, fi
71.51-72.50   1            0.02
72.51-73.50   2            0.04
73.51-74.50   9            0.18
74.51-75.50   17           0.34
75.51-76.50   7            0.14
76.51-77.50   8            0.16
77.51-78.50   5            0.10
78.51-79.50   1            0.02
TOTAL         50           1.00

FIGURE 12.8: Histogram for YA data of Chapter 1


Although by far the most popular, the histogram is not the only graphical means of representing frequency distributions. If, instead of using adjoining bars to represent group frequencies, we employ a Cartesian plot in which each group is represented by its center value on the x-axis and the corresponding frequency plotted on the y-axis, with the points connected by straight lines, the result is known as a frequency polygon, as shown in Fig 12.9 for the YA data. This is only a slightly different rendering of the information contained in the histogram. Fig 12.10 shows the corresponding frequency polygon for the YB data of Chapter 1 (whose histogram is shown in Fig 1.4).

FIGURE 12.10: Frequency Polygon of YB data of Chapter 1

As alluded to in Chapter 1, such graphical representations of the data provide an empirical and partial, as opposed to complete, approximation to the true underlying distribution; but they show features not immediately apparent from the raw data. Because data sets are incomplete samples, the corresponding frequency distributions show irregularities; but, as the sample size n → ∞, these irregularities gradually diminish, so that these empirical distributions ultimately approach the complete population distribution of the random variable in question. These facts inform the "frequentist" approach to statistics and data analysis.
Some final points to note about frequency distributions: It should be clear that, to be meaningful, histograms must be based on an ample amount of data; only then will there be a sufficient number of groups, with enough members per group, to display the data distribution meaningfully. As such, whenever possible, one should avoid employing histograms to display data sets containing fewer than 15-20 data points.

It should also be clear that the choice of bin size will affect the general appearance of the resulting histogram. Bins that are too wide generate fewer groups, and the resulting histograms cannot adequately reveal the true distributional characteristics of the data. As an extreme example, a bin size covering the entire range of a data set containing a total of n data points will produce a histogram consisting of a single vertical bar of height n. Of course, this is totally uninformative, at least no more informative than the raw data table, because the entire data set will remain confined to this one group. At the other extreme, if the bin size is so small that, with the exception of exactly identical data entries, each data point fits into a group all by itself, the result is an equally uninformative histogram, uninformative for a complementary reason: this time, the entire data set is stretched out horizontally into a collection of n bars, all of the same unit height. Somewhere between these two obviously untenable extremes lies an acceptable bin size, but there are no hard-and-fast rules for choosing it. The rule of thumb is that any choice resulting in 8-10 groups is considered acceptable.

In practice, transforming raw data sets into frequency distributions and the accompanying graphical representations is almost always carried out by computer programs such as MINITAB, Matlab, SAS, etc. And these software packages are preprogrammed with algorithms that automatically choose reasonable bin sizes for each data set.


FIGURE 12.9: Frequency Polygon of YA data of Chapter 1


The traditional recommendation for the number of intervals, k, to use in representing a data sample of size n is the following, from Sturges, 1926²:

$$k = 1 + 3.3 \log_{10} n \qquad (12.6)$$

Thus, for instance, for the yield data, with n = 50, the recommendation will be 7 intervals. The histogram in Fig 12.8, generated automatically with MINITAB, uses 8 intervals.

² Sturges, H.A., (1926). The choice of a class interval, J. Am. Stat. Assoc., 21, 65-66.
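The following sketch (ours, assuming the NumPy library; the simulated data merely stand in for a yield-type data set) applies the recommendation of Eq (12.6) and tabulates the resulting frequency distribution:

    import math
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=75.5, scale=1.4, size=50)  # stand-in for YA-type data

    # Sturges' rule, Eq (12.6): k = 1 + 3.3 log10(n)
    n = len(data)
    k = round(1 + 3.3 * math.log10(n))
    print(f"n = {n} observations -> k = {k} intervals")

    # Frequency distribution using k equal-width bins
    freq, edges = np.histogram(data, bins=k)
    for f, lo, hi in zip(freq, edges[:-1], edges[1:]):
        print(f"[{lo:6.2f}, {hi:6.2f}): frequency {f}, relative frequency {f / n:.2f}")

For n = 50 this yields k = 7, in agreement with the recommendation quoted above.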

12.3.3 Box Plots

Also known as a "box-and-whisker" plot, the box plot was proposed in 1977 by the American statistician John Tukey (1915-2000) as a means of presenting, in a single compact plot, the following five key characteristics of a data set:
1. The minimum (smallest valued observation);
2. The lower (or first) quartile, Q1 (25% of the data values are less than or equal to this value);
3. The median, or middle value (50% of the data values are less than or equal to this value, and 50% are greater than or equal to it);
4. The upper (or third) quartile, Q3 (75% of the data values are less than or equal to this value; equivalently, 25% of the data are greater than or equal to this value);



5. The maximum (largest valued observation).
These items are also known as the five-number summary of a data set. What gives the plot its name is how these 5 characterizing quantities are depicted in the form of a rectangular box and two "whiskers" extending from the two ends, as described below:

When oriented vertically, the bottom of the box represents the first quartile, Q1, the top of the box is the third quartile, Q3, while a middle line inside the box represents the median. This vertical box therefore encompasses the lower and upper quartiles of the data set, so that its length is the interquartile range, IQR = Q3 - Q1.
How the data minimum and maximum are depicted in this graphical representation is another of its attractive features. Based on the supposition that, in normally distributed data sets, extreme values (minimum and maximum) hardly fall outside a region that is 1.5 times the interquartile range from the lower or upper quartile, box plots employ an upper and a lower limit defined as follows:

$$UL = Q_3 + 1.5(Q_3 - Q_1) \qquad (12.7)$$
$$LL = Q_1 - 1.5(Q_3 - Q_1) \qquad (12.8)$$

With this definition, the lower whisker is drawn as a line extending from the bottom of the box (i.e., from Q1) to the smallest data value, so long as it falls within the lower limit. In this case, the end of the bottom whisker is therefore the data minimum. Any data value that falls outside the lower limit is flagged as a potential "outlier", in this case an unusually small observation, and represented with an asterisk. The upper whisker is drawn similarly: from the top of the box, Q3, to the largest data value within the upper limit.


FIGURE 12.11: Boxplot of the chemical process yield data YA, YB of Chapter 1


All data values exceeding the upper limit are also flagged as outliers, unusually large observations in this case, and also represented with an asterisk.

Figure 12.11 shows box plots for the chemical process yield data sets YA and YB introduced in Chapter 1. Observe how box plots are particularly adept at displaying data sets visually without overwhelming detail. In one compact plot, they show clearly the data range, symmetry, central location, and concentration around the center. They are also good for quick visual comparisons of data sets. Thus, for example, the details contained in the yield data sets are compressed succinctly in these plots, showing, among other things, that both data sets are essentially symmetric about their respective centers, which is higher for YA than for YB; the minimum for YB is substantially lower than that for YA; in fact, the "central" 50% of the YA data, the rectangular box on the left, is much more compact and lies entirely higher than the corresponding "central" 50% for YB. Also, the overall comparative compactness of the YA plot indicates visually how this data set is less variable overall than YB. Of course, all of this information is available from the histogram, but while a relatively large number of observations is required in order for a histogram to be meaningful, a data set consisting of as few as 5 observations can be represented meaningfully in a box plot. Thus, while box plots are useful for all quantitative data, they are especially good for small data sets.
Figure 12.12 shows on the left a box plot of 30 observations from a random variable Y ~ N(0, 1); shown on the right is the same data set with a single addition of the number 3.0 as the 31st observation. Note that this value is flagged as unusually high for this data set.

FIGURE 12.12: Boxplot of random N(0,1) data: original set, and with added outlier
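The computations behind such a plot are easily reproduced. The sketch below (ours, assuming NumPy; quartile conventions differ slightly among software packages, so the numbers may not match MINITAB's exactly) computes the five-number summary and the limits of Eqs (12.7) and (12.8) for a small sample:

    import numpy as np

    data = np.array([0.5, -0.2, 1.1, 0.3, -1.4, 0.7, -0.6, 0.1, 2.9, -0.9])

    # Five-number summary
    q1, med, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1

    # Whisker limits, Eqs (12.7) and (12.8)
    upper_limit = q3 + 1.5 * iqr
    lower_limit = q1 - 1.5 * iqr

    # Points outside the limits are flagged as potential outliers
    outliers = data[(data < lower_limit) | (data > upper_limit)]
    print(f"Min = {data.min()}, Q1 = {q1}, median = {med}, Q3 = {q3}, Max = {data.max()}")
    print(f"LL = {lower_limit:.3f}, UL = {upper_limit:.3f}, outliers: {outliers}")

For this particular sample, the value 2.9 exceeds the upper limit and is flagged, just as the 31st observation is in Fig 12.12.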
Example 12.3 RAISIN-DISPENSING MACHINES

TABLE 12.4: Number of raisins dispensed into trial-sized Raisin Bran cereal boxes by five different machines

Machine 1   Machine 2   Machine 3   Machine 4   Machine 5
27          17          13          7           15
21          12          7           4           19
24          14          11          7           19
15          7           9           7           24
33          14          12          12          10
23          16          18          18          20

In a study to determine the performance of 5 different processing machines used to add raisins to trial-size Raisin Bran cereal boxes, 6 sample boxes are taken at random from each machine's production line and the number of raisins in each box counted. The result is displayed in Table 12.4.

Because the number of samples for each machine is so few, individual histograms of each machine's data will not be meaningful. Instead, obtain five individual box plots for this data set. What do these plots suggest about the equality of how these machines dispense raisins into each box?

Solution:
The box plots are shown in Fig 12.13, from which it appears as if there are some noticeable differences in how this group of machines dispenses raisins. In particular, the plots seem to suggest that Machine 1 (and possibly Machine 5) may be dispensing more raisins than the others, while the other three machines appear to be similar in the way they dispense raisins.

These somewhat informal and "descriptive" statements can be made more rigorous and quantitative using techniques for drawing precise statistical inferences about the significance (or otherwise) of any observed differences. Such techniques are discussed in greater detail in Chapter 15.


FIGURE 12.13: Box plot of raisins dispensed by five different machines

12.3.4 Scatter Plots

When the values of one variable are plotted on the y-axis versus the values of another variable on the x-axis, the result is known as a scatter plot. The plot is so called because, unless the one variable is perfectly correlated with the other, the data appear "scattered" in the plot. Such plots provide visual clues as to whether or not there truly is a relationship between the variables, and, if there is one, what sort of relationship: strong or weak, linear or nonlinear, etc. Although not necessarily always the case, the variable plotted on the y-axis is usually the one that may potentially be "responding" to the other variable, which will be plotted on the x-axis. It is also possible that a causal relationship may not exist between the plotted variables, or, if there is one, it may not always be clear which variable is responding and which is causing the response. Because these plots are truly just exploratory, care should be taken not to over-interpret them; it is especially important not to jump to conclusions that any observed apparent relationship implies causality.
A series of scatter plots is shown below, beginning with Fig 12.14, a plot of the cranial circumference and corresponding finger length for various individuals. Believe it or not, there once was a time when people speculated that these two variables correlate. This plot shows that, at least for the individuals involved in the particular study generating the data, there does not appear to be any clear relationship between these two variables. Even if the plot had indicated a strong relationship between the variables, observe that in this case, neither of these two variables can be rightly considered as dependent on the other; thus, the choice of which variable to plot on which axis is purely arbitrary.
FIGURE 12.14: Scatter plot of cranial circumference versus finger length: The plot shows no real relationship between these variables

Next, consider the data shown in Table 12.5: city and highway gasoline mileage ratings, in miles per gallon (mpg), for 20 types of two-seater automobiles, complete with engine characteristics: capacity (in liters) and number of cylinders. First, a plot of city gas mileage versus highway gas mileage, shown in Fig 12.15, indicates a very strong, positive linear relationship between these two variables. However, even though related, it is clear that this relationship is not causal, in the sense that one cannot independently and directly manipulate, say, city gas mileage and as a direct consequence thereby cause highway gas mileage to change. Rather, both variables depend in common on other factors that can be independently and directly manipulated (e.g., engine capacity).
Fig 12.16 shows a plot of highway gas mileage against engine capacity, indicating an approximately linear and negative relationship. Observe that, according to the thermodynamics of internal combustion engines, the physics of work done by applying force to move massive objects, and the fact that larger engines are normally required for bigger cars, it is entirely logical that smaller engines correlate with higher highway gas mileage. Fig 12.17 shows a corresponding plot of highway gas mileage versus the number of cylinders. This plot also indicates a similar negative, and somewhat linear, relationship between the variables. (Because city and highway gas mileage values are so strongly correlated, similar plots of the city mileage data should show characteristics similar to the corresponding highway mileage plots. See Exercise 12.15.)
TABLE 12.5: Gasoline mileage ratings for a collection of two-seater automobiles

     Car                        Eng Capacity   # Cylinders   City   Highway
     Type and Model             (Liters)                     mpg    mpg
1    Aston Martin V8 Vantage    4.3            8             13     20
2    Audi R8                    4.2            8             13     19
3    Audi TT Roadster           2.0            4             22     29
4    BMW Z4 3.0i                3.0            6             19     28
5    BMW Z4 Roadster            3.2            6             15     23
6    Bugatti Veyron             8.0            16            8      14
7    Cadillac XLR               4.4            8             14     21
8    Chevrolet Corvette         7.0            8             15     24
9    Dodge Viper                8.4            10            13     22
10   Ferrari 599 GTB            5.9            12            11     15
11   Honda S2000                2.2            4             18     25
12   Lamborghini Murcielago     6.5            12            8      13
13   Lotus Elise/Exige          1.8            4             21     27
14   Mazda MX5                  2.0            4             21     28
15   Mercedes Benz SL65 AMG     6.0            12            11     18
16   Nissan 350Z Roadster       3.5            6             17     24
17   Pontiac Solstice           2.4            4             19     25
18   Porsche Boxster-S          3.4            6             18     25
19   Porsche Cayman             2.7            6             19     28
20   Saturn SKY                 2.0            4             19     28

FIGURE 12.15: Scatter plot of city gas mileage versus highway gas mileage for various two-seater automobiles: The plot shows a strong positive linear relationship, but no causality is implied.

FIGURE 12.16: Scatter plot of highway gas mileage versus engine capacity for various two-seater automobiles: The plot shows a negative linear relationship. Note the two unusually high mileage values associated with engine capacities 7.0 and 8.4 liters, identified as belonging to the Chevrolet Corvette and the Dodge Viper, respectively.

FIGURE 12.17: Scatter plot of highway gas mileage versus number of cylinders for various two-seater automobiles: The plot shows a negative linear relationship.


Scatter plots are also very good at pointing out data that might appear inconsistent with others in the group. For example, in Fig 12.16, two data points for engine capacities 7 liters and 8.4 liters are associated with highway gas mileage values of 24 and 22 miles per gallon, respectively, values that appear unusually high for such large engines, especially when compared to corresponding values for other automobiles with engines of similar volume. An inspection of the data table indicates that these values belong to the Chevrolet Corvette and the Dodge Viper models respectively, automobiles whose bodies are constructed from special fiberglass composites, resulting in vehicles that are generally lighter in weight than others in their class. The scatter plot correctly shows these data to be inconsistent with the rest, and we are able to provide a logical reason for the unusually high gas mileage: lighter cars, even those with large engines, generally get better gasoline mileage.
When the variable on the x-axis is time, the plot provides an indication of any trends that may exist in the y variable. Examples of such plots include: monthly sales volume of a particular item; daily closing values of stocks on the stock exchange; monthly number of drunk-driving arrests on a municipal road, etc. These are all plots that indicate time trends in the variables of interest. Fig 12.18 shows a plot of the populations of the United States of America as determined by the decade-by-decade census from 1790 to 2000. This plot shows the sort of exponential growth trend that is typical of populations of growing organisms. We revisit this data set in Chapter 20.

FIGURE 12.18: Scatter plot of US population every ten years since the 1790 census versus census year: The plot shows a strong non-linear trend, with very little scatter, indicative of the systematic, approximately exponential growth

12.4 Numerical Descriptions

Characterizing data sets by empirical frequency distributions, while useful for graphically condensing the information contained in the data into a relatively small number of groups, is not particularly useful for carrying out quantitative comparisons. Such comparisons require quantitative numerical descriptors of the data characteristics, typically measures of (i) central tendency (or data location); (ii) variability (or spread); (iii) skewness; and (iv) peakedness. It is not coincidental that these common numerical descriptors align perfectly with the characteristic moments of theoretical distributions discussed in Chapter 4. In statistical analysis, these numerical descriptors are computed from sample data as single numbers used to represent various aspects of the entire data set; they are therefore numerical approximations of the corresponding true but unknown population distribution parameters. Given sample data, all such numerical descriptors are, of course, routinely computed by various statistical analysis software packages; the following discussion simply provides some perspective on the most common of these descriptors. In particular, we demonstrate the sense in which they are to be considered as appropriate measures, and hence clarify the context in which they are best utilized.

12.4.1 Theoretical Measures of Central Tendency

Given a sample x1, x2, ..., xn, obtained as experimental observations of a random variable, X, we wish to consider how best to choose a number, c, to represent the "central location" of this random variable and its n observed realizations. Because X is a random variable, it seems entirely reasonable to choose this number such that the expectation of some function of the deviation term (X - c) is minimized. Let us therefore define

$$\psi_n = E\left[|X - c|^n\right] \qquad (12.9)$$

and seek values of c that will minimize $\psi_n$ for various values of n.


The Mean
When n = 2, the objective function to be minimized in Eq (12.9) becomes

$$\psi_2 = E\left[(X - c)^2\right] \qquad (12.10)$$

which expands out to give:

$$\psi_2 = E\left[X^2 - 2Xc + c^2\right] = E[X^2] - 2cE[X] + c^2 \qquad (12.11)$$


because c is a constant. Employing the standard tools of calculus, differentiating in Eq (12.11) with respect to c, setting the result to zero, and solving for c yields:

$$\frac{d\psi_2}{dc} = -2E[X] + 2c = 0 \qquad (12.12)$$

giving the immediate (and not so surprising) result that

$$c = E[X] = \mu \qquad (12.13)$$

Thus, the mean is the best single representative of the theoretical "centroid" of a random variable if we are concerned with minimizing the mean squared deviation from all possible values of X.
The Median
When n = 1 in Eq (12.9), we wish to find c to minimize the mean absolute deviation between it and the possible values of the random variable, X. For the continuous random variable, by definition,

$$\psi_1 = \int_{-\infty}^{\infty} |x - c| f(x)\,dx \qquad (12.14)$$

so that

$$\frac{\partial \psi_1}{\partial c} = \frac{\partial}{\partial c}\int_{-\infty}^{\infty} |x - c| f(x)\,dx = 0 \qquad (12.15)$$

is the equation to be solved to find the desired c. Now, because |x - c| is discontinuous at the point x = c, the indicated differentiation must be carried out with caution; in particular, we note that

$$\frac{\partial}{\partial c}|x - c| = \begin{cases} -1; & x > c \\ 1; & x < c \end{cases} \qquad (12.16)$$

As a result, Eq (12.15) becomes:

$$0 = \int_{-\infty}^{c} f(x)\,dx - \int_{c}^{\infty} f(x)\,dx \qquad (12.17)$$

where the first term represents the integral over the region {x : x < c} and the second term is the integral over the remaining region {x : x > c}. This equation has the obvious solution:

$$\int_{-\infty}^{c} f(x)\,dx = \int_{c}^{\infty} f(x)\,dx, \qquad (12.18)$$

which is either immediately recognized as the definition of c as the median of the pdf f(x), or, by explicitly introducing the cumulative distribution function, F(x), reduces to

$$F(c) = 1 - F(c), \qquad (12.19)$$


which now yields

$$F(c) = 0.5; \quad c = x_m \qquad (12.20)$$

where $x_m$ is the median. It is left as an exercise for the reader (see Exercise 12.18) to establish the result for discrete X. Thus, the median is the central representative of X that gives the smallest mean absolute deviation from all possible values of X.
The Mode
It is shown in the Appendix at the end of this chapter that when n = ∞,

$$c = x^* \qquad (12.21)$$

where $x^*$ is the mode of the pdf f(x), minimizes the objective function in Eq (12.9). The mode is therefore the central representative of X that provides the smallest of the worst possible deviations from all possible values of X.
This discussion puts into perspective the three most popular measures of central location of a random variable, the mean, median, and mode: their individual theoretical properties and what makes each one a "good" measure. Theoretically, for all symmetric distributions, the mean, mode and median all coincide; they differ (sometimes significantly) for nonsymmetric distributions. The sample data equivalents of these population parameters are obtained as discussed next.
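These results are easy to verify numerically. The sketch below (ours, assuming NumPy) scans a grid of candidate values of c for a small, deliberately skewed sample, and confirms that the n = 2 criterion of Eq (12.9) is minimized at the sample mean while the n = 1 criterion is minimized at the sample median:

    import numpy as np

    x = np.array([1.0, 2.0, 2.5, 4.0, 10.0])  # deliberately skewed sample

    c_grid = np.linspace(0, 12, 2401)
    msd = [np.mean((x - c) ** 2) for c in c_grid]   # mean squared deviation
    mad = [np.mean(np.abs(x - c)) for c in c_grid]  # mean absolute deviation

    print(c_grid[np.argmin(msd)])  # 3.9, the sample mean
    print(c_grid[np.argmin(mad)])  # 2.5, the sample median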

12.4.2 Measures of Central Tendency: Sample Equivalents

Sample Mean
From a sample x1, x2, x3, ..., xn, the sample mean, or the sample average, $\bar{x}$, is defined as:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (12.22)$$
In terms of the just-concluded theoretical considerations, this implies that, of all possible candidate values, c, the sample average is that value which minimizes the mean squared error between the observed realizations of X and c. This quantity is sometimes referred to as the arithmetic mean, to distinguish it from other means. For example, the geometric mean, $\bar{x}_g$, defined as

$$\bar{x}_g = \left(\prod_{i=1}^{n} x_i\right)^{1/n}, \qquad (12.23)$$

is sometimes preferred for representing the centrum of data from skewed distributions, such as the lognormal distribution. Observe that taking logarithms in Eq (12.23) establishes that the log of the geometric mean is the arithmetic mean of the log-transformed data.


The harmonic mean, $\bar{x}_h$, on the other hand, defined as:

$$\frac{1}{\bar{x}_h} = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{x_i}, \qquad (12.24)$$

is more appropriate for data involving rates, ratios, or any phenomenon where the true variable of concern occurs naturally as a reciprocal entity. The classic example is data involving velocities: if a particle covers a fixed distance, s, at varying velocities x1 and x2, from elementary physics we are able to deduce that the average velocity with which it covers this distance is not the arithmetic mean (x1 + x2)/2, but the harmonic mean. This, of course, is because, with the distance fixed, the consequence of the variable velocity is a commensurate variation in the time to cover the distance, a reciprocal of the velocity. Note that if the time of travel, as opposed to the distance, is the fixed quantity, then the average velocity will be the arithmetic mean.
In general the following relationship holds between these various sample
averages:

(12.25)
x
h < xg < x
Note how, by denition, the arithmetic mean is susceptible to undue inuence
from extremely large observations; by the same token, the reverse is the case
with the harmonic mean which is susceptible to the undue inuence of unusually small observations (whose reciprocals become unusually large). Such
inuences are muted with the geometric mean.
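A minimal Python sketch (not part of the original text) of Eqs (12.22)-(12.24), using a hypothetical positive-valued sample:

```python
import math

def arithmetic_mean(x):
    return sum(x) / len(x)

def geometric_mean(x):
    # log of the geometric mean = arithmetic mean of the logs (cf. Eq 12.23)
    return math.exp(sum(math.log(v) for v in x) / len(x))

def harmonic_mean(x):
    # reciprocal of the arithmetic mean of the reciprocals (Eq 12.24)
    return len(x) / sum(1.0 / v for v in x)

x = [2.0, 4.0, 4.0, 8.0, 16.0]   # hypothetical data
xh, xg, xa = harmonic_mean(x), geometric_mean(x), arithmetic_mean(x)
print(xh, xg, xa)                # illustrates the ordering of Eq (12.25): xh < xg < xa
```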
Sample Median
Let us begin by reordering the observations x₁, x₂, …, xₙ in ascending order to obtain x_(1), x_(2), …, x_(m), …, x_(n) (we could also do this in descending order instead); if n is odd, the middle number, x_(m), is the median, where m = (n + 1)/2.

If n is even, let m = n/2; then the median is the average of the two middle numbers x_(m) and x_(m+1), i.e.,

$$x_m = \frac{x_{(m)} + x_{(m+1)}}{2} \tag{12.26}$$

Because the median, unlike the means, does not involve carrying out any arithmetic operation on the extreme values, x_(1) and x_(n), it is much less susceptible to unusually large or unusually small observations. The median is therefore quite robust against outliers. Nevertheless, because it does not utilize all the information contained in the sample data set, it is more susceptible to chance fluctuations.
Sample Mode
The sample mode can only be obtained directly from the frequency distribution.

12.4.3 Measures of Variability

Averages by themselves do not (and indeed cannot) adequately describe the entire data distribution: they locate the center but give no information about how the data are clustered around this center. The following are some popular measures of sample variability or spread.

Range
This is the simplest measure of variability or spread in a sample data set. Upon ordering the data in ascending order x_(1), x_(2), …, x_(n), the sample range is simply the difference between the largest and smallest values, i.e.,

$$R = x_{(n)} - x_{(1)} \tag{12.27}$$

Because it is strictly a measure of the size of the interval covered by the sample data, and does not take any other observation into consideration, it is considered a "hasty" measure which is very susceptible to chance fluctuations, making it somewhat unstable.
Average Deviation
Define the deviation of each observation x_i from the sample average, x̄, as:

$$d_i = x_i - \bar{x} \tag{12.28}$$

(note that $\sum_{i=1}^{n} d_i = 0$); then

$$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} |d_i| \tag{12.29}$$

known as the average deviation, provides a measure of the average absolute deviation from the mean. If the sample mean is replaced by the median, x_m, then the result is

$$\bar{d}_m = \frac{1}{n}\sum_{i=1}^{n} |x_i - x_m| \tag{12.30}$$

a quantity known as the mean absolute deviation from the median (MADM). Because the median is more robust to outliers than the sample average, the MADM, d̄_m, is, by the same token, also more robust to outliers than d̄.
Sample Variance and Standard Deviation
The mean squared deviation from the sample mean, defined as:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \tag{12.31}$$

is a more popular measure of variability or spread; it is the sample version of the population variance. In this context, the following is an important implication of the results of the preceding discussion on measures of central tendency (that the smallest possible mean squared deviation, E[(X − c)²], is achieved when c = μ): the smallest achievable mean squared deviation is the population variance, σ² = E[(X − μ)²], the mean squared deviation from the mean; the mean squared deviation from any other value (central or not) is therefore greater than σ².

TABLE 12.6: Descriptive statistics for yield data sets YA and YB

Characteristic       | YA values | YB values
---------------------|-----------|----------
Mean                 | 75.52     | 72.47
Standard Deviation   | 1.43      | 2.76
Variance             | 2.05      | 7.64
Skewness             | 0.32      | 0.17
Kurtosis             | -0.09     | -0.73
Total number, n      | 50        | 50
Min                  | 72.30     | 67.33
First quartile, Q1   | 74.70     | 70.14
Median               | 75.25     | 72.44
Third quartile, Q3   | 76.60     | 74.68
Max                  | 79.07     | 78.49
The positive square root of the sample variance,

$$s = +\sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}} \tag{12.32}$$

is the sample standard deviation; it has the same unit as x, as opposed to s², which has the unit of x squared.
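The following minimal Python sketch (added here, not from the original text) illustrates these definitions on a hypothetical sample; note the (n − 1) divisor of Eq (12.31):

```python
import statistics

x = [72.3, 74.7, 75.2, 76.6, 79.1]                    # hypothetical sample
xbar = sum(x) / len(x)
xm = statistics.median(x)

R = max(x) - min(x)                                   # range, Eq (12.27)
d_bar = sum(abs(v - xbar) for v in x) / len(x)        # average deviation, Eq (12.29)
madm = sum(abs(v - xm) for v in x) / len(x)           # MADM, Eq (12.30)
s2 = sum((v - xbar) ** 2 for v in x) / (len(x) - 1)   # sample variance, Eq (12.31)
s = s2 ** 0.5                                         # sample standard deviation, Eq (12.32)

print(R, d_bar, madm, s2, s)
```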
Example 12.4: SUMMARY DESCRIPTIVE STATISTICS FOR YIELD DATA SETS OF CHAPTER 1
First obtain, then compare and contrast, summary descriptive statistics for the yield data sets YA and YB presented in Chapter 1 (Table 1.1).
Solution:
The summary descriptive statistics for these data sets (obtainable using any typical software package) are shown in Table 12.6.
The computed average is higher for YA than for YB, but the standard deviation (hence also the variance) is lower for YA; the very low skewness values for both data sets indicate a lack of asymmetry. The values shown above for kurtosis are actually for the so-called "excess kurtosis," defined as the ordinary kurtosis minus 3, which will be zero for a perfectly Gaussian distribution; the computed values shown here indicate that both data distributions are essentially Gaussian.
The remaining quantitative descriptors make up the five-number summary used to produce the typical box-plot; jointly, they indicate what we already know from Fig 12.11: in every single one of these categories, the value for YA is consistently higher than the corresponding value for YB.
The modes, which cannot be computed directly from data, are obtained from the histograms (or frequency polygons) as y*_A = 75; y*_B = 70 for the specific bin sizes used to generate the respective histograms/frequency polygons (see Figs 12.8 and Fig 1.2 in Chapter 1; or Figs 12.9, 12.10).
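A sketch of how such a summary could be generated in Python, assuming the yield data are available as arrays yA and yB (hypothetical variable names; the scipy convention shown reports excess kurtosis, matching Table 12.6):

```python
import numpy as np
from scipy import stats

def summarize(y):
    q1, med, q3 = np.percentile(y, [25, 50, 75])
    return {
        "mean": np.mean(y),
        "std": np.std(y, ddof=1),              # sample standard deviation
        "var": np.var(y, ddof=1),              # sample variance
        "skewness": stats.skew(y),
        "excess kurtosis": stats.kurtosis(y),  # Fisher definition: kurtosis - 3
        "n": len(y),
        "min": np.min(y), "Q1": q1, "median": med, "Q3": q3, "max": np.max(y),
    }

# yA, yB = np.loadtxt(...), np.loadtxt(...)   # the Table 1.1 yield data
# print(summarize(yA)); print(summarize(yB))
```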

This example recalls the still-unsolved problem posed in Chapter 1, and it is reasonable to ask whether the quantitative comparisons shown above are sufficient to lead us to conclude that Process A is better than Process B. While indeed these results seem to indicate that process A might actually be better than process B, i.e., that YA > YB, any such categorical statement (that YA > YB) made at this point, strictly on the basis of this particular data set alone, will be merely speculative. The summary statistics in Table 12.6 apply only to this particular set of data; a different sample drawn from the processes will produce different data and hence different summary statistics. To make any categorical statement about which process is better, and by how much, requires more rigorous techniques of statistical inference that are addressed in the remaining chapters of Part IV.

12.4.4 Supplementing Numerics with Graphics

It is easy to misconstrue the two aspects of descriptive statistics (graphical techniques and numerical descriptors) as mutually exclusive; or, at the very least, that the former is more useful for qualitative data while the latter is more useful for quantitative data. While there is an element of truth to this latter statement, it is more precise to consider the two aspects as complementary. Graphical techniques are great for conveying a general sense of the information contained in the data, but they cannot be used for quantitative analysis or comparisons. On the other hand, even though these graphical techniques are not quantitative, they provide insight into the nature of the data set that mere numbers alone cannot possibly convey. One is incomplete without the other.

To illustrate this last point, we now present a classic example due to Anscombe³. The example involves four data sets, the first of which is shown in Table 12.7.

The basic numerical characteristics of X1 and Y1 are as follows: Total number, n = 11 for both variables; the averages: x̄₁ = 9.0; ȳ₁ = 7.5; the standard deviations: s_x1 = 3.32; s_y1 = 2.03; and the correlation coefficient

³ Anscombe, Francis (1973), "Graphs in Statistical Analysis," The American Statistician, pp. 195-199.

TABLE 12.7: The Anscombe data set 1

 X1      Y1
10.00    8.04
 8.00    6.95
13.00    7.58
 9.00    8.81
11.00    8.33
14.00    9.96
 6.00    7.24
 4.00    4.26
12.00   10.84
 7.00    4.82
 5.00    5.68

TABLE 12.8: The Anscombe data sets 2, 3, and 4

 X2      Y2      X3      Y3      X4      Y4
10.00    9.14   10.00    7.46    8.00    6.58
 8.00    8.14    8.00    6.77    8.00    5.76
13.00    8.74   13.00   12.74    8.00    7.71
 9.00    8.77    9.00    7.11    8.00    8.84
11.00    9.26   11.00    7.81    8.00    8.47
14.00    8.10   14.00    8.84    8.00    7.04
 6.00    6.13    6.00    6.08    8.00    5.25
 4.00    3.10    4.00    5.39   19.00   12.50
12.00    9.13   12.00    8.15    8.00    5.56
 7.00    7.26    7.00    6.42    8.00    7.91
 5.00    4.74    5.00    5.73    8.00    6.89

between the two variables: ρ_xy = 0.82. A scatter plot of Y1 versus X1 is shown in Fig 12.19, which indicates a reasonably strong linear relationship between the two variables, as correctly quantified by ρ_xy = 0.82.

And now, let us consider the remaining data sets 2, 3, and 4, respectively for the variable pairs (X2, Y2), (X3, Y3), and (X4, Y4), as shown in Table 12.8. As the reader is encouraged to confirm (see Exercise 12.17), the basic numerical characteristics of each (X, Y) pair in this data table are easily obtained as follows: Total number, n = 11 for each variable; the averages: x̄₂ = x̄₃ = x̄₄ = 9.0; ȳ₂ = ȳ₃ = ȳ₄ = 7.5; the standard deviations: s_x2 = s_x3 = s_x4 = 3.32; s_y2 = s_y3 = s_y4 = 2.03; and the correlation coefficient between each set of paired variables, ρ_xy = 0.82 for (X2, Y2), (X3, Y3), and (X4, Y4).

Not only are these sets of characteristic numbers identical for these three data sets, they are also identical to the ones for the first data set, (X1, Y1);

FIGURE 12.19: Scatter plot of Y1 and X1 from Anscombe data set 1.

and on this basis alone, one might be tempted to conclude that the four data sets must somehow be equivalent. Of course this is not the case. Yet, how truly different the data sets are becomes quite obvious by mere inspection of the scatter plots shown in Figs 12.20, 12.21, and 12.22 when compared to Fig 12.19.

As noted earlier, while data set 1 indicates a strong linear relationship between the two variables (correctly implied by ρ_xy = 0.82), data set 2 clearly indicates a quadratic relationship between Y2 and X2. Data set 3, on the other hand, indicates an otherwise perfectly linear relationship that was somehow corrupted by a lone outlier, the third entry (13, 12.74). Data set 4 indicates what could best be considered as the result of a strange 2-level experimental design involving 10 replicates of the experiment at the low value, X4 = 8, along with a single experiment at the high value, X4 = 19.

Thus, while good for summarizing data with a handful of quantitative characteristics, numerical descriptions are necessarily incomplete; they can (and often do) omit, or filter out, important distinguishing features in the data sets. For a complete view of the information contained in any data set, it is important to supplement quantitative descriptions with graphical representations.

FIGURE 12.20: Scatter plot of Y2 and X2 from Anscombe data set 2.

FIGURE 12.21: Scatter plot of Y3 and X3 from Anscombe data set 3.

FIGURE 12.22: Scatter plot of Y4 and X4 from Anscombe data set 4.
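As a complement (added here; not part of the original text), the following Python sketch verifies numerically that the Anscombe pairs share essentially the same summary statistics, using the values from Tables 12.7 and 12.8:

```python
import numpy as np

x1 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for x, y in [(x1, y1), (x1, y2)]:        # X2 equals X1; sets 3 and 4 are analogous
    r = np.corrcoef(x, y)[0, 1]          # correlation coefficient
    print(np.mean(x), np.mean(y), np.std(x, ddof=1), np.std(y, ddof=1), round(r, 2))
# each line prints 9.0, ~7.5, 3.32, ~2.03, 0.82, despite the very different scatter plots
```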

12.5 Summary and Conclusions

This chapter was designed to serve as a transitional link between probability and statistics. By first articulating the central issue in statistical analysis (characterizing randomly varying phenomena on the basis of finite data), the case was made for why statistics must rely on probability even as it complements it. This led to the introduction of the basic concepts involved in statistics and an overview of what the upcoming detailed study of statistics entails. Compared to what is covered in the rest of Part IV, this chapter's introduction to descriptive statistics (organization, graphical representation, and numerical summarization of sample data) may have been brief, but it is no less important. It is worth reiterating, therefore, that numerical analysis is most effective when complemented (wherever possible) with graphical plots.

Here are some of the main points of the chapter again.

• Statistics is concerned with fully characterizing randomly varying phenomena on the basis of finite sample data; it relies on probability to quantify the inevitable uncertainty associated with such an endeavor.

• The three central concepts of statistical analysis are:
  - Population: the complete collection of all the data obtainable from an experiment; unrealizable in practice, it is to statistics what the sample/random variable space is to probability;
  - Sample: specific observations of finite size; a subset of the population;
  - Statistical Inference: conclusions drawn about a population from an analysis of a sample, including a measure of the associated uncertainty.

• The three aspects of statistics are:
  - Descriptive Statistics: organizing, summarizing, and interpreting data;
  - Inferential Statistics: drawing inference from data;
  - Statistical Design of Experiments: systematically acquiring informative data.

APPENDIX
We wish to find the value of c that minimizes the objective function:

$$\Psi = \lim_{p \to \infty} \Psi_p = \lim_{p \to \infty} \int_{-\infty}^{\infty} (x - c)^p f(x)\,dx \tag{12.33}$$

for the continuous random variable, X. Upon taking derivatives and equating to zero, we obtain:

$$\frac{\partial \Psi}{\partial c} = \lim_{p \to \infty} \left[-p \int_{-\infty}^{\infty} (x - c)^{p-1} f(x)\,dx\right] = 0 \tag{12.34}$$

where integration by parts yields:

$$\frac{\partial \Psi}{\partial c} = 0 = \lim_{p \to \infty} \left[-f(x)(x - c)^p \Big|_{-\infty}^{\infty}\right] + \lim_{p \to \infty} \int_{-\infty}^{\infty} (x - c)^p f'(x)\,dx \tag{12.35}$$

The first term on the RHS of the equality sign after 0 vanishes because, for all valid pdfs, f(−∞) = f(∞) = 0, so that Eq (12.35) reduces to:

$$\frac{\partial \Psi}{\partial c} = 0 = \lim_{p \to \infty} \int_{-\infty}^{\infty} (x - c)^p f'(x)\,dx \tag{12.36}$$

and the indicated limit can only be zero if:

$$f'(x) = 0 \tag{12.37}$$

and this occurs at the mode of the pdf, f(x). It is left as an exercise to the reader to obtain the corresponding result for a discrete X (see Exercise 12.19).


REVIEW QUESTIONS
1. As presented in this chapter, what are the three distinct but related entities involved in a systematic study of randomly varying phenomena?
2. What is the difference between X, the random variable, and n individual observations, {x_i}_{i=1}^{n}?
3. What is the difference between writing a pdf as f(x) and as f(x|θ), where θ is the vector of parameters?
4. Why is the theoretical description f(x) for any specific randomly varying phenomenon never completely available?
5. With probability analysis, which of these two entities, f(x) and {x_i}_{i=1}^{n}, is available and what is to be determined with it? With statistical analysis, on the other hand, which of the two entities is available and what is to be determined with it?
6. Statistics is a methodology for doing what?
7. As stated in Section 12.1.2, what three concepts are central to statistical analysis?
8. What is the difference between a sample and a population?
9. What is statistical inference?
10. Which of the following two entities can be specified à-priori, and which is an à-posteriori entity: (a) the random variable space, V_X, in probability, and (b) the population in statistics?
11. Why must one settle for characterizing the population by drawing statistical
inference?
12. How does systematic data collection fit into statistical inference?
13. What is the connection between probability, statistics and design of experiments?
14. Into which two categories is statistics primarily categorized?
15. What is Descriptive Statistics?
16. What is Inductive Statistics?
17. What does Design of Experiments entail?
18. What is the difference between a qualitative and a quantitative variable?


19. What is the difference between nominal and ordinal data?


20. What is a bar chart and what differentiates a Pareto chart from it?
21. What is a pie chart?
22. What sort of information does the pie chart excel at conveying compared to the
bar chart? Conversely, what sort of information does the bar chart excel at conveying compared to the pie chart?
23. What is a frequency distribution and how is it related to a histogram?
24. What is the relationship between the frequency distribution of a data set and
the theoretical distribution of the population from which the data was obtained?
25. Why should a histogram be based on an ample amount of data?
26. What is the bin size, and why is its choice important in generating informative
histograms?
27. What is the five-number summary for a data set?
28. What is a box plot?
29. What are box plots particularly adept at illustrating?
30. What sort of data sets are box plots better for displaying than histograms?
31. What is a scatter plot and what is it most useful for?
32. What are three common measures of central tendency?
33. In choosing c to minimize the expected normed deviation, E(|X − c|ⁿ), what quantity is obtained when n = 1, or n = 2, or n = ∞?
34. What is the difference between an arithmetic mean, a geometric mean, and a harmonic mean? Under what conditions will one be preferred over the others?
35. Define three common measures of variability.
36. In what way do graphical techniques complement quantitative numerical techniques of summarizing data?
37. What is the main lesson of the Anscombe data sets discussed in Section 12.4.4?


EXERCISES
Section 12.1
12.1 Consider the experiment of tossing a single six-faced die and recording the
number shown on the top face after the die comes to rest.
(i) What is the random variable in this case, and what is the random variable space?
(ii) What is the population and what will constitute a sample from this population?
(iii) With adequate justification, postulate a probability model for this random variable.
12.2 Consider a chess player who is participating in a two-game, pre-tournament
qualication series where the outcome of each game is either a win, a loss, or a draw.
(i) If we are interested only in the total number of wins and the total number of
draws, what is the random variable in this case, its dimensionality, and the associated
random variable space?
(ii) Describe the population in this case and what will constitute a sample from such
a population.
(iii) With adequate justification, postulate a reasonable probability model for this
random variable. What are the unknown parameters?
12.3 Lucas (1985)4 studied the number and frequency of occurrences of accidents
over a 10-year period at a DuPont company facility. If the variable of interest is the
time between occurrences of these accidents, describe the random variable space,
the population, and what might be considered as a sample. Postulate a reasonable
probability model for this variable and note how many parameters it has.
12.4 In studying the useful lifetime (in years) of a brand of washing machines with
the aid of a Weibull probability model, indicate the random variable space, the population and the population parameters. How can one go about obtaining a sample
{x_i}_{i=1}^{50} from this population?
Section 12.2
12.5 Classify each of the following variables as either quantitative or qualitative; if
quantitative, specify whether it is discrete or continuous; if qualitative whether it is
ordinal or nominal.
(i) Additive Type, (A, B, or C); (ii) Additive Concentration, (moles/liter); (iii) Heat Condition, (Low, Medium, High); (iv) Agitation rate, (rpm); (v) Total reactor Volume, (liters); (vi) Number of reactors in ensemble, (1, 2, or 3).
12.6 The Lucas (1985) study of Exercise 12.3 involved the following variables:
(i) Period, (I, II); (ii) Length of Study, (Years); (iii) Number of accidents; (iv) Type
of Accident; (v) Time between accidents.
Classify each variable as either quantitative or qualitative; if quantitative, specify whether it is discrete or continuous; if qualitative whether it is ordinal or nominal.
⁴ Lucas, J. M., (1985). "Counted Data CUSUMs," Technometrics, 27, 129-144.


12.7 A study of the effect of environmental cues on cell adhesion involved the following variables. Classify each one as either quantitative (discrete or continuous) or
qualitative (ordinal or nominal).
(i) Type of stimulus, (Mechanical, Chemical, Topological); (ii) Ligand concentration;
(iii) Surface type (Patterned, Plain); (iv) Mechanical force; (v) Number of integrin
clusters; (vi) Integrin cluster size; (vii) Cell Status (Adherent, Non-Adherent).
Section 12.3
12.8 The table below shows where chemical engineers found employment in 2000, categorized by degree. For each degree category (BS, MS and PhD) draw a bar chart and a pie chart. Comment on what stands out most prominently in each case, within each degree category and across the degree categories (for example, "Academia" across the categories).

Employer                 | BS Placement % | MS Placement % | PhD Placement %
-------------------------|----------------|----------------|----------------
Industry                 | 55.9           | 44.1           | 57.8
Government               | 1.7            | 2.2            | 0.8
Grad/Professional School | 11.2           | 33.1           | 13.1
Returned to Home Country | 1.3            | 4.7            | 0.1
Unemployed               | 9.5            | 4.5            | 2.8
Unknown Employment       | 18.8           | 7.4            | 6.4
Academia                 | 0.0            | 1.8            | 16.5
Other                    | 1.8            | 2.2            | 1.7

12.9 Generate Pareto Charts for each degree category for the chemical engineering
employment data in Exercise 12.8. Interpret these charts.
12.10 The data in the table below (adapted from a 1983 report from the President's Council on Physical Fitness and Sports) shows the number of adult Americans (non-professionals) participating in the indicated sports.

Type of Sport    | Number of Participants (in millions)
-----------------|-------------------------------------
Basketball       | 29
Bicycling        | 44
Football (Touch) | 18
Golf             | 13
Hiking           | 34
Ice-skating      | 11
Racquetball      | 10
Roller-skating   | 20
Running          | 20
Skiing           | 10
Softball         | 26
Swimming         | 60
Tennis           | 23
Volleyball       | 21

Generate a bar chart and a Pareto chart for this data; interpret the charts. Why is it inadvisable to use a pie chart to represent such data?


12.11 The following data set has been adapted from information provided by the IRS in 1985 about the population of "well-to-do" (WTD) individuals in various states. (At the time, a WTD individual was defined as someone with gross assets ≥ $500,000.)

State        | WTD Population
-------------|---------------
California   | 301,500
Texas        | 204,800
Florida      | 151,800
New York     | 110,100
Illinois     | 108,000
Pennsylvania | 86,800
Ohio         | 52,500
New Jersey   | 51,300
Iowa         | 50,800
Michigan     | 48,100

Generate a bar chart for the data in terms of relative frequency. In how many of
these 10 states will one find approximately 80% of the listed well-to-do individuals?
Section 12.4
12.12 The data in the following table shows samples of size n = 20 drawn from
four dierent populations coded as N, L, G and I. Generate a histogram and a box
plot for each of the data sets. Discuss what these plots indicate about the general
characteristics of the population from which the data were obtained.
XN        | XL       | XG       | XI
----------|----------|----------|----------
 9.3745   |  7.9128  | 10.0896  | 0.084029
 8.8632   |  5.9166  | 15.7336  | 0.174586
11.4943   |  4.5327  | 15.0422  | 0.130492
 9.5733   | 33.2631  |  5.5482  | 0.115567
 9.1542   | 24.1327  | 18.0393  | 0.187260
 9.0992   |  5.4151  | 17.9543  | 0.100054
10.2631   | 16.9556  | 12.5549  | 0.101405
 9.8737   |  3.9345  |  9.6640  | 0.100835
 7.8192   | 35.0376  | 14.2975  | 0.097173
10.4691   | 25.1182  |  4.2599  | 0.141233
 9.6981   |  1.1804  | 19.1084  | 0.060470
10.5911   |  2.3503  |  7.0735  | 0.127663
11.6526   | 15.6894  |  7.6392  | 0.074183
10.4502   |  5.8929  | 14.1899  | 0.086606
10.0772   |  8.0254  | 13.8996  | 0.084915
10.2932   | 16.1482  |  9.7680  | 0.242657
11.7755   |  0.6848  |  8.5779  | 0.052291
 9.3790   |  6.6974  |  7.5486  | 0.116172
 9.9202   |  3.6909  | 10.4043  | 0.084339
10.9067   | 34.2152  | 14.8254  | 0.205748

12.13 For each sample in Exercise 12.12, compute the (i) arithmetic mean; (ii) geometric mean; (iii) median; and (iv) harmonic mean. Which do you think is a more appropriate measure of the central tendency of the original population from which these samples were drawn, and why?
12.14 The table below shows a relative frequency summary of sample data on distances between DNA replication origins (inter-origin distances), measured by Li et al. (2003)⁵, with an in vitro Xenopus egg extract assay in Chinese Hamster Ovary (CHO) cells, as reported in Chapter 7 of Birtwistle (2008)⁶.

Inter-Origin Distance (kb), x | Relative Frequency, f_r(x)
------------------------------|---------------------------
  0                           | 0.00
 15                           | 0.09
 30                           | 0.18
 45                           | 0.26
 60                           | 0.18
 75                           | 0.09
 90                           | 0.04
105                           | 0.03
120                           | 0.05
135                           | 0.04
150                           | 0.01
165                           | 0.02

(The data set is similar to, but different from, the one in Application Problem 9.40 in Chapter 9.) Obtain a histogram of the data and determine the mean, variance and your best estimate of the median.
12.15 From the data given in Table 12.5 in the text, generate a scatter plot of (i)
city gas mileage against engine capacity, and (ii) city gas mileage against number of
cylinders, for the two-seater automobiles listed in that table. Compare these plots to
the corresponding ones in the text for highway gas mileage. Are there any surprises
in these city gas mileage plots?
12.16 Let X1 and X2 represent, respectively, the engine capacity, in liters, and number of cylinders for the population of two-seater automobiles; let Y1 and Y2 represent
the city gas mileage and highway gas mileage, respectively, for these same automobiles. Consider that the data in Table 12.5 constitute appropriate samples from the
respective populations. From the supplied sample data, compute the complete set of 6 pairwise correlation coefficients between these variables. Comment on what these correlation coefficients mean.
12.17 Confirm that the basic numerical characteristics of each (X, Y) pair in Table 12.8 are as given in the text.
12.18 Determine the value of c that minimizes the mean absolute deviation Ψ₁ between it and the possible values of the discrete random variable, X, whose pdf is given as f(x_i), i.e.,

$$\Psi_1 = \sum_{i=0}^{\infty} |x_i - c| f(x_i) \tag{12.38}$$

⁵ Li, F., Chen, J., Solessio, E. and Gilbert, D. M. (2003). "Spatial distribution and specification of mammalian replication origins during G1 phase," J Cell Biol 161, 257-66.
⁶ M. R. Birtwistle, (2008). Modeling and Analysis of the ErbB Signaling Network: From Single Cells to Tumorigenesis, PhD Dissertation, University of Delaware.

12.19 Find the value of c that minimizes the objective function:

$$\Psi = \lim_{p \to \infty} \Psi_p = \lim_{p \to \infty} \sum_{i=0}^{\infty} (x_i - c)^p f(x_i) \tag{12.39}$$

and show that it is the mode of the discrete pdf f(x_i), i.e., f(c) > f(x_i) for all i.

APPLICATION PROBLEMS
12.20 A quality control engineer at a semi-conductor manufacturing site is concerned about the number of contaminant particles (flaws) found on each standard size silicon wafer produced at the site. A sample of 20 silicon wafers selected and examined for flaws produced the result (the number of flaws found on each wafer) shown in the following table.

3  0  0  2  3  0  3  2  1  2
4  1  2  3  2  1  2  4  0  1

(i) For this particular problem, what is the random variable, X, the set {x_i}_{i=1}^{n}, and why is the Poisson model, with the single parameter λ, a reasonable probability model for the implied phenomenon?
(ii) From the expression for f(x|λ), compute the theoretical probabilities when the population parameter λ is specified as 0.5, 1.0 and 1.5. From these theoretical probabilities, which of the postulated population parameters appears more representative of the observations?
12.21 The time in months between occurrences of safety violations in a toll manufacturing facility is shown in the table below.

1.31  0.15  3.02  3.17  4.84  0.71  0.70  1.41  2.68  0.68
1.94  3.21  2.91  1.66  1.51  0.30  0.05  1.62  6.75  1.29
0.79  1.22  0.65  3.90  0.18  0.57  7.26  0.43  0.96  3.76

(i) Determine the mean, median and variance for this sample data. Construct a
histogram and explain why the observed shape is not surprising, given the nature of
the phenomenon in question.
(ii) What is a reasonable probability model for the population from which the data
came? If the population parameter, the mean time between violations, is postulated
to be 2 months, compute the theoretical probability of going more than 2 months
without a safety violation. Is this theoretical probability compatible with this sample
data? Explain.
(iii) In actual fact, the data were obtained for three different operators and have been arranged accordingly: the first row is for Operator A, the second row for Operator B, and the third row for Operator C. It has been a long-held preconception in the manufacturing facility that Operator A is relatively more "safety-conscious" than the other two. Strictly on the basis of any graphical and numerical descriptions of each data set that you deem appropriate, is there any suggestion in this data set that could potentially support this preconception? Explain.
12.22 Nelson (1989)⁷ quantified the cold cranking power of five different battery types in terms of the number of seconds that a particular battery generated its rated amperage without falling below 7.2 volts, at 0°F. The experiment was repeated four times for each battery type and the resulting data set is shown in the following table.

Experiment No | Battery Type 1 | Type 2 | Type 3 | Type 4 | Type 5
--------------|----------------|--------|--------|--------|-------
1             | 41             | 42     | 27     | 48     | 28
2             | 43             | 43     | 26     | 45     | 32
3             | 42             | 46     | 28     | 51     | 37
4             | 46             | 38     | 27     | 46     | 25

Is there any suggestion of "descriptive" (as opposed to "inductive") evidence in this data set to support the postulate that some battery types are better than others? Which ones appear better? (More precise "inductive" methods for answering such questions are discussed in Chapters 15 and 19.)
12.23 Consider the data in the following table showing the boiling point (in °C) of 8 hydrocarbons in a homologous series, along with n, the number of carbon atoms in each molecule.

Hydrocarbon Compound | n, Number of Carbon Atoms | Boiling Point (°C)
---------------------|---------------------------|-------------------
Methane              | 1                         | -162
Ethane               | 2                         | -88
Propane              | 3                         | -42
n-Butane             | 4                         | 1
n-Pentane            | 5                         | 36
n-Hexane             | 6                         | 69
n-Heptane            | 7                         | 98
n-Octane             | 8                         | 126

What does this data set imply about the possibility of predicting the boiling point of compounds in this series on the basis of the number of carbon atoms? Compute the correlation coefficient between these two variables, even though the number n is not a random variable. Comment on what the computed value indicates.
12.24 The following table contains experimental data on the thermal conductivity, k (W/m-°C), of a metal, determined at various temperatures.

k (W/m-°C) | Temperature (°C)
-----------|------------------
 93.228    | 100
 92.563    | 150
 99.409    | 200
101.590    | 250
111.535    | 300
115.874    | 350
119.390    | 400
126.615    | 450

⁷ Nelson, L.S., (1989). "Comparison of Poisson means," J. of Qual. Tech., 19, 173-179.

What sort of systematic functional relationship between the two variables (if any) does the evidence in the data suggest? Compute the correlation coefficient between the two variables and comment on what this value indicates. What would you recommend as a reasonable value to postulate for the thermal conductivity of the metal at 325°C? Justify your answer succinctly.
12.25 The following data set, from the same study by Lucas (1985) referenced in Exercise 12.3, shows the actual number of accidents occurring per quarter (three months), separated into two periods: Period I is the first five-year period of the study; Period II, the second five-year period.

Period I             Period II
5   5  10   8        3   4   2   0
4   5   7   3        1   3   2   2
2   8   6   9        7   7   1   4
5   6   5  10        1   2   2   1
6   3   3  10        4   4   4   4

Provide an appropriate statistical description and summary of this data set; comment on any distinctive characteristics and postulate potential explanations for your observation.
12.26 According to census records, the age distribution of the inhabitants of the United States in 1960 and in 1980 is as shown in the table below.

Age Group | 1960    | 1980
----------|---------|--------
< 5       | 20,321  | 16,348
5-9       | 18,692  | 16,700
10-14     | 16,773  | 18,242
15-19     | 13,219  | 21,168
20-24     | 10,801  | 21,319
25-29     | 10,869  | 19,521
30-34     | 11,949  | 17,561
35-39     | 12,481  | 13,965
40-44     | 11,600  | 11,669
45-49     | 10,879  | 11,090
50-54     | 9,606   | 11,710
55-59     | 8,430   | 11,615
60-64     | 7,142   | 10,088
≥ 65      | 16,560  | 25,550

From an appropriate statistical description of these data sets, comment on the indicated changes in the population in the two decades between 1960 and 1980. Identify any features that might be due to the "baby-boom" generation, those born during the period from the end of World War II until about 1965.


Chapter 13
Sampling

13.1 Introductory Concepts ........................................... 459
     13.1.1 The Random Sample ....................................... 460
     13.1.2 The Statistic and its Distribution ...................... 461
            Definitions ............................................. 461
            Utility ................................................. 462
13.2 The Distribution of Functions of Random Variables .............. 463
     13.2.1 General Overview ........................................ 463
     13.2.2 Some Important Sampling Distribution Results ............ 463
13.3 Sampling Distribution of The Mean .............................. 465
     13.3.1 Underlying Probability Distribution Known ............... 465
     13.3.2 Underlying Probability Distribution Unknown ............. 467
     13.3.3 Limiting Distribution of the Mean ....................... 467
     13.3.4 σ Unknown ............................................... 470
13.4 Sampling Distribution of the Variance .......................... 472
13.5 Summary and Conclusions ........................................ 476
REVIEW QUESTIONS .................................................... 477
EXERCISES ........................................................... 478
APPLICATION PROBLEMS ................................................ 482

If in other sciences we should arrive at certainty without doubt and truth without error, it behooves us to place the foundation of knowledge in mathematics.
Roger Bacon (c.1220-c.1292), Opus Majus, Bk. 1, Ch. 4

If, as stated in Chapter 12, inductive (or inferential) statistics is primarily concerned with drawing inference about a population from sample information, then a logical treatment of inductive statistics must begin with sampling, a formal study of samples from a population. Because it is a finite collection of individual observations, a sample is itself susceptible to random variation, since different samples drawn from the same population under identical conditions will be different. As such, before samples can be useful for statistical inference concerning the populations that produced them, the variability inherent in samples must be characterized mathematically (just as was done for individual observations, x_i, of a random variable, X). How one characterizes the variability inherent in samples, as distinct from, but obviously related to, characterizing the variability inherent in individual observations of a random variable, X, is the focus of this chapter. Sampling is the foundational element of statistical inference, and this chapter's discussion is an indispensable precursor to the discussions of estimation and hypothesis testing to follow in the next two chapters.

13.1 Introductory Concepts

As we now know, the role played by the sample space (or, equivalently, the random variable space) in probability theory is analogous to that of the population in statistics. In this regard, what the randomly varying individual observation, x, is to the random variable space, V_X, in probability theory, the finite-sized sample, {x_i}_{i=1}^{n}, is to the population in statistics. In the former, the variability inherent to individual observations is characterized by the pdf, f(x), an ensemble representation that is then used to carry out theoretical probabilistic analysis for the elements of V_X. There is an analogous problem in statistics: in order to characterize the population appropriately, we must first figure out how to characterize the variability intrinsic to the finite-sized sample. The entire subject matter of characterizing and analyzing samples from a population, and employing such results to make statistical inference statements about the population, is known as sampling, or sampling theory.

The following three concepts:
1. The Random Sample;
2. The Statistic; and
3. The Distribution of a Statistic (or The Sampling Distribution),
are central to sampling. As discussed in detail shortly, sampling theory combines these concepts into the basis for characterizing the uncertainty in samples in terms of probability distributions that can then be used for statistical inference.
inference.

13.1.1 The Random Sample

Since the finite-sized sample is the only source of information about an entire population, it is essential that the sample be representative of the population. This is the primary motivation behind the concept of the random sample, explained as follows. Consider a set of observations (data) {x₁, x₂, …, xₙ} drawn from a population of size N (where N is possibly infinite): if all possible subsets of size n of the N elements of the population have equal probabilities of being chosen, then the observations constitute a random sample from the population. The rationale is clear: with equal probabilities of selection, no particular subset will preferentially favor any particular aspect of the population. The mathematical implication of this concept (sometimes considered as the formal definition) now follows.


Definition: Let X₁, X₂, …, Xₙ denote n mutually, stochastically, independent random variables, each of which has the same, but possibly unknown, pdf f(x); i.e., the pdfs of X₁, X₂, …, Xₙ are, respectively, f₁(x₁) = f(x₁); f₂(x₂) = f(x₂); …; fₙ(xₙ) = f(xₙ), so that the joint pdf is f(x₁)f(x₂)⋯f(xₙ). The random variables X₁, X₂, …, Xₙ constitute a random sample from a distribution that has the pdf f(x).

The condition noted here for the random variables is also sometimes rendered as "independently and identically distributed," or i.i.d.
The practical implication of this concept and the definition given above is that if we can ensure that the sample from a population is drawn randomly, then the joint distribution of the sample is a product of the contributing pdfs. This rather simple concept significantly simplifies the theory and practice of statistical inference, as we show shortly.

13.1.2 The Statistic and its Distribution

Definitions
A statistic is any function of one or more random variables that does not depend upon any unknown population parameters. For example, let X₁, X₂, …, Xₙ be mutually stochastically independent random variables, each with identical N(μ, σ²) distributions, with unknown parameters μ and σ; then the random variable Y defined as:

$$Y = \sum_{i=1}^{n} X_i \tag{13.1}$$

is a statistic. It is a function of n random variables and it does not depend on any of the unknown parameters, μ and σ. On the other hand,

$$Z = \frac{X_1 - \mu}{\sigma} \tag{13.2}$$

is a function of the random variable X₁, but it depends on μ and σ; unless these parameters are known, Z does not qualify as a statistic.

In general, for a set of random variables X₁, X₂, …, Xₙ constituting a random sample, the random variable Y that is a function of these random variables, i.e.,

$$Y = g(X_1, X_2, \ldots, X_n) \tag{13.3}$$

is a statistic so long as g(·) is independent of unknown parameters. Observe that given the joint pdf f(x₁, x₂, …, xₙ), we can use Eq (13.3) to obtain f_Y(y), the pdf of the statistic, Y. It is important to note:


1. Even though a statistic (say Y as defined above) does not depend upon an unknown parameter, the distribution of a statistic (say f_Y(y)) quite often depends on unknown parameters.

2. The distributions of such statistics are called sampling distributions because they are distributions of functions of samples. Since a statistic, as defined above, is itself a random variable, its sampling distribution describes the variability (chance fluctuations) one will observe in it as a result of random sampling.

Utility
The primary utility of the statistic and its distribution is in determining unknown population parameters from samples, and in quantifying the inherent variability. This becomes evident from the following re-statement of the problem of statistical inference:

1. The pdf for characterizing the random variable, X, f(x|θ), contains unknown parameters, θ; were it possible to observe, via experimentation, the complete population in its entirety, we would be able to construct, from such observations, the complete f(x|θ), including the parameters; however, only a finite sample from the population is available via experimentation. When the form of f(x|θ) is known, we are left with the issue of determining the unknown parameters, θ, from the sample, i.e., making inference about the population parameter from sample data.

2. We make these inferences by investigating random samples, using appropriate "statistics" (quantities calculated from the random sample) that will provide information about the parameters.

3. These statistics, which enable us to determine the unknown parameters, are themselves random variables; the distributions of such statistics then enable us to make probability statements about these statistics and hence the unknown parameters.
It turns out that most of the unknown parameters in a pdf representing a population are contained in the mean, μ, and variance, σ², of the pdf in question. Thus, once the mean and variance of a pdf are known, the naturally occurring parameters can then be deduced. For example, if X ~ N(μ, σ²), the mean and the variance are in fact the naturally occurring parameters; for the gamma random variable, X ~ γ(α, β), recall that

$$\mu = \alpha\beta \tag{13.4}$$
$$\sigma^2 = \alpha\beta^2 \tag{13.5}$$

a pair of equations that can be solved simultaneously to yield:

$$\alpha = \mu^2/\sigma^2 \tag{13.6}$$
$$\beta = \sigma^2/\mu \tag{13.7}$$


For the Poisson random variable, X ~ P(λ), μ = σ² = λ, so that the parameter λ is directly determinable from either the mean or the variance (or both).
Thus, it is often sufficient to use statistics that represent the mean and the variance of a population to determine the unknown population parameters. It is therefore customary for sampling theory to concentrate on the sampling distributions of the mean and of the variance. And now, because statistics are functions of random variables, determining sampling distributions requires techniques for obtaining distributions of functions of random variables.
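As an illustration (a sketch added here, not from the text), Eqs (13.6) and (13.7) translate directly into a moment-matching computation:

```python
import statistics

def gamma_params_from_moments(x):
    # recover gamma parameters (alpha, beta) from a sample's mean and
    # variance, using Eqs (13.6) and (13.7)
    mu = statistics.mean(x)
    var = statistics.variance(x)   # sample variance as a stand-in for sigma^2
    alpha = mu ** 2 / var          # Eq (13.6)
    beta = var / mu                # Eq (13.7)
    return alpha, beta

# hypothetical positive-valued sample
x = [10.1, 15.7, 15.0, 5.5, 18.0, 17.9, 12.6, 9.7, 14.3, 4.3]
print(gamma_params_from_moments(x))
```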

13.2 The Distribution of Functions of Random Variables

The general problem of interest may be stated as follows: given the joint pdf for n random variables X₁, X₂, …, Xₙ, find the pdf f_Y(y) for the random variable Y defined as

$$Y = g(X_1, X_2, \ldots, X_n) \tag{13.8}$$

13.2.1 General Overview

This is precisely the problem discussed in some detail in Chapter 6, where various methods of solution were presented. For example, it is shown in Example 6.7 that if X₁ ~ γ(α, 1) and X₂ ~ γ(β, 1), then the functions defined as:

$$Y_1 = X_1 + X_2 \tag{13.9}$$
$$Y_2 = \frac{X_1}{X_1 + X_2} \tag{13.10}$$

have the following distributions: Y₁ ~ γ(α + β, 1) and Y₂ ~ B(α, β). Also, given two Poisson random variables X₁ ~ P(λ₁) and X₂ ~ P(λ₂), a statistic defined as:

$$Y = X_1 + X_2 \tag{13.11}$$

can be shown (using methods of characteristic functions, for example) to possess a Poisson distribution with parameter (λ₁ + λ₂), i.e., Y ~ P(λ₁ + λ₂). These general ideas can be applied specifically to sampling distributions that are of interest in statistical inference.

13.2.2 Some Important Sampling Distribution Results

As we will soon see, many (but not all) classical statistical inference problems involve sampling from distributions that are either exactly Gaussian or approximately so. The following is a collection of some key results regarding sampling from the Gaussian and related distributions.
1. Linear Combination of Gaussian Random Variables: Consider n mutually stochastically independent random variables, X₁, X₂, …, Xₙ, with respective pdfs N(μ₁, σ₁²), N(μ₂, σ₂²), …, N(μₙ, σₙ²); the random variable

$$Y = k_1 X_1 + k_2 X_2 + \cdots + k_n X_n \tag{13.12}$$

where k₁, k₂, …, kₙ are real constants, possesses a Gaussian distribution N(μ_y, σ_y²), where

$$\mu_y = k_1\mu_1 + k_2\mu_2 + \cdots + k_n\mu_n \tag{13.13}$$
$$\sigma_y^2 = k_1^2\sigma_1^2 + k_2^2\sigma_2^2 + \cdots + k_n^2\sigma_n^2 \tag{13.14}$$

These results are straightforward to establish (see Exercise 13.4). In particular, if μ₁ = μ₂ = ⋯ = μₙ = μ, and σ₁² = σ₂² = ⋯ = σₙ² = σ², so that the random variables Xᵢ; i = 1, 2, …, n, are all identically distributed, then:

$$\mu_y = \mu \sum_{i=1}^{n} k_i \tag{13.15}$$
$$\sigma_y^2 = \sigma^2 \sum_{i=1}^{n} k_i^2 \tag{13.16}$$

Furthermore, if kᵢ = 1/n for all i, then

$$\mu_y = \mu \tag{13.17}$$
$$\sigma_y^2 = \sigma^2/n \tag{13.18}$$

Even if the distributions of Xᵢ are not Gaussian, but the means and variances are still μᵢ and σᵢ² respectively, the resulting distribution of Y will be determined by the underlying distributions of Xᵢ, but its mean and variance, μ_y and σ_y², will still be as given in Eqs (13.17) and (13.18). These results are also fairly straightforward to establish (see Exercise 13.5).
2. Sum of Squares of Standard Normal Variables: Consider a random sample of size n from a Gaussian N(μ, σ²) distribution, X₁, X₂, …, Xₙ; the random variable

$$Y = \sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2 \tag{13.19}$$

has a χ²(n) distribution.
3. Sum of Chi-Square Random Variables: Consider n mutually stochastically independent random variables, X₁, X₂, …, Xₙ, with respective pdfs χ²(r₁), χ²(r₂), …, χ²(rₙ); the random variable

$$Y = X_1 + X_2 + \cdots + X_n \tag{13.20}$$

has a χ²(r) distribution with degrees of freedom

$$r = r_1 + r_2 + \cdots + r_n \tag{13.21}$$

These results find significant application in classic statistical inference.
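A quick Monte Carlo check of result 2 (a sketch added here, not from the text; numpy assumed available):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 5.0, 2.0, 8, 100_000

# sum of squares of n standardized Gaussian draws, repeated many times
x = rng.normal(mu, sigma, size=(trials, n))
y = (((x - mu) / sigma) ** 2).sum(axis=1)

# a chi-squared(n) variable has mean n and variance 2n
print(y.mean(), y.var())   # should be close to 8 and 16
```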

13.3 Sampling Distribution of The Mean

The general problem of interest is as follows: given (X₁, X₂, …, Xₙ), a random sample from a distribution with pdf f(x), the statistic defined as

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \tag{13.22}$$

is a random variable whose specific value, the actual sample average, x̄, will vary from one specific random sample to another. What is the theoretical sampling distribution of the random variable, X̄?

As might be expected, the answer to this question depends on what is known about the distribution from which the random sample is drawn. We now discuss the cases most relevant to statistical inference.

13.3.1 Underlying Probability Distribution Known

If we know f(x), the distribution of the population from which the sample is drawn, we can use the techniques discussed in Chapter 6 (and mentioned above) to obtain the sampling distribution of the mean, X̄. The next two examples illustrate this point for the Gaussian pdf and the exponential pdf.
Example 13.1: SAMPLING DISTRIBUTION: MEAN OF RANDOM SAMPLE FROM GAUSSIAN DISTRIBUTION
If X₁, X₂, …, Xₙ is a random sample from a Gaussian distribution with mean μ and variance σ², find the distribution of X̄ defined in Eq (13.22).
Solution:
First, X₁, X₂, …, Xₙ is a random sample from the same Gaussian distribution, whose characteristic function is:

$$\varphi(t) = \exp\left(j\mu t - \frac{1}{2}\sigma^2 t^2\right) \tag{13.23}$$

By virtue of the independence of the random variables, and employing results about the characteristic function of linear sums of independent random variables, the characteristic function of X̄ is obtained as

$$\varphi_{\bar{X}}(t) = \prod_{i=1}^{n} \exp\left(j\frac{\mu}{n}t - \frac{1}{2}\frac{\sigma^2}{n^2}t^2\right) \tag{13.24}$$

This product of n identical exponentials becomes an exponential of the sums of the terms in the winged brackets, and finally simplifies to give:

$$\varphi_{\bar{X}}(t) = \exp\left(j\mu t - \frac{1}{2}\frac{\sigma^2}{n}t^2\right) \tag{13.25}$$

which we recognize immediately as the characteristic function of a Gaussian random variable with mean μ and variance σ²/n. We therefore conclude that, in this case, X̄ ~ N(μ, σ²/n), i.e., that the sampling distribution of the mean of a random sample of size n from a N(μ, σ²) distribution is also a Gaussian distribution with the same mean, but with variance σ²/n.
Example 13.2: SAMPLING DISTRIBUTION: MEAN OF RANDOM SAMPLE FROM EXPONENTIAL DISTRIBUTION
If X₁, X₂, …, Xₙ is a random sample from an exponential distribution, E(β), find the distribution of X̄ defined in Eq (13.22).
Solution:
Again, as with Example 13.1, since (X₁, X₂, …, Xₙ) is a random sample from the same exponential distribution, we begin by recalling the characteristic function for the exponential random variable:

$$\varphi(t) = \frac{1}{(1 - j\beta t)} \tag{13.26}$$

By virtue of the independence of the random variables, and employing results about the characteristic function of linear sums of independent random variables, the characteristic function of X̄ is obtained as

$$\varphi_{\bar{X}}(t) = \prod_{i=1}^{n} \frac{1}{\left(1 - j\frac{\beta}{n}t\right)} = \frac{1}{\left(1 - j\frac{\beta}{n}t\right)^n} \tag{13.27}$$

This is recognizable as the characteristic function of a gamma random variable with parameters α* and β*, where

$$\alpha^* = n \tag{13.28}$$
$$\beta^* = \beta/n \tag{13.29}$$

i.e., X̄ ~ γ(n, β/n). Observe that because the mean value for a γ(α*, β*) random variable is α*β*, and the variance is α*β*², the implication here is that the expected value of X̄ is nβ/n = β; and the variance is β²/n.

Some important points to note from these examples:

1. If μ_X is the expected value, and σ_X² the variance, of the pdf from which the random sample was drawn, these examples show that, for both the Gaussian pdf and the exponential pdf, the expected value and variance of the sample mean, X̄, are given by:

$$\mu_{\bar{X}} = \mu_X \tag{13.30}$$
$$\sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{n} \tag{13.31}$$


2. What is true for these two example random variables is true in general, regardless of the underlying distribution (although we have not proved this formally).

3. The implications are as follows: the expectation of the sample mean is identical to the population mean; and the variance of the sample mean goes to zero as n → ∞. In anticipation of a more detailed discussion in Chapter 14, we note that the sample mean appears to have desirable properties that recommend its use in determining the true value of the unknown population mean.
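These two sampling-distribution results can be checked quickly by simulation (a sketch added here, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 10, 200_000

# Example 13.1: means of Gaussian samples; N(mu, sigma^2/n) expected
mu, sigma = 50.0, 3.0
xbar_g = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
print(xbar_g.mean(), xbar_g.var())   # ~ 50 and ~ 9/10 = 0.9

# Example 13.2: means of exponential samples; gamma(n, beta/n) expected
beta = 2.0
xbar_e = rng.exponential(beta, size=(trials, n)).mean(axis=1)
print(xbar_e.mean(), xbar_e.var())   # ~ beta = 2 and ~ beta^2/n = 0.4
```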

13.3.2 Underlying Probability Distribution Unknown

If the form of the pdf, f(x), underlying a population is unknown, we cannot, in general, determine the full sampling distribution for any random sample drawn from such a population. Nevertheless, the following information is still available, regardless of the underlying pdf:

If the random sample (X₁, X₂, …, Xₙ) comes from a population with mean μ and variance σ², but whose full pdf is unknown, then the sample mean X̄ is a random variable whose mean μ_X̄ and variance σ_X̄² are given by the following expressions:

$$\mu_{\bar{X}} = \mu \tag{13.32}$$

$$\sigma_{\bar{X}}^2 = \begin{cases} \dfrac{\sigma^2}{n}; & \text{for samples from an infinite population} \\[2mm] \dfrac{\sigma^2}{n}\left(\dfrac{N-n}{N-1}\right); & \text{for samples from a population of size } N \end{cases} \tag{13.33}$$

These results are straightforward to establish.
For infinite populations, the standard deviation of the sample mean, σ_X̄, is:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \tag{13.34}$$

known as the standard error of the mean.

13.3.3 Limiting Distribution of the Mean

As shown in the last subsection, if the underlying population pdf is unknown, the full sampling distribution of X̄, the mean of a random sample drawn from this population, cannot be determined; but the mean and variance of the sampling distribution are known. Nevertheless, even though the complete sampling distribution is unknown in general, we know the limiting distribution (as n → ∞) of a closely related random variable, the standardized mean, defined as:

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \tag{13.35}$$


The limiting distribution of Z is given by the following theorem.

The Central Limit Theorem (CLT): Let X̄ be the mean of the random sample (X₁, X₂, …, Xₙ) taken from a population with mean μ and (finite) variance σ². Define the random variable Z according to Eq (13.35); then the pdf of Z, f(z), tends to N(0, 1), a standard normal distribution, in the limit as n → ∞.

Remarks:
1. Regardless of the distribution underlying the original population from which the random sample was drawn, the distribution of the sample mean approaches a normal distribution as n → ∞. In fact, for n as small as 25 or 30, the normal approximation can be quite good.

2. The random variable √n(X̄ − μ)/σ is approximately distributed N(0, 1); it will therefore be possible to employ the standard normal distribution, N(0, 1), to obtain approximate probabilities concerning X̄.

3. If the original population has a N(μ, σ²) distribution, then it can be shown that the random variable Z, the standardized mean, defined in Eq (13.35), has exactly the N(0, 1) distribution.
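A brief numerical illustration of the CLT (a sketch added here, not from the text), using a heavily skewed exponential population:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
beta = 2.0                       # exponential population: mean beta, skewness 2
for n in (5, 30, 200):
    xbar = rng.exponential(beta, size=(100_000, n)).mean(axis=1)
    z = (xbar - beta) / (beta / np.sqrt(n))   # standardized mean, Eq (13.35)
    # skewness decays toward 0, and P(Z < -1) approaches Phi(-1) = 0.159 as n grows
    print(n, round(stats.skew(z), 3), round(np.mean(z < -1.0), 3))
```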
We are now in a position to consider how this result might find application in statistical inference about the mean of a population. Consider a random sample (X₁, X₂, …, Xₙ), from which the sample mean, X̄, is computed: if the population variance is known, then regardless of the underlying population pdf, the standard normal distribution can be used to determine probabilities about X̄ indirectly via the standardized mean. Let us illustrate this point with the following example.

Example 13.3: PROBABILITIES CONCERNING MEAN LIFETIMES OF DIGITAL LIGHT PROCESSING (DLP) PROJECTOR LAMPS
As justification for its high price, an expensive brand of DLP projector lamps has been purported to have an impressively long average lifetime of 5,133 hrs, with a standard deviation of 315 hrs. (1) In preparing to send a trial sample of 40 to a customer, the company manufacturing the lamps wishes to accompany the shipment with a factory specification stating the probability that the mean lifetime for this sample will be between the round numbers 5,100 and 5,200 hrs. Determine this probability. (2) In a follow-up to the shipment, information from end-use purchasers was used to track the actual lifetimes of the 40 DLP projector lamps; the result showed an average lifetime less than 5,000 hours. Compute the probability of obtaining such a low average lifetime
by chance alone, if the lamps truly came from a collection (population) with μ = 5,133 and σ = 315.

FIGURE 13.1: Sampling distribution for mean lifetime of DLP lamps in Example 13.3, used to compute P(5100 < X̄ < 5200) = P(−0.66 < Z < 1.34).
Solution:
(1) The problem requires that we determine P(5100 < X̄ < 5200), but with no specified pdf, we cannot calculate this probability directly. Nevertheless, knowing that the standardized mean, Z, has a N(0, 1) distribution allows us to compute the approximate probability as follows:

$$P(5100 < \bar{X} < 5200) = P\left(\frac{5100 - 5133}{315/\sqrt{40}} < Z < \frac{5200 - 5133}{315/\sqrt{40}}\right) = P(-0.66 < Z < 1.34) = 0.655 \tag{13.36}$$

where the indicated probability has been computed directly from the computer program MINITAB using the cumulative probability calculation option for a Normal distribution (Calc > Prob Dist > Normal), with mean = 0, standard deviation = 1, to obtain F_Z(1.34) = 0.910 and F_Z(−0.66) = 0.255, yielding the indicated result. (See Fig 13.1.) Tables of standard normal probabilities could also be used to obtain this result. Thus there is a 65.5% chance that the actual average lifetime will be between 5,100 and 5,200 hours if the lamps truly came from a population with μ = 5,133 and σ = 315.

(2) Employing the same approximate N(0, 1) distribution for the standardized mean, we are able to compute the required P(X̄ < 5000) as follows:

$$P(\bar{X} < 5000) = P\left(\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} < \frac{\sqrt{40}(5000 - 5133)}{315}\right) = P(Z < -2.67) = 0.004 \tag{13.37}$$
where the probability is obtained directly from MINITAB, again using the cumulative probability calculation option for a Normal distribution (Calc > Prob Dist > Normal), with mean = 0, standard deviation = 1, to obtain F_Z(−2.67) = 0.004 (see Fig 13.2).

FIGURE 13.2: Sampling distribution for average lifetime of DLP lamps in Example 13.3, used to compute P(X̄ < 5000) = P(Z < −2.67).
And now, because the probability of obtaining, by chance alone, a sample of 40 lamps with such a low average lifetime (< 5,000 hours) from a population purported to have a much higher average lifetime is so small, it is very doubtful that this sample came from the postulated population. It appears more likely that the result from this sample is more representative of the true lifetimes of the lamps. If true, then the practical implication is that there is reason to doubt the claim that these DLP projector lamps truly merit the purported "long lifetime" characterization.
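The same two probabilities can be reproduced outside MINITAB; a sketch using scipy.stats (an assumption about available tooling, not the text's tool of choice):

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 5133.0, 315.0, 40
se = sigma / sqrt(n)   # standard error of the mean, Eq (13.34)

p1 = norm.cdf((5200 - mu) / se) - norm.cdf((5100 - mu) / se)   # part (1)
p2 = norm.cdf((5000 - mu) / se)                                # part (2)
print(round(p1, 3), round(p2, 3))
# ~0.657 and 0.004; the text's 0.655 follows from first rounding z to -0.66 and 1.34
```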

This example is a preview of what is ahead in Chapters 14 and 15, hinting at


some of the principles used in formal statistical inference.

13.3.4 σ Unknown

When the underlying population distribution is unknown, the concept of employing a limiting distribution for X̄, as discussed in the previous subsection, works only if the population variance is known; only then can the standardized mean, Z, defined as in Eq (13.35), be a bona-fide statistic. If σ² is unknown, Z is no longer a proper statistic because it will then contain an unknown population parameter. (It is important to realize that in the current context, the sampling distribution of X̄ is with respect to a postulated population mean, μ, that X̄ is supposed to represent; with μ specified, it is no longer an unknown population parameter.)

When σ² is unknown, we have two options: (i) if the sample size, n, is large (say n ≥ 50), S², the sample variance, provides a good approximation to σ²; (ii) when n is small, it seems reasonable to contemplate using the random variable √n(X̄ − μ)/S, which is the standardized mean, Z, with the unknown population standard deviation replaced with the sample standard deviation, S, defined as:

$$S = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}} \tag{13.38}$$

Unfortunately, nothing can be said in general about the distribution of this random variable if the underlying population distribution is unknown and n is small. However, if the sample comes from a population having a Gaussian distribution (the so-called "normal population"), then the following result holds:

and S be, respectively, the mean and standard deviation


Let X
of the random sample (X1 , X2 , . . . , Xn ) of size n drawn from a
population with distribution N (, 2 ); then the statistic
T =

S/ n

(13.39)

has a Students t-distribution with = n 1 degrees of freedom.

Remarks:
1. We encountered this result in Chapter 9 (section 9.3.5) during our discussion of probability models for continuous random variables.
2. This result is somewhat more general than the Central Limit Theorem
because it does not require knowledge of 2 ; conversely, it is less general in that it requires the normality assumption for the underlying
distribution, which the CLT does not require.
3. The result holds exactly for any n: under the normality assumption, the
pdf of T is exactly the t-distribution regardless of sample size n; the
CLT on the other hand prescribes a limiting distribution as n .
4. As noted earlier, the t-distribution approaches the standard normal distribution as (hence, n) .
5. The normality assumption is not too severe, however; when samples

472

Random Phenomena
are from non-normal populations, the distribution of the T statistic is
still fairly close to the Students t-distribution.

Let us illustrate the application of this result with the following example.
Example 13.4: MEAN DIAMETER OF BALL BEARINGS
A manufacturer of low precision 10 mm diameter ball bearings periodically takes samples and measures them to conrm that the manufacturing process is still on target. A random sample of 20 ball bearings
resulted in diameter measurements with an average of x
= 9.86 mm,
and a standard deviation s = 1.01 mm. Postulating that the random
sample X1 , X2 , . . . , X20 is from a Gaussian distribution with = 10,
will fall to either side of
nd the probability that any sample mean, X,
the postulated mean by the observed amount or more, by chance alone.
9.86) or P (X
10.14).
i.e. P (X
Solution:
Had the population variance, 2 , been given, we could have obtained
precisely (as another normal distrithe sampling distribution for X
bution with = 10 and variance 2 /20); these results could have
been used directly to compute the required probabilities. However,
since the population variance is not given, the required probability
9.86) + P (X
10.14) is determined using the T statistic:
P (X

|
n|X
20|9.86 10|

P
S
1.01
P (T 0.62) + P (T 0.62)

0.542

9.86) + P (X
10.14)
P (X

(13.40)

Again, the probabilities are obtained directly from MINITAB using the
cumulative probability computation option for the t-distribution, with
= 19 degrees of freedom to give FT (0.62) = 0.271, and by symmetry,
P (T 0.62) = 0.271 to obtain the result in Eq (13.40). (See Fig 13.3.)
The implication is that, under the postulate that the ball bearings
are truly 10 mm in diameter, there is a fairly substantial 54% chance
that the observed sample average misses the target of 10 mm to the left
(comes in as 9.86 or less) or to the right by the same amount (comes in as
10.14 or more) purely at random. In other words, by pure chance alone,
one would expect to see this sort of observed deviation of the sample
mean from the true postulated (target) mean diameter value of 10 mm,
more than half the time. The inclination therefore is to conclude that
there is no evidence in this sample data that the process is o-target.

Again, as with the previous example, this one also provides a preview of what
is ahead in Chapters 14 and 15.

Sampling

473

0.4
T, df=19

Density

0.3

0.2

0.1
0.271
0.0

0.271

-0.620

0
X

0.62

FIGURE 13.3: Sampling distribution of the mean diameter of ball bearings in Example
10| 0.14) = P (|T | 0.62)
13.4 used to compute P (|X

13.4

Sampling Distribution of the Variance

Similar to the preceding discussion on the distribution of the mean, X,


of a random sample (X1 , X2 , . . . , Xn ), we observe that the sample variance,
dened as,
n
2
(Xi X)
2
(13.41)
S = i=1
(n 1)
is a random variable that is clearly related to the population variance, but
whose specic value, s2 , will vary from sample to sample. To make objective
inference statements about the population variance from this sample quantity,
we also need a theoretical distribution of S 2 . For this we have the following
three results:
1. Sample Variance: Let X1 , X2 , . . . , Xn be a random sample taken from
a population with an arbitrary pdf, f (x), having a mean and variance
2 . The sample variance dened as in Eq (13.41) is itself a random
variable whose expectation is 2 ; i.e.
E[S 2 ] = 2

(13.42)

2. Sampling Distribution of Single Variance: If the random sample


(X1 , X2 , . . . , Xn ) is from a normal population, then the random variable:
C=

(n 1)S 2
2

(13.43)

474

Random Phenomena
has a 2 (n 1) distribution. Such a distribution can be used to compute
probabilities concerning the random variable, S 2 . Unfortunately, nothing
so explicit can be said about sampling distributions of variances for
random samples drawn from non-normal populations.

3. Sampling Distribution of Two Variances: Let S12 and S22 be the


variances of two sets of independent random samples of respective sizes
n1 and n2 , each drawn from two normal populations having the same
variance. Then the random variable:
F =

S12
S22

(13.44)

has the Fisher F (1 , 2 ) distribution, with degrees of freedom 1 = (n1


1) and 2 = (n2 1). Again, for samples from non-normal distributions,
these results do not hold.
As with the sampling distributions of the mean, it has been customary to
compute probabilities from these distributions using 2 (n 1) and F tables.
It is now more common to use computers instead of tables, as illustrated with
the following examples.
Example 13.5: VARIANCE OF BALL BEARINGS DIAMETER MEASUREMENTS
The random sample of 20 ball bearings diameters X1 , X2 , . . . , X20 , postulated in Example 13.4 as being from a Gaussian distribution with
= 10, resulted in an average measured diameter of x
= 9.86 mm, and
a standard deviation s = 1.01 mm. Now consider the case in which the
design standard deviation for the manufacturing process is specied as
= 0.9; compute the probability that any sample standard deviation
will equal or exceed the observed value of S = 1.01 mm if the manufacturing process is still operating true to the original design specications.
Solution:
This problem requires computing the probability P (S 1.01) using the
chi-square statistic, C; i.e.


19(1.01)2
P (S 1.01) = P C
(0.9)2
= P (C 23.93) = 0.199
(13.45)
where the indicated probability was obtained directly from MINITAB
as with all the other earlier examples: the cumulative probability calculation option for a Chi-square distribution (Calc > Prob Dist >
Chi-Square), with degrees of freedom = 19; and input constant 23.93
to obtain P (C 23.93) = 0.801 to yield P (C 23.93) = 1 0.801 =
0.199. (See Fig 13.4.)
Thus, under the postulate that the design standard deviation is 0.9,
the sampling distribution of the variance indicates a fairly high 20%

Sampling

475

0.07
0.06

f(x)

0.05

F df=19

0.04
0.03
0.02
0.01
0.00

0.199
0

23.93

FIGURE 13.4: Sampling distribution for the variance of ball bearing diameters in
Example 13.5 used to compute P (S 1.01) = P (C 23.93)
chance of obtaining, purely at random, sample variances that are 1.01
or higher, even when the process is operating as designed.
Example 13.6: VARIANCE OF BALL BEARINGS DIAMETER MEASUREMENTS: TWO RANDOM SAMPLES
A second random sample of 20 ball bearings taken a month after the
sample examined in Examples 13.3 and 13.4 yielded an average measured diameter of x
2 = 10.03 mm, and a standard deviation s2 = 0.85
mm. Find the probability that the process operation remains essentially
unchanged, in terms of the observed sample standard deviation, even
though the newly observed sample standard deviation is less than the
value observed a month earlier. All assumptions in Example 13.4 hold.
Solution:
In this case, we are concerned with comparing two sample variances,
S12 from the previous month (with the specic value of (1.01)2 ) and S22 ,
the most recent, with specic value (0.85)2 . Since the real question is
not whether one value is greater than the other, but whether the two
sample variances are equal or not, we use the F -distribution with degrees
of freedom (19,19) to obtain the probability that F (1.01)2 /(0.85)2 )
(or, vice-versa, that F (0.85)2 /(1.01)2 ). The required probability is
obtained as:
P (F 1.41) + P (F 0.709) = 0.460
The indicated probabilities are, once again, obtained directly from
MINITAB for the F -distribution with 1 = 2 = 19 degrees of freedom; (See Fig 13.5.) The implication is that there is almost a 50%
chance that the observed dierence between the two sample variances

476

Random Phenomena

1.0

Density

0.8

0.6
F, df1=19, df2=19

0.230

0.4

0.2
0.230
0.0

0.709

1.41

FIGURE 13.5: Sampling distribution for the two variances of ball bearing diameters
in Example 13.6 used to compute P (F 1.41) + P (F 0.709)
will occur purely at random. We conclude, therefore, that there does
not appear to be any evidence that the process operation has changed
in the past month since the last sample was taken.

With these concepts in place, we are now in a position to discuss the two
aspects of statistical inference, beginning with Estimation in the next chapter,
followed by Hypothesis Testing in Chapter 15.

13.5

Summary and Conclusions

Because of its foundational role in inductive statistics, samplingthe study


and characterization of samples drawn from a populationwas considered in
this chapter by itself rst, as a precursor to a formal treatment of inductive
statistics proper, i.e., estimation and hypothesis testing. The entire chapter
was devoted to one problem: how to obtain a probability distribution for the
sample (or more precisely, functions thereof), given the population distribution.
The concept of the statistic is central to sampling, although the full
import will not be realized until the next chapter when we consider the important problem of how to determine unknown population parameters from
sample data. For the purpose of this chapter, it was sucient to introduce the
statistic simply as any function of a random sample that does not contain
unknown population parameters; the rest of the chapter was then devoted to

Sampling

477

determining the distribution of such statistics. In particular, since in inductive statistics (as we shall soon discover in Chapters 14 and 15), the two most
important statistics are the sample mean and sample variance, the distributions of these quantities were characterized under various conditions, giving
rise to resultssome general, others specic only to normal populationsthat
are used extensively in the next two chapters. Of all the general results, the
one used most frequently arises from the Central Limit Theorem, through
which we know that, regardless of the underlying distribution, as the sample
size tends to innity, the sampling distribution of the sample mean tends to
the Gaussian distribution. This collection of results provides the foundation
for the next two chapters.
Here are some of the main points of the chapter again.
Sampling is concerned with the probabilistic characterization of nite
size samples drawn from a population; it is the statistical analog to
the probability problem of characterizing individual observations of a
random variable with a pdf.
The central concepts in sampling are:
The random sample: in principle, n independent random variables
drawn from a population in such a way that each one has an equal
chance of being drawn; the mathematical consequence is that if
f (xi ) is the pdf for the ith random variable, then the joint pdf of
the random sample is a product of the contributing pdfs;
The statistic: a function of one or more random variables that
does not contain an unknown population parameter.
The sampling distribution: the probability distribution of a statistic of interest; its determination is signicantly facilitated if the
statistic is a function of a random sample.
As a consequence of the central limit theorem, we have the general result
that as n , the distribution of the mean of a random sample drawn
from any population with known mean and variance, is Gaussian, greatly
enabling the probabilistic analysis of means of large samples.

REVIEW QUESTIONS
1. What is sampling?
2. As presented in Section 13.1 what are the three central concepts of sampling
theory?

478

Random Phenomena

3. What is a random sample? And what is the mathematical implication of the statement that X1 , X2 , . . . , Xn constitutes a random sample from a distribution that has
a pdf f (x)?
4. What is a statistic?
5. What is a sampling distribution?
6. What is the primary utility of a statistic and its distribution?
7. How is the discussion in Chapter 6 helpful in determining sampling distributions?
8. What is the sampling distribution of a linear combination of n independent Gaussian random variables with identical pdfs?
9. What is the sampling distribution of a sum of n independent 2 (r) random variables with identical pdfs?
is the mean of a random sample of size n from a population with mean X
10. If X
2
and what is V ar(X)?

and variance X
, what is E(X)
11. What is the standard error of the mean?
12. What is the central limit theorem as stated in Section 13.3? What are its implications in sampling theory?
13. In sampling theory, under what circumstances will the t-distribution be used
instead of the standard normal distribution?
14. What is the sampling distribution of the variance of a random sample of size n
drawn from a Gaussian distribution?
15. What is the sampling distribution of the ratio of the variances of two sets
of independent random samples of sizes n1 and n2 each drawn from two normal
populations having the same variance?

EXERCISES
Section 13.1
13.1 Given X1 , X2 , . . . , Xn , a random sample from a population with mean and
variance 2 , both unknown, determine which of the following functions of this random sample is a statistic and which is not.
*
Xi )1/n ;
(i) Y1 = ( n
i=1
n
(ii) Y2 = i=1 (Xi )2 ;

n
(iii) Y3 = n
i=1 i Xi ;
i=1 i = 1;

Sampling

479


Xi
(iv) Y4 = n
i=1 .
If the population mean is specied, how will this change your answer?
13.2 Given X1 , X2 , . . . , Xn , a random sample from a population with mean and
variance 2 , both unknown, dene the following statistic:
n

= 1
X
Xi
n i=1

as the sample mean. Determine which of the following functions of the random
variable are
which are not:
 statisticsand
2
(i) Y1 = n
i=1 (Xi X) /(n 1);
2
(ii) Y2 = n
i=1 (Xi ) /n;
k /(n 1); k > 0;
|X

X|
(iii) Y3 =  n
i
i=1

(iv) Y4 = n
i=1 (Xi X)/.
13.3 For each of the following distributions, given the population mean , and
variance 2 , derive the appropriate expressions for obtaining the actual pdf parameters (, , or n, p) in terms of and 2 : (i) Gamma(, ); (ii) Beta(, ); (iii)
Binomial(n, p).
Section 13.2
13.4 Given n mutually stochastically independent random variables, X1 , X2 , . . . , Xn ,
with respective pdfs N (1 , 12 ), N (2 , 22 ), . . . N (n , n2 ):
(i) Determine the distribution of the statistic:
Y = k1 X1 + k2 X2 + + kn Xn
where k1 , k2 , . . . , kn are real constants; and show that it is a Gaussian distribution,
N (y , y2 ) where
y

k1 1 + k2 2 + + kn n

y2

k12 12 + k22 22 + + kn2 n2

(ii) From this result, show that if 1 = 2 = = n = , and 12 = 22 = = n2 =


2 , so that the random variables Xi ; i = 1, 2, . . . , n are all identically distributed,
then establish the result given in Eqs (13.15) and (13.16), i.e., that
 n


ki
y =

y2

i=1
n



ki2

i=1

13.5 Given n mutually stochastically independent random variables, X1 , X2 , . . . , Xn ,


with identical pdfs that are unspecied except that the mean is , and the variance,
2 ; show that the mean and variance of the statistic dened as
Y =

n
1
Xi
n i=1

480

Random Phenomena

are given by
y

y2

2 /n

and hence establish the results given in Eqs (13.17) and (13.18) in the text.
13.6 Given a random sample X1 , X2 , . . . , Xn , from a Gaussian N (, 2 ) distribution;
show that the random variable
2
n 

Xi
Y =

i=1
has a 2 (n) distribution.
13.7 Given n mutually stochastically independent random variables, X1 , X2 , . . . , Xn ,
with respective pdfs 2 (r1 ), 2 (r2 ), , 2 (rn ); show that the random variable
Y = X1 + X2 + + Xn
has a 2 (r) distribution with degrees of freedom,
r = r1 + r2 + + rn .
13.8 Given a random sample X1 , X2 , . . . , Xn from a Poisson P() distribution,
determine the sampling distribution of the random variable dened as
Y =

n
1
Xi
n i=1

13.9 Given a random sample X1 , X2 , . . . , Xn from a Gamma (, ) distribution,


determine the sampling distribution of the random variable dened as
Y =

n


Xi

i=1

Section 13.3
13.10 Given that X1 , X2 , . . . , Xn constitute a random sample from a population
with mean , and variance 2 , dene two statistics representing the sample mean
as follows:

n
1
Xi
n i=1
n

i=1

i Xi ; with

(13.46)
n


i = 1

(13.47)

i=1

where the rst is a regular mean and the second is a weighted mean. Show that
= and also that E(X)
= ; but that V ar(X)
V ar(X).

E(X)
13.11 If X1 , X2 , . . . , Xn is a random sample from a Poisson P() distribution, nd
the sample mean dened in Eq 14.69. (Hint: See Example 6.1
the distribution of X,

Sampling

481

and V ar(X).

in Chapter 6.) Determine E(X)


13.12 If X1 , X2 , . . . , Xn is a random sample from a Gamma (, ) distribution,
the sample mean dened in Eq 14.69. Determine E(X)

nd the distribution of X

and V ar(X).
13.13 If X1 , X2 , . . . , Xn is a random sample from a Lognormal L(, ) distribution,
(i) Find the distribution of the geometric mean

g =
X

n


1/n
Xi

(13.48)

i=1

g ) and V ar(ln X
g ) .
(ii) Determine E(ln X
13.14 Given a random sample of 10 observations drawn from a Gaussian population
with mean 100, and variance 25, compute the following probabilities regarding the

sample mean, X:
100); (ii) P (X
100); (iii) P (X
104.5); (iv) P (96.9 X
103.1);
(i) P (X
101.6). Will the sample size make a dierence in the distribution
(v) P (98.4 X
used to compute the probabilities?
13.15 Refer to Exercise 13.14. This time, the population variance is not given;
instead, the sample variance was obtained as 24.7 from the 10 observations.
(i) Compute the ve probabilities.
(ii) Recompute the probabilities if the sample size increased to 30 but the sample
variance remained the same.
13.16 A sample of size n is drawn from a large population with mean and variance
2 but unknown distribution;
(i) Determine the mean and variance of the sample mean when n = 10; = 50; 2 =
20;
(ii) Determine the mean and variance of the sample mean when n = 20; = 50; 2 =
20;
(iii) Determine the probability that a sample mean obtained from a sample of size
n = 50 will not deviate from the population mean by more than 3. State any
assumption you may need to make in answering this question.
13.17 A random sample of size n = 50 is taken from a large population with mean
15 and variance 4, but unknown distribution.
(i) What is the standard deviation X of the sample mean?
(ii) If the sample size were reduced by 50%, what will be the new standard deviation
X of the sample mean?
(iii) To reduce the standard deviation to 50% of the value in (i), what sample size
will be needed?
Section 13.4
13.18 The variance of a sample of size n = 20 drawn from a normal population with
mean 100 and variance 10 was obtained as s2 = 9.5.
(i) Determine the probability that S 2 10.

482

Random Phenomena

(ii) Determine the probability that S 2 9.5.


(iii) If the sample size increased by 50% but the computed sample variance remained
the same, recompute the probability S 2 9.5.
(iv) Repeat part (iii) when the sample size is decreased from the original value of
50 by 50%.
13.19 Two samples were taken from the same normal population with mean 100
and variance 10. One sample, of size 20, had a sample variance S12 = 11.2; the other
of size 30, a sample variance of S22 = 9.8.
(i) Determine the following probabilities: P (S12 9.8) and P (S22 11.2).
(ii) Determine the specic variate 20 for this problem such that for n = 20, P (C
20 ) = 0.05 where C is a 2 (n) distributed random variable.
(iii) Determine the specic variate 20 for this problem such that for n = 30,
P (C 20 ) = 0.05 where, as in (ii), C is a 2 (n) distributed random variable.
13.20 Refer to Exercise 13.19. Use an appropriate distribution to determine the
probability of observing a ratio 11.2/9.8, or greater, if the two sample variances are
equal.
13.21 Two samples of equal size n = 51 are drawn from normal populations with the
same variance. One sample variance was S12 = 15; the other, S22 = 12.0. Determine
the following probabilities:
(i) P (S12 /S22 1); and P (S22 /S12 1)
(ii) P (S12 /S22 1.25) and P (S22 /S12 0.8)

APPLICATION PROBLEMS
13.22 The following data set, from a study by Lucas (1985)1 , shows the number of
accidents occurring per quarter (three months) at a DuPont company facility, over
a 10-year period. The data set has been partitioned into two periods: Period I is the
rst ve-year period of the study; Period II, the second ve-year period.

5
4
2
5
6

Period I
5 10 8
5 7
3
8 6
9
6 5 10
3 3 10

3
1
7
1
4

Period II
4 2 0
3 2 2
7 1 4
2 2 1
4 4 4

(i) Why is a Poisson pdf a reasonable model for representing this data?
(ii) Treat the entire 10-year data as a single block and the data shown as specic
observations {xi }40
i=1 of a random sample Xi ; i = 1, 2, . . . , 40, from a single Poisson
population with unknown parameter . Obtain a numerical value for the sample
use it as an estimate of the unknown population parameter, , to produce
mean, X;
1 Lucas

J. M., (1985). Counted Data CUSUMs, Technometrics, 27, 129144

Sampling

483

a complete pdf, f (x|), as a candidate description of the population. From this


complete pdf, compute the probabilities f (x) for x = 1, 2, . . . , 10.
(iii) Using the value estimated for the population parameter in (ii), determine the
(See Exercise 13.11.) From this distriprecise (not approximate) distribution of X.

bution for X, compute P (X 3) and P (X 6).


(iv) Using the candidate population description obtained in (ii), compute P (X 6)
and P (X 3). (Note that these probabilities are with respect to the random variable X itself, not the sample mean.) Are the 20 observations in Period I and those
in Period II consistent with these theoretical probabilities? Comment on what these
results imply about whether or not the postulated population model is plausible.
13.23 Refer to Application Problem 13.22. This time consider each period as
two separate blocks of specic observations {xi }20
i=1 of a random sample Xi ; i =
1, 2, . . . , 20 from two distinct Poisson populations with unknown parameters 1 for
Period I and 2 for Period II. Obtain numerical values for the two sample means,
1 and
2 for population parameters 1 and
2 and use these are estimates
1 and X
X
1 and X
2 and use
2 respectively. Determine the exact sampling distributions for X
these to compute the following probabilities:
1 ) and
1 3|1 =
(i) P (X
2 ).
2 6|2 =
(ii)P (X
Comment on what these results mean in terms of the conjecture that these two
populations may in fact be dierent.
13.24 Refer to Application Problem 13.22 and consider the two propositions that
(i) the data for Period I represent DuPont facility operation when the true mean
number of accident occurrences per quarter is 6; and that (ii) by Period II, the
true mean number of accident occurrences has been reduced by 50%. By computing
appropriate probabilities and carrying out appropriate comparisons, argue for or
against the statement: These two propositions are consistent with the data.
13.25 Refer to Application Problem 13.22 and consider the data from the two periods separately as random samples from two distinct Poisson populations with
parameters = 6 for Period I and = 3 for Period II
2 for Period II,
1 for Period I and X
(i) Using the exact sampling distribution for X
2 4.5)
1 4.5) and P (X
determine P (X
(ii) Use the approximate Gaussian result arising from the central limit theorem to
2 for Period II; recom 1 for Period I and X
obtain the sampling distributions for X
pute the probabilities in (i) and compare the two sets of results. Comment on the
accuracy of the approximate Gaussian sampling distributions in this particular case.
13.26 The waiting time (in days) until the occurrence of a recordable safety incident
in a certain companys manufacturing site is known to be an exponential random
variable with unknown parameter . As part of a safety performance characterization
program, the following data set was obtained
S1 = {16, 1, 9, 34, 63, 44, 1, 63, 41, 29}
(i) Considering this as a specic realization of a random sample of size n = 10,
and obtain an exact
determine a numerical value, x
, for the random sample mean, X,

484

Random Phenomena

sampling distribution for the sampling mean in terms of the unknown population
parameter, , and sample size, n.
(ii) The company claims that its true mean waiting time between safety incidents
is 40 days. From the sampling distribution and the specic value x
obtained in (i),
x
determine P (X
| = 40).
(iii) An independent safety auditor estimates that from the companys records, operating procedures, and other performance characteristics, the true mean waiting time
x
between safety incidents is more likely to be 30 days. Determine P (X
| = 30).
Compare this probability to the one obtained in (ii) and comment on which postulated value for the true mean waiting time is more believable, = 40 as claimed by
the company, or = 30 as claimed by the independent safety auditor.
13.27 Refer to Application Problem 13.26. From the pdf of the exponential E (40)
random variable, determine P (X 10). Recompute P (X 10) using the pdf for
the exponential E (30) random variable. Which of these results is more consistent
with the data set S1 ? Comment on what this implies about the more likely value
for the population parameter.
13.28 Refer to the data in Table 1.1 in Chapter 1, which shows 50 samples of the
random variables YA and YB , yields obtained from each of two competing chemical
processes. Specic values were obtained for the sample means in Chapter 1 as yA =
75.52 and yB = 72.47. Consider the proposition that the YA data is from a normal
population with the distribution N (75.5, 1.52 ).
(i) Determine the sampling distribution for YA and from it compute P (75.0 YA
76.0).
(ii) Consider the proposition that there is no dierence between the yields obtainable
from each process. If this is so, then the YB data should also be from the same normal
population as that specied for YA . Using the sampling distribution obtained in (i),
compute P (YB 72.47). Comment on what this implies about the plausibility of
this proposition.
(iii) Using the sampling distribution in (i), determine the value of 0 for which
P (YA 0 ) = 0.05
Compare this value with the computed value, yB = 72.47. What does this result
imply about the plausibility of the proposition that both data sets come from the
same population with the same distribution?
13.29 Refer to Application Problem 13.28; this time consider an alternative proposition that in fact the YB is from a normal population with the distribution
N (72.5, 2.52 ). Determine the sampling distribution for YB and from it compute
P (YB 72.47) as well as P (72.0 YB 73.0). Comment on what these results
imply about the plausibility of this alternative proposition.
13.30 A manufacturer of 10 mm diameter ball bearings uses a process which, when
operating properly, is calibrated to produce ball bearings with mean diameter =
10.00 mm and standard deviation = 0.10 mm. In order to evaluate the performance
of the process at a particular point in time, a random sample of n ball bearings is to
be taken and the diameters determined in a quality control laboratory. Determine

Sampling

485

the sample size n such that


10.00 + 1.96X ) = 0.05
P (10.00 1.96X X
State whatever assumpwhere X is the standard deviation of the sample mean, X.
tions you may have to make in answering this question.
13.31 Refer to Application Problems 13.30. Consider that the standard practice is
for the quality control lab to select a sample of ns ball bearings, compute the specic
value for the sample mean, x
, and plot it on a chart with the following characteristics:

a center line representing = 10.00; an upper limit line set at (10 + 3/ ns ) and

a lower limit set at (10 3/ ns ). The process is deemed to be performing as


expected if the value obtained for x
falls within the limits.
(i) For a sample of 4 ball bearings, where are the upper and lower limits lines located?
What is the probability of x
falling outside these limits when the process is in fact
operating as expected.
(ii) If a process disturbance shifted the true mean diameter for the manufactured
ball bearings to 10.10 mm, what is the probability of detecting this shift when the
result obtained from the next sample of 4 ball bearings is analyzed?
State any assumptions needed to answer these questions.
13.32 The sample variance for the yield data presented in Chapter 1 and in Application Problem 13.28 is given s2A = 2.05 for process A, and s2B = 7.62 for process
B. Consider the proposition that the YA data is from a normal population with the
distribution N (75.5, 1.52 ).
2
1.52 );
(i) Determine P (SA
(ii) If it is postulated that the YB data is from the same population as the YA data,
2
by as much as,
determine the probability of overestimating the sample variance SB
2
or worse than, the obtained value of sB = 7.62. Interpret your result in relation to
the plausibility of this postulate.
(iii) Consider the alternative postulate that the YB data actually came from a normal population with the distribution N (72.5, 2.52 ). Now recompute the probability
2
by as much as, or worse than, the obtained
of overestimating the sample variance SB
value of s2B = 7.62. Interpret this new result in relation to the plausibility of this
alternative postulate.
13.33 Refer to Application Problem 13.32 where the sample variances are given as
s2A = 2.05 for process A, and s2B = 7.62 for process B. Now consider the postulate
that the two sets of samples are random samples from the same normal population
with the same, but unspecied variance. Determine the probability that a sample
2
for process B will exceed that for process A by as much as the values
variance SB
observed in this specic sample, or more. Comment on the implications of this result
on the plausibility of this proposition.
13.34 Random samples of size 10 each are taken from large groups of trainees instructed by Method A and Method B, and each trainees score on an appropriate
achievement test is shown below.
Method A
Method B

71
72

75
77

65
84

69
78

73
69

66
70

68
77

71
73

74
65

68
75

486

Random Phenomena

Consider the postulate that these data came from the same normal population
with mean = 70 but whose variance is unspecied.
(i) If this is true, what is the probability that the mean of any random sample of
trainee scores will exceed 74? Interpret this result in light of individual sample means
of Method A and Method B scores. How plausible is this postulate?
(ii) Now consider an alternative postulate that scores obtained by trainees instructed
by Method B are actually drawn from a normal population with mean
B = 75.
(75 + 2sB / 10)] and
Determine the limits of the interval [(75 2sB / 10) X
the probability that the mean score of any other random sample of 10 from this
population of trainees instructed by Method B will fall into this interval. Where
A lie in relation to this interval?
does the value obtained for the sample mean X
Discuss the implications of these results on the plausibility of this new postulate.

Chapter 14
Estimation

14.1 Introductory Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


14.1.1 An Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.1.2 Problem Denition and Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.2 Criteria for Selecting Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.2.1 Unbiasedness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.2.2 Eciency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.2.3 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.3 Point Estimation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.3.1 Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.3.2 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Maximum Likelihood Estimate of Gaussian Population Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Important Characteristics of MLEs . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.4 Precision of Point Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.5 Interval Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.5.1 General Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.5.2 Mean of a Normal Population; Known . . . . . . . . . . . . . . . . . . . . . . . . .
14.5.3 Mean of a Normal Population; Unknown . . . . . . . . . . . . . . . . . . . . . . .
14.5.4 Variance of a Normal Population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.5.5 Dierence of Two Normal Populations Means . . . . . . . . . . . . . . . . . . . .
14.5.6 Interval Estimates for Parameters from other Populations . . . . . . .
Means; Large Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Means; Small Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.6 Bayesian Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.6.2 Basic Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.6.3 Bayesian Estimation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.6.4 A Simple Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The Prior Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The Posterior Distribution and Point Estimates . . . . . . . . . . . . . . .
14.6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The Bayesian Controversy: Subjectivity . . . . . . . . . . . . . . . . . . . . . . . .
Recursive Bayesian Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Choosing Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Computational Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.7 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
REVIEW QUESTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
EXERCISES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
APPLICATION PROBLEMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

488
488
489
490
490
491
492
493
493
496
499
501
503
506
506
507
508
509
512
514
514
515
518
518
519
520
521
521
521
522
524
524
525
525
526
527
528
530
537

Life is the art of drawing sucient conclusions


from insucient premises
Samuel Butler (18351902)

487

488

Random Phenomena

With the sampling theory foundation now rmly in place, we are nally in a
position to begin building the two-tiered statistical inference edice, starting
with the rst tier, Estimation, in this chapter, and nishing with Hypothesis
Testing in the next chapter. The focus in the rst half of this chapter is on how
to determine, from incomplete information in sample data, unknown population parameter values needed to complete the characterization of the random
variable with the pdf f (x|). The focus in the second complementary half is
on how to quantify the unavoidable uncertainty arising from the variability
in nite samples. Just as estimation theory relies on sampling theory, so does
the theory of hypothesis testing rely on both estimation theory and sampling
theory; the material in this chapter therefore also serves as an important link
in the statistical inference chain.

14.1
14.1.1

Introductory Concepts
An Illustration

Consider an opinion pollster who states that 75% of undergraduate chemical engineering students in the United States prefer closed-book exams to
opened-book ones, and adds a margin of error of 8.5% to this statement.
It is instructive to begin our discussion by looking into how such information
is obtained and how such statements are really meant to be interpreted.
First, in the strictest possible sense of the formal language of statistics, the
population of concern is the opinion of all undergraduate chemical engineering students in the United States. However, in this case, many oftenbut
somewhat impreciselyconsider the population as the students themselves
(perhaps because of how this aligns with the more prevalent sociological concept of population). Either way, observe that we are dealing with a technically nite population (there is, after all, an actual, nite and countable
number of individuals and their opinions). Practically, however, the size of
this population is quite large and it is dicult (and expensive) to obtain the
opinion of every single individual in this group. The pollster simply contacts
a pre-determined number of subjects selected at random from this group, and
asks for their individual opinions regarding the issue at stake: preference for
closed-book versus opened-book exams. The premise is that there is a true,
but unknown, proportion, c , that prefers closed-book exams; and that results
obtained from a sample of size n can be used to deduce what c is likely to be.
Next, suppose that out of 100 respondents, 75 indicated a preference for closedbook exams with the remaining 25 opting for the only other alternative. The
main conclusion stated above therefore seems intuitively reasonable since
indeed this sample shows 75 out of 100 expressing a preference for closed-book
exams. But we know that sampling a dierent set of 100 students will quite

Estimation

489

likely produce a dierent set of results. This possibility is captured by the


added margin of error 8.5%.
This illustration raises some fundamental questions: Even though intuitive,
on what basis is this analysis of the survey data considered as reasonable?
How was the margin of error determined? In general, how does one determine
unknown population parameters for problems that may not necessarily be as
intuitive as this one? Providing answers to such questions is the objective of
this chapter.

14.1.2

Problem Denition and Key Concepts

Estimation is the process by which information about the value of a population parameter (such as c in the opinion survey above) is extracted from
sample data. Because estimation theory relies heavily on sampling theory, the
samples used to provide population parameter estimates are required to be
random samples drawn from the population of interest. As we show shortly,
this assumption signicantly simplies estimation.
There are two aspects of estimation:
1. Point Estimation: the process for obtaining a single best value for a
population parameter;
2. Interval Estimation: the process by which one obtains a range of values
that will include the true parameter, along with an appended degree of
condence.
Thus, in terms of the opinion poll illustration above, the point estimate of c
is given as c = 0.75. We have introduced the hat notation . to dierentiate
an estimate from the true but unknown parameter). On the other hand, the
interval estimate, will be rendered as c = 0.75 0.085, or 0.665 < c < 0.835,
to which should be appended with 95% condence (even though this latter
appendage is usually missing in statements made for the public press).
The problem at hand may now be formulated as follows: A random variable X has a pdf f (x; ), whose form is known but the parameters it contains,
, are unknown; to be able to analyze X properly, f (x; ) needs to be completely specied, in the sense that the parameter set, , must be determined.
This is done by inferring the value of the parameter vector, , from sample
data, specically, from {x1 , x2 , . . . , xn }, specic values of a random sample,
X1 , X2 , . . . , Xn , drawn from the population with the pdf f (x; ).
The following are four key concepts that are central to estimation theory:
1. Estimator : Any statistic U (X1 , X2 , . . . , Xn ) used for estimating the unknown quantity , or g(), a function thereof;
2. Point Estimate: Actual observed value u(x1 , x2 , . . . , xn ) of the estimator
using specic observations x1 , x2 , . . . , xn ;

490

Random Phenomena

3. Interval Estimator : Two statistics, UL (X1 , X2 , . . . , Xn ) < UR (X1 , X2 , . . . , Xn ),


such that {UL (X1 , X2 , . . . , Xn ), UR (X1 , X2 , . . . , Xn )} represents an interval that will contain the unknown (or g()), with a probability that
can be computed;
4. Interval Estimates: Actual values, uL (x1 , x2 , . . . , xn ) and uR (x1 , x2 , . . . , xn ),
determined for the respective interval estimators from specic observations x1 , x2 , . . . , xn .

14.2

Criteria for Selecting Estimators

By denition, estimators are statistics used to estimate unknown population parameters, , from actual observations. Before answering the question:
how are estimators (and estimates) determined? we wish to consider rst how
to evaluate estimators. In particular, we will be concerned with what makes
a good estimator, and what properties are desirable for estimators.

14.2.1

Unbiasedness

A statistic U (X1 , X2 , . . . , Xn ) is said to be an unbiased estimator of g()


if
E[U (X1 , X2 , . . . , Xn )] = g().

(14.1)

We know intuitively that this is a desirable property; it means, roughly, that


on average, the estimator will produce an accurate estimate of the unknown
parameter.
Example 14.1: UNBIASED ESTIMATORS OF POPULATION MEAN
Given a random sample X1 , X2 , . . . , Xn from a population with an unknown mean, ,
(1) show that the sample average,
n

= 1
Xi ,
X
n i=1

(14.2)

is an unbiased estimator for .


(2) Also show that any other weighted average dened as
=
X

n


i Xi

(14.3)

i=1

is also unbiased for , so long as


n

i=1

i = 1

(14.4)

Estimation

491

0.4

f(x)

0.3

U1

0.2

0.1
U2
0.0

-15

-10

-5

0
X

10

15

FIGURE 14.1: Sampling distribution for the two estimators U1 and U2 : U1 is the more
ecient estimator because of its smaller variance
Solution:
(1) By denition of the expectation, we obtain from Eq (14.2):
=
E[X]

n
1
E[Xi ] = ,
n i=1

(14.5)

is indeed unbiased for .


establishing that X
(2) By the same token, we see that
] =
E[X

n


i E[Xi ] =

i=1

provided that

14.2.2

n
i=1

n


i =

(14.6)

i=1

i = 1 as required.

Eciency

If U1 (X1 , X2 , . . . , Xn ) and U2 (X1 , X2 , . . . , Xn ) are both unbiased estimators for g(), then U1 is said to be a more ecient estimator if
V ar(U1 ) < V ar(U2 )

(14.7)

See Fig 14.1. The concept of eciency roughly translates as follows: because
of uncertainty, estimates produced by U1 and U2 will vary around the true
value; however, estimates produced by U1 , the more ecient estimator will
cluster more tightly around the true value than estimates produced by U2 . To
understand why this implies greater eciency, consider a symmetric interval
of arbitrary width, , around the true value g(). Out of 100 estimates

492

Random Phenomena

produced by each estimator, because of the smaller variance associated with


its sampling distribution, on average, the proportion of the estimates that will
fall into this region will be higher for U1 than for U2 . The higher percentage of
the U1 estimates falling into this region is a measure of the higher estimation
eciency. Thus estimates produced by U1 will, on average, be closer to the
truth in absolute value than a corresponding number of estimates produced
by U2 .
There are cases of practical importance for an unbiased estimator based
on a random sample of size n, (say U (n)) drawn from a population with
pdf f (x; ) where there exists a smallest achievable variance. Under certain
regularity conditions, we have the following result:
2

'

V ar[U (n)]
nE

g()

ln f (x;)

(2 ,

(14.8)

generally known as the Cramer-Rao inequality, with the quantity on the RHS
known as the Cramer-Rao (C-R) lower bound. The practical implication of
this result is that no unbiased estimator U (n) can have variance lower than
the C-R lower bound. An estimator with the minimum variance of all unbiased estimators (whether it achieves the C-R lower bound or not) is called a
Minimum Variance Unbiased Estimator (MVUE).
is more ecient than X
; in fact, it
Of the estimators in Example 14.1, X
is the most ecient of all unbiased estimators of .
can be shown that X

14.2.3

Consistency

By now, we are well aware that samples are nite subsets of populations
from which they are drawn; and, as a result of unavoidable sampling variability, specic estimates obtained from various samples from the same population
will not exactly equal the true values of the unknown population parameters
they are supposed to estimate. Nevertheless, it would be desirable that as the
sample size increases, the resulting estimates will become progressively closer
to the true parameter value, until the two ultimately coincide as the sample
size becomes innite.
Mathematically, a sequence of estimators, Un (X), n = 1, 2, . . ., where n is
the sample size, is said to be a consistent estimator of g() if
lim P (|Un (X) g()| < ) = 1

(14.9)

for every  > 0. According to this denition, a consistent sequence of estimators will produce an estimate suciently close to the true parameter value if
the sample size is large enough.
Recall the use of Chebyshevs inequality in Chapter 8 to establish the
(weak) law of large numbers, specically: that the relative frequency of success (the number of successes observed per n trials) approaches the actual

Estimation

493

probability of success, p, as n , with probability 1. This statement may


now be interpreted as implying that the ratio X/n, where X is the binomial total number of successes observed in n Bernoulli trials, constitutes a
consistent sequence of estimates of the population probability of success, p.
This
n statement can be extended to sample means in general: sample means
i=1 Xi /n constitute a consistent sequence of estimators for the population
mean when the pdf has nite variance (See Exercise 14.10).
We now proceed to discuss various ways by which to obtain point estimators.

14.3

Point Estimation Methods

14.3.1

Method of Moments

If f (x; ) is the pdf of a random variable, X, with unknown parameters ,


then as stated in Chapter 4, mi , the theoretical ith ordinary moment of X,
dened by:
(14.10)
mi = E[X i ]
will be a known function of , say, mi (). The method of moments entails
obtaining from a random sample, X1 , X2 , . . . , Xn , the k th sample moment
1 k
X
n i=1
n

Mk =

(14.11)

for k = 1, 2, . . ., as needed, equating each to the corresponding theoretical


moment, and determining the values of the unknown parameters required to
achieve moment matching. Solving the resulting equations,
Mi = mi ()

(14.12)

for in terms of the random sample X1 , X2 , . . . , Xn , will yield:


= h(X1 , X2 , . . . , Xn );

(14.13)

known as the vector of method of moment estimators. The following examples


illustrate these concepts.
Example 14.2: METHOD OF MOMENT ESTIMATION: EXPONENTIAL DISTRIBUTION
Let X1 , X2 , . . . , Xn be a random sample from the exponential distribution with pdf
1
(14.14)
f (x; ) = ex/

Estimate , the unknown parameter from this random sample using the

494

Random Phenomena
method of moments.
Solution:
Since there is only one parameter to be estimated, only one moment
equation is required. Let us therefore choose the rst moment, which,
by denition, is
(14.15)
m1 = = E[X] =
The sample analog of this theoretical moment is
n

= 1
M1 = X
Xi
n i=1

(14.16)

and upon equating (14.15) and (14.16), we obtain:


n

= 1
Xi
= X
n i=1

(14.17)

where the hat has been introduced to indicate an estimate and distinguish it from its true but unknown value.
Thus, the method of moments estimator for the exponential parameter is:
n
1
Xi
(14.18)
UM M (X1 , X2 , . . . , Xn ) =
n i=1

When specic data sets are obtained, specic estimates of the unknown parameters are obtained from the estimators by substituting the observations
x1 , x2 , . . . , xn for the random variables X1 , X2 , . . . , Xn , as illustrated in the
following examples.
Example 14.3: SPECIFIC METHOD OF MOMENT ESTIMATES: EXPONENTIAL DISTRIBUTION
The waiting time (in days) until the occurrence of a recordable safety
incident in a certain companys manufacturing site is known to be an exponential random variable with an unknown parameter . In an attempt
to improve its safety record, the company embarked on a safety performance characterization program which involves, among other things,
tracking the time in between recordable safety incidents.
(1) During the rst year of the program, the following data set was
obtained:
(14.19)
S1 = {16, 1, 9, 34, 63, 44, 1, 63, 41, 29}
which translates as follows: 16 days elapsed before the rst recordable
event occurred; 1 day thereafter the second event occurred; the third
occurred 9 days after, and the fourth, 34 days after, etc. From this data
record, obtain a method of moments estimate of the parameter , the
mean time between safety incidents.
(2) The data record for the second year of the program is:
S2 = {35, 26, 16, 23, 54, 13, 100, 1, 30, 31}

(14.20)

Obtain the method of moment estimate of the parameter for the

Estimation

495

second-year safety performance.


Solution:
(1) From the results in Example 14.2, especially, Eq (14.17), the required
estimate is obtained as:
= (16 + 1 + 9 + + 29)/10 = 30.1
= X

(14.21)

implying an average of 1 incident per 30 days.


(2) From the second year data set, we obtain
= (35 + 26 + 16 + + 31)/10 = 32.9
= X

(14.22)

At this point, it is not certain whether the dierence between the two
estimates is due primarily to random variability in the sample or not
(the sample size, n = 10 is quite small). Is it possible that a truly
signicant change has occurred in the companys safety performance
and this is being reected in the slight improvement in the average
waiting time to the occurrence of recordable incidents observed in the
second-year data? Hypothesis testing, the second aspect of statistical
inference, provides tools for answering such questions objectively.

When there are additional parameters to estimate, the number of moments


required for the estimation must naturally increase to match the number of
unknown parameters, as we illustrate with the next example.
Example 14.4: METHOD OF MOMENT ESTIMATION:
GAUSSIAN DISTRIBUTION
Let X1 , X2 , . . . , Xn be a random sample from a Gaussian distribution
with unknown mean and variance 2 . Estimate these unknown parameters by the method of moments, using information from this random
sample.
Solution:
Let (1 , 2 ) = (, ). Because there are two unknown parameters, we
need two moment equations to determine them. The theoretical equations for the rst two moments are:
m1
m2

=
=

E[X] =
2

(14.23)
2

E[X ] = +

(14.24)

where this last equation merely recalls the fact that 2 = E[X 2 ]
(E[X])2 . The sample analogs to these theoretical moment equations
are:
M1

n
1

Xi = X
n i=1

(14.25)

M2

n
1 2
X
n i=1 i

(14.26)

496

Random Phenomena
And now, by equating corresponding theoretical and sample moment
equations and solving for the unknown parameters, we obtain, rst for
, the estimator
n
1

Xi = X
(14.27)
U1 =
n i=1
and for , the estimator
+

,
n
, 1
2
2
X
U2 =
(X)
n i=1 i

(14.28)

Once again, given any specic observations, we obtain the actual estimates corresponding to the observations by substituting the data
{x1 , x2 , . . . , xn } into these estimator equations.

Remarks:
1. Method of moments estimators are not unique. In Example 14.2, we
could have used the second moment instead of the rst, so that the
theoretical moment equation would have been
E[X 2 ] = V ar[X] + (E[X])2 = 2 2

(14.29)

As such, upon equating this to the sample moment and solving, we would
have obtained:
+
,
n
, 1 

=X2
(14.30)
2n i=1 i
prescribed in Eq (14.17).
which, in general, will not equal the X
2. Thus, we cannot really talk about the method of moments estimator,
not even a method of moments estimator. Note that we could just as
easily have based this method on moments about the mean instead of
the ordinary moments (about the origin).
3. Nevertheless, these estimators are consistent (under most reasonable
conditions) in the sense that empirical moments converge (in probability) to the corresponding theoretical moments.

14.3.2

Maximum Likelihood

The method of maximum likelihood for estimating unknown population


parameters is best illustrated with a specic example rst, before generalizing.
Consider a random sample, X1 , X2 , . . . , Xn , drawn from a population possessing a Poisson distribution with unknown parameter, . By denition of a

Estimation

497

random sample, the n random variables, Xi ; i = 1, 2, . . . , n are all mutually


stochastically independent; they also all have the same pdf:
f (xi |) =

e xi
xi !

(14.31)

As such, the joint distribution of the random sample is:


 x1   x2   xn 
e
e
e
f (x1 , x2 , . . . , xn |) =

x1 !
x2 !
xn !

(14.32)

which simplies to
n

en i=1 xi
f (x1 , x2 , . . . , xn |) =
x1 !x2 ! xn !

(14.33)

We may now note the following points about Eq (14.33):


1. If we know the parameter , then from Eq (14.33) we can obtain the
probability that X1 = x1 , X2 = x2 , , Xn = xn jointly for various
specic values of x1 , x2 , . . . , xn . But is unknown, so that whatever
probability is calculated from this equation will be a function of ;
2. For any particular value of , one obtains a specic value for the probability that X1 = x1 , X2 = x2 , , Xn = xn jointly, which corresponds to this specied value of . Thus, once the actual observations
(x1 , x2 , . . . , xn ) are made, and the observed values introduced into Eq
(14.33), the resulting expression becomes a function of the unknown
parameter, ; it is then written as:
n

L() =

en i=1 xi
x1 !x2 ! xn !

(14.34)

and called the likelihood function of the sample.


3. Eq (14.34) is called the likelihood function because it represents the likelihood of observing the specific outcome (x_1, x_2, . . . , x_n) for different values of the parameter λ. This is to distinguish it from the joint pdf, which gives the probability of observing (x_1, x_2, . . . , x_n) when the value of λ is given and fixed.

Thus, with the joint pdf of a collection of random variables, X_1, X_2, . . . , X_n (i.e. f(x_1, x_2, . . . , x_n|θ)), the pdf parameter vector, θ, is given, but not the specific observations x_1, x_2, . . . , x_n; with the likelihood function, the specific observations are given, but the parameter vector is not.

Let us return to the Poisson random sample example and pose the following question: how likely is it that a population with a specific parameter λ produced the particular observed sample (x_1, x_2, . . . , x_n)? Note that this is the very question that the likelihood function L(λ) in Eq (14.34) provides


answers to, for all conceivable values of λ. It now seems entirely reasonable to seek a value of λ (say λ̂) that maximizes the likelihood of observing the values (x_1, x_2, . . . , x_n), and use it as an estimate of the unknown population parameter. Such an estimate is known as the maximum likelihood estimate. The interpretation is that of all possible values one could entertain for the unknown parameter λ, this particular value, λ̂, yields the highest possible probability of obtaining the observations (x_1, x_2, . . . , x_n) when λ̂ is specified as the population parameter. Thus, λ̂ maximizes the likelihood of observing (x_1, x_2, . . . , x_n) in the sense that the population with the parameter λ = λ̂ is most likely to have produced the specific observations (x_1, x_2, . . . , x_n) (or, equivalently: the specific observations at hand, (x_1, x_2, . . . , x_n), are most likely to have come from a population for which λ = λ̂).


Determining λ̂ from Eq (14.34) requires satisfying the familiar differential calculus condition:

∂L/∂λ = 0    (14.35)

However, the algebra involved is considerably simplified by the fact that L(λ) and ℓ(λ) = ln{L(λ)}, the so-called log-likelihood function, have the same maximum. Thus, λ̂ is often determined by maximizing the log-likelihood function instead. In this specific case, we obtain, from Eq (14.34),

ℓ(λ) = ln{L(λ)} = -nλ + (ln λ) Σ_{i=1}^{n} x_i - Σ_{i=1}^{n} ln x_i!    (14.36)

Differentiating with respect to λ and setting the result equal to zero yields:

∂ℓ(λ)/∂λ = -n + (Σ_{i=1}^{n} x_i)/λ = 0    (14.37)

which gives the final solution:

λ̂ = (1/n) Σ_{i=1}^{n} x_i    (14.38)

a reassuringly familiar result since, by definition, λ is the mean of the Poisson pdf.

The maximum likelihood estimate of the Poisson parameter λ is therefore the value λ̂ shown in Eq (14.38); the maximum likelihood estimator for λ, in terms of the random sample X_1, X_2, . . . , X_n, is therefore:

U(X_1, X_2, . . . , X_n) = (1/n) Σ_{i=1}^{n} X_i    (14.39)
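Before generalizing, it is worth confirming this result numerically. The sketch below (the Poisson observations are hypothetical, and the grid search is purely illustrative) evaluates the log-likelihood of Eq (14.36) over a range of λ values and confirms that it peaks at the sample mean:

```python
import numpy as np
from scipy.special import gammaln

# Hypothetical Poisson observations.
x = np.array([2, 0, 3, 1, 2, 4, 1, 0, 2, 3])
n = len(x)

def log_likelihood(lam):
    # Eq (14.36): -n*lambda + (ln lambda)*sum(x_i) - sum(ln x_i!)
    return -n * lam + np.log(lam) * x.sum() - gammaln(x + 1).sum()

# Crude grid search over lambda; the peak should sit at the sample mean.
grid = np.linspace(0.01, 6.0, 10_000)
lam_mle = grid[np.argmax([log_likelihood(g) for g in grid])]

print(lam_mle, x.mean())  # the two agree to within the grid resolution
```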

The foregoing result may now be generalized as follows:


Maximum Likelihood Estimate: Given X_1, X_2, . . . , X_n, a random sample from a population whose pdf (continuous or discrete), f(x; θ), contains a vector of unknown characteristic parameters, θ, the likelihood function for this sample is given by:

L(θ) = f(x_1; θ) f(x_2; θ) ··· f(x_n; θ)    (14.40)

the joint pdf of X_1, X_2, . . . , X_n written as a function of the unknown θ. The value θ̂ that maximizes L(θ) is known as the maximum likelihood estimate (MLE) of θ. (The same value θ̂ maximizes ℓ(θ) = ln{L(θ)}.)

This general result is now illustrated below with several specific examples.

Example 14.5: MAXIMUM LIKELIHOOD ESTIMATE OF AN EXPONENTIAL DISTRIBUTION PARAMETER
Let X_1, X_2, . . . , X_n be a random sample from an exponential population with pdf

f(x; β) = (1/β) e^{-x/β}    (14.41)

Obtain the maximum likelihood estimate of the unknown population parameter, β, from this random sample.

Solution:
Since each random variable in the random sample possesses the pdf in Eq (14.41), the likelihood function in this case is:

L(β) = [(1/β) e^{-x_1/β}] [(1/β) e^{-x_2/β}] ··· [(1/β) e^{-x_n/β}] = (1/β^n) e^{-(Σ_{i=1}^{n} x_i)/β}    (14.42)

From here, we easily obtain:

ℓ(β) = ln L(β) = -n ln β - (1/β) Σ_{i=1}^{n} x_i    (14.43)

so that

∂ℓ(β)/∂β = -n/β + (1/β²) Σ_{i=1}^{n} x_i = 0    (14.44)

which, for β ≠ 0, is solved to yield the desired maximum likelihood estimate as

β̂ = (1/n) Σ_{i=1}^{n} x_i    (14.45)

another reassuringly familiar result, identical to what was obtained earlier in Example 14.2.


Maximum Likelihood Estimate of Gaussian Population Parameters

When X_1, X_2, . . . , X_n is a random sample from a Gaussian (normal) population with pdf

f(x_i; μ, σ) = (1/(σ√(2π))) exp{ -(x_i - μ)² / (2σ²) }    (14.46)

the corresponding likelihood function is:

L(μ, σ) = Π_{i=1}^{n} (1/(σ√(2π))) exp{ -(x_i - μ)² / (2σ²) }
        = (1/(2π))^{n/2} (1/σ^n) exp{ -Σ_{i=1}^{n}(x_i - μ)² / (2σ²) }    (14.47)

The log-likelihood function will therefore be:

ℓ(μ, σ) = ln L(μ, σ) = -(n/2) ln 2π - n ln σ - Σ_{i=1}^{n}(x_i - μ)² / (2σ²)    (14.48)

To determine the maximum likelihood estimates of the unknown parameters μ and σ requires taking appropriate derivatives in Eq (14.48), equating to zero, and solving for the unknowns; i.e.,

∂ℓ/∂μ = [2 Σ_{i=1}^{n}(x_i - μ)] / (2σ²) = 0  ⟹  Σ_{i=1}^{n} x_i - nμ = 0    (14.49)

and

∂ℓ/∂σ = -n/σ + [Σ_{i=1}^{n}(x_i - μ)²] / σ³ = 0    (14.50)

These two equations must now be solved simultaneously. First, from Eq (14.49), we obtain:

μ̂ = (1/n) Σ_{i=1}^{n} x_i    (14.51)

which, when introduced into Eq (14.50) and simplified, gives the second and final result:

σ̂ = √[ Σ_{i=1}^{n}(x_i - μ̂)² / n ]    (14.52)
Observe that the MLE of μ in Eq (14.51) is the same as the sample mean, but the MLE of σ in Eq (14.52) is not the same as the sample standard deviation. For large n, of course, the difference becomes negligible, but this illustrates an important point: because the square of the sample standard deviation

s = √[ Σ_{i=1}^{n}(x_i - x̄)² / (n - 1) ]    (14.53)
(where x̄ is the sample mean) is unbiased for σ², this implies that the MLE for σ² (which uses the divisor n rather than n - 1) is biased.
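A small simulation makes this bias concrete. In the sketch below (the sample size, distribution parameters, and replicate count are arbitrary choices for illustration), small normal samples are drawn repeatedly, and the average of the MLE variance estimate (divisor n) is compared with the average of s² (divisor n - 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 100_000  # small samples, many replicates
true_sigma = 2.0      # so the true variance is 4.0

samples = rng.normal(loc=0.0, scale=true_sigma, size=(reps, n))
mle_var = samples.var(axis=1, ddof=0)       # divisor n: the squared MLE of Eq (14.52)
unbiased_var = samples.var(axis=1, ddof=1)  # divisor n - 1: s^2 of Eq (14.53)

print(mle_var.mean())       # approx 3.2 = [(n-1)/n] * 4.0, i.e. biased low
print(unbiased_var.mean())  # approx 4.0, i.e. unbiased
```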
Important Characteristics of MLEs
Of the many characteristics of maximum likelihood estimates, the following are the two most important we wish to highlight:

1. Invariance Property: If θ̂ is the MLE of the population parameter vector θ, then g(θ̂) is also the MLE of g(θ).

2. Asymptotic Properties: As n → ∞, the MLE approaches minimum variance and is unbiased. Thus, the MLE is asymptotically unbiased, asymptotically efficient, and consistent.

Thus, according to the first property, from the MLE of the standard deviation σ in Eq (14.52), we immediately know that the MLE of the variance will be given by

σ̂² = Σ_{i=1}^{n}(x_i - μ̂)² / n    (14.54)

The second property makes large-sample MLEs very attractive.
The following are a few more examples of MLEs.
Example 14.6: MLE OF A BINOMIAL/BERNOULLI PROBABILITY OF SUCCESS
Let X_1, X_2, . . . , X_n be the result obtained from n Bernoulli trials where the probability of success is p, i.e.

X_i = 1, with probability p;  X_i = 0, with probability q = 1 - p    (14.55)

If the random variable, X, is the total number of successes in n trials, X = Σ_i X_i, obtain a MLE for p.

Solution:
There are several ways of approaching this problem. The first approach is direct, from the perspective of the Bernoulli random variable; the second makes use of the fact that the pdf of a sum of n Bernoulli random variables is a Binomial random variable.

To approach this directly, we recall from Chapter 8 the compact pdf for the Bernoulli random variable,

f(x) = p^{I_S} (1 - p)^{I_F}    (14.56)

where the success indicator, I_S, is defined as:

I_S = 1, for X = 1;  I_S = 0, for X = 0    (14.57)

and its complement, the failure indicator, I_F:

I_F = 1, for X = 0;  I_F = 0, for X = 1    (14.58)

The joint pdf for the random sample X_1, X_2, . . . , X_n is therefore given by

f(x_1, x_2, . . . , x_n) = p^{Σ_{i=1}^{n} I_{S_i}} (1 - p)^{Σ_{i=1}^{n} I_{F_i}}    (14.59)

and now, because I_{S_i} = X_i and I_{F_i} = 1 - X_i, Eq (14.59) reduces to

f(x_1, x_2, . . . , x_n) = p^{Σ_{i=1}^{n} X_i} (1 - p)^{n - Σ_{i=1}^{n} X_i}    (14.60)

so that the likelihood function is:

L(p) = p^{Σ_{i=1}^{n} X_i} (1 - p)^{n - Σ_{i=1}^{n} X_i}    (14.61)

with the log-likelihood function,

ℓ(p) = (Σ_{i=1}^{n} X_i) ln p + (n - Σ_{i=1}^{n} X_i) ln(1 - p)    (14.62)

From here, the usual differential calculus exercise results in:

∂ℓ/∂p = (Σ_{i=1}^{n} X_i)/p - (n - Σ_{i=1}^{n} X_i)/(1 - p) = 0

i.e.,

(Σ_{i=1}^{n} X_i)/p = (n - Σ_{i=1}^{n} X_i)/(1 - p)    (14.63)

which, when solved for p, yields the result:

p̂ = (Σ_{i=1}^{n} X_i)/n    (14.64)

as one would intuitively expect.


The second approach uses not the joint pdf of X_1, X_2, . . . , X_n, but the related pdf of a function of X_1, X_2, . . . , X_n, i.e. X = X_1 + X_2 + · · · + X_n, which is known to be a Binomial random variable; i.e.

f(x) = \binom{n}{X} p^X (1 - p)^{n-X}    (14.65)

so that the likelihood in this case (for X = Σ_{i=1}^{n} X_i) is:

L(p) = \binom{n}{X} p^X (1 - p)^{n-X}    (14.66)

It is a relatively simple exercise left to the reader (see Exercise 14.19) to establish that this function is maximized when

p̂ = X/n    (14.67)

exactly the same result as in Eq (14.64), since X = Σ_{i=1}^{n} X_i. Thus, the MLE for p is the total number of successes in n trials divided by the total number of trials, as one would expect.


The next example is a specific application of this general result.

Example 14.7: MLE OF OPINION SURVEY PARAMETER
In the opinion survey illustration used to open this chapter, out of 100 students surveyed by the opinion pollster, 75 indicated a preference for "closed-book" exams. If one considers the sampling of each student to constitute a Bernoulli trial in which the outcome "preference for closed-book exams" is nominally considered a "success," and if the student population is such that the true fraction with a preference for closed-book exams is θ_c, find the MLE of θ_c from the survey result.

Solution:
This is clearly a case in which X, the total number of respondents showing a preference for closed-book exams, is a Binomial random variable, in this case with n = 100 and p = θ_c unknown. With the observation x = 75, we therefore easily obtain, from the results of Example 14.6, that

p̂ = θ̂_c = 0.75    (14.68)

14.4 Precision of Point Estimates

We have shown that, given a random sample X_1, X_2, . . . , X_n, regardless of the underlying population distribution from which the sample came, the estimators

X̄ = (1/n) Σ_{i=1}^{n} X_i    (14.69)

and

S² = Σ_{i=1}^{n}(X_i - X̄)² / (n - 1)    (14.70)

are both unbiased for the population mean, μ, and variance, σ², respectively. They are thus both accurate. For any specific set of observations, (x_1, x_2, . . . , x_n), the computed

x̄ = (1/n) Σ_{i=1}^{n} x_i    (14.71)

in general will not be identically equal to μ; neither will

s² = Σ_{i=1}^{n}(x_i - x̄)² / (n - 1)    (14.72)

be identical to σ², in general. The question of concern now is this: how close to μ do we expect x̄ to be, and how close s² to σ²? To answer this question, we need sampling


FIGURE 14.2: Two-sided tail area probabilities of α/2 for the standard normal sampling distribution

distributions that will enable us to make appropriate probabilistic statements about the proximity of estimates to unknown parameter values.

We therefore begin by recalling that, for large n, Z = (X̄ - μ)/(σ/√n) has a distribution that is approximately N(0, 1), with the implication that:

P( -z_{α/2} < (X̄ - μ)/(σ/√n) < z_{α/2} ) = (1 - α)    (14.73)

where z_{α/2} is that value of z with a tail area probability of α/2, as illustrated in Fig 14.2. We may therefore state that, with probability (1 - α), the proximity of X̄ to μ is characterized by:

|X̄ - μ| / (σ/√n) < z_{α/2}    (14.74)

For the specific case where α is chosen as 0.05, z_{α/2}, the value of the standard normal random variable, z, for which the tail area probability is 0.025, is 1.96. As a result,

|X̄ - μ| / (σ/√n) < 1.96    (14.75)

implying that

|X̄ - μ| < 1.96σ/√n    (14.76)

The implication of this result is that, given σ, we can state that X̄ will not be farther away from the true value, μ, by more than 1.96σ/√n, with probability 0.95. Alternatively, Eq (14.76) may be restated as:

μ = X̄ ± 1.96σ/√n    (14.77)


indicating that, with 95% confidence, the true value, μ, will lie in the interval that stretches 1.96σ/√n to the right and to the left of the sample average. We shall return to this statement shortly, but for now we note that the estimate becomes more precise with larger sample size n; and in the limit as n → ∞, the estimate coincides precisely with the true value.

When σ is unknown, and the sample size n is small, we simply replace σ with the estimate S, and replace z with the t-distribution equivalent in Eq (14.74) to obtain:

|X̄ - μ| / (S/√n) < t_{α/2}(n - 1)    (14.78)

with probability (1 - α), where n - 1 represents the degrees of freedom associated with the t-distribution.

For the variance, the question "how close is S² to σ²" (or, equivalently, "how close is S to σ") is answered by appealing to the result that, when sampling from a normal population, (n - 1)S²/σ² has a χ²(n - 1) distribution. This theoretical sampling distribution may then be used to make probability statements about the closeness of S² to σ². Unfortunately, not much can be said in general when the sampling is from non-normal populations.

For binomial proportions, the question "how close is p̂ = X/n to p" is answered in the same manner as discussed above for the mean and the variance, provided the sample size n is large. Since the variance of the binomial random variable is σ_X² = np(1 - p), so that σ_p̂² = p(1 - p)/n, we may use the Central Limit Theorem to infer that [(p̂ - p)/σ_p̂] ~ N(0, 1), and hence use the standard normal approximation to the sampling distribution of p̂ to make probability statements about the proximity of p̂ = X/n to p. Thus, for example,

|p̂ - p| / σ_p̂ < z_{α/2}    (14.79)

with probability (1 - α), as the next example illustrates.


Example 14.8: PRECISION OF OPINION SURVEY RESULT
In Example 14.7, the MLE of p, the true proportion of college students with a preference for closed-book exams, was estimated as 0.75 from the opinion survey result of 100 students. How precise is this estimate?

Solution:
As was the case in Example 14.7, on the assumption that X, the total number of respondents with a preference for closed-book exams, is a Binomial random variable, then σ_X² = np(1 - p). And for p̂ = X/n,

Var(p̂) = σ_X²/n² = p(1 - p)/n    (14.80)

so that σ_p̂, the standard deviation for p̂, is:

σ_p̂ = √[ p(1 - p)/n ]    (14.81)

with p estimated as 0.75. Assuming that n = 100 is sufficiently large so that the standard normal approximation for {(p̂ - p)/σ_p̂} holds in this case, we obtain immediately from Eq (14.77) that, with probability 0.95,

θ_c = p = 0.75 ± 1.96√(0.75 × 0.25)/10 = 0.75 ± 0.085    (14.82)

This is how the survey's margin of error quoted at the beginning of the chapter was determined.
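The computation above is easy to reproduce; the following sketch simply evaluates Eq (14.81) and the ±1.96σ_p̂ bound for the survey numbers:

```python
import numpy as np

n, x = 100, 75
p_hat = x / n                          # MLE from Example 14.7
se = np.sqrt(p_hat * (1 - p_hat) / n)  # sigma_p-hat, Eq (14.81)

print(p_hat, 1.96 * se)  # 0.75 and approximately 0.085
```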

We may now observe that in adding a measure of precision to point estimates, the net result has appeared in the form of an interval within which the true parameter is expected to lie, with a certain pre-specified probability.
This motivates the concept of interval estimation.

14.5 Interval Estimates

14.5.1 General Principles

Primarily because they are based on incomplete population information from samples (samples which are themselves subject to variability), point estimates, we now know, will never coincide identically with the theoretical parameters being estimated. In the previous section, we dealt with this problem by seeking to quantify the precision of the point estimate. We did this by determining, in a probabilistic sense, how close θ̂, the estimate, is to θ, the true but unknown value. The results we obtained for X̄ as an estimator of μ began as a probability statement in Eq (14.73), whose argument, when rearranged as follows:

X̄ - z_{α/2}(σ/√n) < μ < X̄ + z_{α/2}(σ/√n)    (14.83)

gives the interval within which we expect the true value μ to lie, with probability (1 - α). This provides a different way of estimating μ: an approach that combines, in one self-contained statement, the estimate and a probabilistic measure of its precision; it is called an interval estimate.

There are, therefore, two main aspects of an interval estimate:

1. The boundaries of the interval; and

2. The associated probability (usually termed the "degree of confidence") that the specified random interval will contain the unknown parameter.

The interval estimators are the two statistics U_L and U_R used to determine the left and right boundaries respectively; the sampling distribution of the point

Estimation

507

estimator is used to obtain the appropriate interval estimates that correspond to the pre-specified probability, (1 - α), the desired degree of confidence. The result is then typically known as the (1 - α) × 100% confidence interval. Since, as discussed in Chapter 13, the nature of sampling distributions depends on what is known about the underlying population, the same is true for methods for obtaining interval estimates, and for the very same reasons.

14.5.2 Mean of a Normal Population; σ Known

If X_1, X_2, . . . , X_n is a random sample from a normal population, then we know that X̄, the sample average, is a good point estimator that enjoys many desirable properties. We also know that

Z = (X̄ - μ)/(σ/√n)    (14.84)

has a standard normal, N(0, 1), distribution. From this sampling distribution for the statistic X̄, we now obtain the following probability statement:

P( -z_{α/2} < (X̄ - μ)/(σ/√n) < z_{α/2} ) = (1 - α)    (14.85)

as we did earlier in Eq (14.73); this converts to the interval shown in Eq (14.83), implying finally that the interval [X̄ ± z_{α/2}(σ/√n)] contains μ with probability (1 - α). Specifically for α = 0.05, the commonly used default value, z_{α/2} = 1.96, so that the resulting interval,

CI = X̄ ± 1.96(σ/√n)    (14.86)

is therefore the 95% confidence interval for μ, the mean of a normal population estimated from a random sample of size n.

The general procedure for obtaining interval estimates for the mean of a normal population is therefore as follows:

1. Determine the point estimator (the sample average) and its distribution (a standard normal for the normalized average, which will involve σ, the population standard deviation, assumed known);

2. Determine the end points of an interval that will contain the unknown mean μ, with specified probability (typically (1 - α), with α = 0.05).

The following example illustrates this procedure.


Example 14.9: INTERVAL ESTIMATES FOR MEANS OF

508

Random Phenomena
PROCESS YIELDS
Given that the results of a series of 50 experiments performed on the chemical processes discussed in Chapter 1 constitute random samples from the respective populations for the yields, Y_A and Y_B, assume that these are two normal populations and obtain 95% confidence interval estimates for the population means μ_A and μ_B, given the respective population standard deviations as σ_A = 1.5 and σ_B = 2.5.

Solution:
From the supplied data, we obtain the sample averages as:

ȳ_A = 75.52;  ȳ_B = 72.47    (14.87)

and given the standard deviations and n = 50, we obtain the following interval estimates:

μ_A = 75.52 ± 1.96(1.5/√50) = 75.52 ± 0.42    (14.88)

and

μ_B = 72.47 ± 1.96(2.5/√50) = 72.47 ± 0.69    (14.89)

The implication is that, with 95% confidence, the mean yield for each process is characterized as follows: for process A, 75.10 < μ_A < 75.94; and for process B, 71.78 < μ_B < 73.16. As a preview of upcoming discussions, note that these two 95% confidence intervals do not overlap.
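A direct computation of Eq (14.86) for this example might look as follows (a sketch; the helper function name is ours):

```python
import numpy as np

def z_interval(xbar, sigma, n, z_half_alpha=1.96):
    """95% CI for a normal mean with known sigma, per Eq (14.86)."""
    half_width = z_half_alpha * sigma / np.sqrt(n)
    return xbar - half_width, xbar + half_width

print(z_interval(75.52, 1.5, 50))  # process A: approx (75.10, 75.94)
print(z_interval(72.47, 2.5, 50))  # process B: approx (71.78, 73.16)
```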

14.5.3 Mean of a Normal Population; σ Unknown

When the population standard deviation σ is unknown, the point estimate of course remains unchanged as X̄, but now we must introduce the standard deviation estimator,

S = √[ Σ_{i=1}^{n}(X_i - X̄)² / (n - 1) ]    (14.90)

for the unknown standard deviation. And from Chapter 13, we know that the sampling distribution of the statistic

T = (X̄ - μ)/(S/√n)    (14.91)

is the t-distribution with ν = n - 1 degrees of freedom. As such, we know that:

P( -t_{α/2}(n - 1) < (X̄ - μ)/(S/√n) < t_{α/2}(n - 1) ) = (1 - α),    (14.92)

which is the t-distribution analog to Eq (14.73). As usual, t_{α/2}(n - 1) is the value of the t random variable that yields a tail area probability of α/2 for a t-distribution with n - 1 degrees of freedom. Thus, when the standard deviation,


σ, is unknown, the required (1 - α) × 100% confidence interval for a normal population mean, μ, is:

X̄ - t_{α/2}(n - 1)(S/√n) < μ < X̄ + t_{α/2}(n - 1)(S/√n)    (14.93)
Example 14.10: INTERVAL ESTIMATES FOR MEANS OF PROCESS YIELDS: UNKNOWN POPULATION VARIANCES
Repeat the problem in Example 14.9 and obtain 95% confidence interval estimates for the population means μ_A and μ_B, still assuming that the data came from normal populations, but with unknown standard deviations.

Solution:
As obtained in the earlier example, the sample averages remain:

ȳ_A = 75.52;  ȳ_B = 72.47    (14.94)

From the data, we also obtain the sample standard deviations as

s_A = 1.43;  s_B = 2.76    (14.95)

With n = 50, so that ν = 49, we obtain from MINITAB

t_{0.025}(49) = 2.01    (14.96)

(This is obtained using the inverse cumulative probability feature: (Calc > Prob Distr. > t > Inverse Cum Prob), entering 49 for the degrees of freedom, and 0.025 as the desired tail area. This returns the result that P(T < -2.01) = 0.025, which, by symmetry, implies that P(T > 2.01) = 0.025, so that t_{0.025}(49) = 2.01.)

We now easily obtain the following interval estimates:

μ_A = 75.52 ± 2.01(1.43/√50) = 75.52 ± 0.41    (14.97)

and

μ_B = 72.47 ± 2.01(2.76/√50) = 72.47 ± 0.78    (14.98)

The 95% confidence intervals for the mean yield of each process are now as follows: for Process A, 75.11 < μ_A < 75.93; and for Process B, 71.69 < μ_B < 73.25.

Note that these interval estimates are really not that different from those obtained in Example 14.9; in fact, the estimates for μ_A are virtually identical. There are two reasons for this: first, and foremost, the sample estimates of the population standard deviation, s_A = 1.43 and s_B = 2.76, are fairly close to the corresponding population values σ_A = 1.5 and σ_B = 2.5; second, the sample size n = 50 is sufficiently large that the difference between the t-distribution and the standard normal is almost negligible (observe that z_{0.025} = 1.96 is only 2.5% less than t_{0.025}(49) = 2.01).

Again, note that these two 95% confidence intervals also do not overlap.
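The same computation can be carried out with any statistics library in place of MINITAB; the sketch below uses scipy's inverse cumulative distribution function (the ppf method) for the t-distribution to reproduce the numbers in this example:

```python
import numpy as np
from scipy import stats

def t_interval(xbar, s, n, alpha=0.05):
    """(1 - alpha)*100% CI for a normal mean, sigma unknown, per Eq (14.93)."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # t_{alpha/2}(n - 1)
    half_width = t_crit * s / np.sqrt(n)
    return xbar - half_width, xbar + half_width

print(stats.t.ppf(0.975, df=49))    # approx 2.01, as in Eq (14.96)
print(t_interval(75.52, 1.43, 50))  # process A: approx (75.11, 75.93)
print(t_interval(72.47, 2.76, 50))  # process B: approx (71.69, 73.25)
```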


14.5.4 Variance of a Normal Population

Obtaining interval estimates for the variance of a normal population follows the same principles outlined above: obtain the sampling distribution of an appropriate statistic (the estimator) and use it to make probabilistic statements about an interval that is expected to contain the unknown parameter. In the case of the population variance, the estimator is:

S² = Σ_{i=1}^{n}(X_i - X̄)² / (n - 1)    (14.99)

and the sampling distribution is obtained from the fact that

(n - 1)S²/σ² ~ χ²(n - 1)    (14.100)

when sampling from a normal population.

From here, we are now able to make the following probability statement:

P( χ²_{1-α/2}(n - 1) < (n - 1)S²/σ² < χ²_{α/2}(n - 1) ) = (1 - α),    (14.101)

where, because of the asymmetry of the χ² distribution, the left boundary is the number χ²_{1-α/2}(n - 1), the value of the χ² random variable for which the area to its right under the χ²-distribution with n - 1 degrees of freedom is (1 - α/2), so that the tail area to the left will be the desired α/2; on the right boundary, the value is χ²_{α/2}(n - 1). Fig 14.3 shows such a sampling distribution with 9 degrees of freedom, along with the left and right boundary values for α = 0.05. As a result of the asymmetry, χ²_{1-α/2}(n - 1) = 2.7 while χ²_{α/2}(n - 1) = 19.0 in this case.

The expression in Eq (14.101), when rearranged carefully, yields the result for the (1 - α) × 100% confidence interval for the population variance, σ², as:

(n - 1)S²/χ²_{α/2}(n - 1) < σ² < (n - 1)S²/χ²_{1-α/2}(n - 1)    (14.102)

The corresponding confidence interval on the standard deviation is obtained by taking square roots to yield:

S√[(n - 1)/χ²_{α/2}(n - 1)] < σ < S√[(n - 1)/χ²_{1-α/2}(n - 1)]    (14.103)
Example 14.11: INTERVAL ESTIMATES FOR VARIANCES OF PROCESS YIELDS
Obtain the 95% confidence interval estimates for the population variances σ_A² and σ_B² for the process yield data sets discussed in Chapter 1 and in Example 14.9. Assume, as before, that the data came from normal populations. How do the respective population variances and


FIGURE 14.3: Two-sided tail area probabilities of α/2 = 0.025 for a Chi-squared distribution with 9 degrees of freedom (boundary values 2.70 and 19.0)
standard deviations specified in Example 14.9 fit into these estimated intervals?

Solution:
First, we recall the sample standard deviations computed in Example 14.10 as s_A = 1.43 and s_B = 2.76; with α specified as 0.05, we obtain from MINITAB (again using the inverse cumulative probability feature, this time for the χ² distribution, with ν = 49) that:

χ²_{0.975}(49) = 31.6;  χ²_{0.025}(49) = 70.2    (14.104)

From here, using Eq (14.102), the following interval estimates for the variances are obtained:

1.43 < σ_A² < 3.18    (14.105)

5.33 < σ_B² < 11.85    (14.106)

or, upon taking square roots, the interval estimates for the standard deviations are:

1.20 < σ_A < 1.78    (14.107)

2.31 < σ_B < 3.44    (14.108)

Note that the population standard deviations, specified earlier in Example 14.9 as σ_A = 1.5 and σ_B = 2.5, fall entirely within their respective estimated intervals (the variances σ_A² = 2.25 and σ_B² = 6.25 also fall within the respective estimated intervals for the variances).
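Again, scipy's ppf can stand in for MINITAB's inverse cumulative probability feature; the sketch below reproduces Eqs (14.104)-(14.108):

```python
import numpy as np
from scipy import stats

def variance_interval(s, n, alpha=0.05):
    """(1 - alpha)*100% CI for a normal variance, per Eq (14.102)."""
    df = n - 1
    lower = df * s**2 / stats.chi2.ppf(1 - alpha / 2, df)
    upper = df * s**2 / stats.chi2.ppf(alpha / 2, df)
    return lower, upper

print(stats.chi2.ppf(0.025, 49), stats.chi2.ppf(0.975, 49))  # 31.6, 70.2
print(variance_interval(1.43, 50))           # approx (1.43, 3.18)
print(np.sqrt(variance_interval(2.76, 50)))  # approx (2.31, 3.44)
```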

The results discussed above for interval estimates of single means and variances from normal populations have implications for hypothesis testing, as we

show in the next chapter. They can be extended to interval estimates of the differences between the means of two normal populations, as we will now do; this also has implications for hypothesis testing.

14.5.5 Difference of Two Normal Population Means

Consider X_1, X_2, . . . , X_n, a random sample from a normal population with the distribution N(μ_X, σ_X²), independent of Y_1, Y_2, . . . , Y_m, another random sample from a different normal population with the distribution N(μ_Y, σ_Y²), where the sample sizes n and m need not be equal (i.e. n ≠ m is allowed). We are now concerned with determining

μ_{X-Y} = μ_X - μ_Y    (14.109)

the difference between the two population means, by obtaining a point estimate along with a confidence interval.

If X̄ is the MLE for μ_X, and Ȳ is the MLE for μ_Y, then it is straightforward to show (see Exercise 14.27) that D̄, defined as:

D̄ = X̄ - Ȳ    (14.110)

is the MLE for μ_{X-Y}; it is also unbiased. And now, to obtain the interval estimate for D̄, we need its sampling distribution, which is obtained as follows: we know from previous results that X̄ ~ N(μ_X, σ_X²/n) and Ȳ ~ N(μ_Y, σ_Y²/m); and from results about distributions of sums of Gaussian random variables, we now obtain that D̄ ~ N(μ_{X-Y}, v²), where:

μ_{X-Y} = μ_X - μ_Y    (14.111)

v² = σ_X²/n + σ_Y²/m    (14.112)

The latter equation arises from the fact that

Var(D̄) = Var(X̄) + Var(Ȳ)    (14.113)

by independence. And now, if σ_X² and σ_Y² are known (so that v² is also known), then observe that the statistic (D̄ - μ_{X-Y})/v has a standard normal distribution. Thus

[(X̄ - Ȳ) - (μ_X - μ_Y)] / √(σ_X²/n + σ_Y²/m) ~ N(0, 1)    (14.114)

from which we obtain the probability statement:

P( -z_{α/2} < [(X̄ - Ȳ) - (μ_X - μ_Y)] / √(σ_X²/n + σ_Y²/m) < z_{α/2} ) = 1 - α    (14.115)
(14.115)


so that the (1 - α) × 100% confidence interval for μ_{X-Y} is given as:

μ_{X-Y} = (X̄ - Ȳ) ± z_{α/2} √(σ_X²/n + σ_Y²/m)    (14.116)

The next example illustrates this result.

Example 14.12: INTERVAL ESTIMATE FOR THE DIFFERENCE BETWEEN TWO PROCESS YIELD MEANS
Obtain a 95% confidence interval estimate for the difference between the population means μ_A and μ_B for the process yield data in Example 14.9, given the respective population standard deviations as σ_A = 1.5 and σ_B = 2.5.

Solution:
Since ȳ_A = 75.52 and ȳ_B = 72.47, so that d̄ = ȳ_A - ȳ_B = 3.05, the desired 95% confidence interval is obtained from Eq (14.116) as

μ_{A-B} = 3.05 ± 1.96 √[(2.25 + 6.25)/50] = 3.05 ± 0.81    (14.117)

Thus, with 95% confidence, we expect 2.24 < μ_{A-B} < 3.86.
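A direct evaluation of Eq (14.116) for these numbers (a sketch; the function name is ours):

```python
import numpy as np

def diff_means_interval(xbar, ybar, var_x, var_y, n, m, z_half_alpha=1.96):
    """95% CI for mu_X - mu_Y with known variances, per Eq (14.116)."""
    d = xbar - ybar
    half_width = z_half_alpha * np.sqrt(var_x / n + var_y / m)
    return d - half_width, d + half_width

print(diff_means_interval(75.52, 72.47, 1.5**2, 2.5**2, 50, 50))
# approx (2.24, 3.86): the interval lies entirely to the right of 0
```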

The result of this example foreshadows part of the upcoming discussion in the next chapter on hypothesis testing. For now, we simply note the most obvious implication: it is highly unlikely that the mean of the yield obtainable from process A is the same as that from process B; in fact, the evidence seems to support the postulate that the mean yield obtainable from process A is greater than that from process B, by as little as 2.24 and possibly by as much as 3.86.

This example also sheds light in general on how we can use the interval estimate of the difference between two means to assess the equality of two normal population means:

1. If the interval estimate for μ_X - μ_Y contains the number 0, the implication is that μ_X and μ_Y are very likely equal;

2. If the interval estimate for μ_X - μ_Y lies entirely to the right of 0, the implication is that, very likely, μ_X > μ_Y; and, finally,

3. If the interval estimate for μ_X - μ_Y lies entirely to the left of 0, the implication is that, very likely, μ_X < μ_Y.
When the population variances, σ_X² and σ_Y², are unknown, things become quite complicated in general, especially when n ≠ m. Under these circumstances, it is customary to use

μ_{X-Y} = (X̄ - Ȳ) ± z_{α/2} √(S_X²/n + S_Y²/m)    (14.118)

as an approximate (1 - α) × 100% confidence interval for μ_{X-Y}, where the variances in Eq (14.116) have been replaced by the sample equivalents.


When σ_X² and σ_Y² are unknown but equal to σ², we can use the t-distribution, as we have done previously, to obtain

μ_{X-Y} = (X̄ - Ȳ) ± t_{α/2}(ν) S_p √(1/n + 1/m)    (14.119)

as the (1 - α) × 100% confidence interval for μ_{X-Y}, where ν, the degrees of freedom, is defined as:

ν = n + m - 2    (14.120)

and S_p, known as the "pooled" standard deviation, is obtained from the expression

(n + m - 2) S_p² = (n - 1) S_X² + (m - 1) S_Y²    (14.121)

or,

S_p = √{ [(n - 1) S_X² + (m - 1) S_Y²] / (n + m - 2) }    (14.122)

the positive square root of a weighted average of the two sample variances.
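The following sketch implements Eqs (14.119)-(14.122); applying it to the sample statistics of Example 14.10 (purely for illustration, since the related Example 14.12 assumed known variances) gives a comparable interval:

```python
import numpy as np
from scipy import stats

def pooled_t_interval(xbar, ybar, s_x, s_y, n, m, alpha=0.05):
    """CI for mu_X - mu_Y assuming equal variances, Eqs (14.119)-(14.122)."""
    nu = n + m - 2                                             # Eq (14.120)
    s_p = np.sqrt(((n - 1) * s_x**2 + (m - 1) * s_y**2) / nu)  # Eq (14.122)
    half_width = stats.t.ppf(1 - alpha / 2, nu) * s_p * np.sqrt(1/n + 1/m)
    d = xbar - ybar
    return d - half_width, d + half_width

print(pooled_t_interval(75.52, 72.47, 1.43, 2.76, 50, 50))  # approx (2.18, 3.92)
```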

14.5.6 Interval Estimates for Parameters from Other Populations

While the most readily available results for interval estimates are for samples from Gaussian populations, it is still possible to obtain interval estimates for parameters from non-Gaussian populations. One simply needs to remember that the key to interval estimation is the sampling distribution of the estimator. If we are able to obtain the appropriate sampling distribution, it can be used to make the sort of probabilistic statements on which interval estimates are based.

Means; Large Samples
Fortunately, when sample sizes are large, it is possible to invoke the Central Limit Theorem to determine that, regardless of the underlying distribution (Gaussian or not), (X̄ - μ)/(σ/√n) possesses an approximate N(0, 1) distribution, with the approximation improving as n → ∞. Furthermore, even if σ is unknown (as is usually the case in most problems of practical relevance), the large sample size makes it acceptable to approximate σ with S, the sample standard deviation. Thus, no new results are required under these circumstances. The following example illustrates this point.
Example 14.13: INTERVAL ESTIMATE FOR MEAN OF INCLUSIONS DATA
The number of "inclusions" found on glass sheets produced in the manufacturing process discussed in Chapter 1 has been identified as a Poisson random variable with parameter λ. If the data in Table 1.2 is considered


a random sample of 60 observations, obtain a 95% confidence interval for the parameter λ.

Solution:
The data is from a Poisson population, not from a Gaussian one; but the sample size is large. Having determined, earlier in this chapter, that the sample mean is the MLE for the Poisson parameter, we conclude from the supplied data that λ̂ = x̄ = 1.02; also, the sample standard deviation is s = 1.1. A sample size of 60 is typically considered large enough (n > 50) for the standard normal approximation to the distribution of (X̄ - μ)/(σ/√n) to be valid in this case. The immediate implication is that the desired 95% confidence interval is:

λ = 1.02 ± 1.96(1.1/√60) = 1.02 ± 0.28    (14.123)

so that, with 95% confidence, we can expect the true mean number of inclusions found on the glass sheets made in this manufacturing site to be characterized as: 0.74 < λ < 1.30.

In the same way, by virtue of the large-sample normal approximation in Eq (14.79), the (1 - α) × 100% interval estimate for the binomial proportion p is obtained as

p̂ - z_{α/2}√(p̂q̂/n) < p < p̂ + z_{α/2}√(p̂q̂/n)    (14.124)

where the sample estimates of the variance have been introduced for the unknown population variance.
Means; Small Samples
When sample sizes are small, the standard normal approximations are typically unjustifiable. Under these circumstances, one must obtain the appropriate sampling distribution for the particular problem at hand, and then use it to determine the interval estimate. Let us illustrate this concept for samples drawn from an exponential population, using the data of Example 14.3.

Example 14.14: INTERVAL ESTIMATES FOR EXPONENTIAL DISTRIBUTION MEANS
From the data on waiting time (in days) until the occurrence of a recordable safety incident in a certain company's manufacturing site given in Example 14.3, obtain a 95% confidence interval for this exponential random variable's unknown parameter β, first for the first-year data set S1 and then for the second-year data set S2. Compare these interval estimates.

Solution:
The point estimate for the first-year data was obtained from the sample average in Example 14.3 as 30.1 (this is also the MLE). Now, if X_i ~ E(β), then from results presented in Chapter 6, or more specifically, from Eq (9.39) in Chapter 9, we know that the random variable

FIGURE 14.4: Sampling distribution, Gamma(10, 0.1), with two-sided tail area probabilities of 0.025 for X̄/β, based on a sample of size n = 10 from an exponential population (boundary values 0.480 and 1.71)
defined as

X̄ = (1/n) Σ_{i=1}^{n} X_i    (14.125)

has the Gamma(n, β/n) distribution. However, note that this pdf depends on the unknown parameter β and can therefore not be used, as is, to make probabilistic statements. On the other hand, by scaling X̄ with β, we see that

(1/β) X̄ ~ Gamma(n, 1/n)    (14.126)

a pdf that now depends only on the sample size, n. (This is directly analogous to the t-distribution, which depends only on the degrees of freedom, (n - 1).)

And now, for the specific case with n = 10, we obtain from the Gamma(10, 0.1) distribution the following:

P( 0.48 < X̄/β < 1.71 ) = 0.95    (14.127)

(see Fig 14.4), where the values for the interval boundaries are obtained from MINITAB using the inverse cumulative probability feature.

For the specific case of the first-year data with x̄ = 30.1, the expression in Eq (14.127) may then be rearranged to yield the 95% confidence interval:

17.6 < β_1 < 62.71    (14.128)

and, for the second-year data, with x̄ = 32.9,

19.24 < β_2 < 68.54    (14.129)


FIGURE 14.5: Sampling distribution, Gamma(100, 0.01), with two-sided tail area probabilities of 0.025 for X̄/β, based on a larger sample of size n = 100 from an exponential population (boundary values 0.814 and 1.21)
First, note the asymmetry in these intervals in relation to the respective values of the point estimates, x̄: this should not come as a surprise, since the Gamma distribution is skewed to the right. Next, observe that these intervals are quite wide; this is primarily due to the relatively small sample size of 10. Finally, observe that the two intervals overlap considerably, suggesting that the two estimates may not be different at all; the difference of 2.8 from year 1 to year 2 is more likely due to random variation than to any actual systemic improvement.

To use this last example to illustrate the applicability of the standard normal approximation for large sample sizes, let us consider that the sample averages x̄_1 = 30.1 and x̄_2 = 32.9 were actually obtained from a sample of size 100. Under these conditions, the sampling distribution for X̄/β will be Gamma(100, 0.01), and the values of the random variable that will yield two-sided tail area probabilities of 0.025 are obtained from MINITAB as 0.814 and 1.21 (see Fig 14.5). For the first-year data with x̄ = 30.1, we are then able to obtain the 95% confidence interval:

X̄/1.21 < β_1 < X̄/0.814  ⟹  24.88 < β_1 < 36.98    (14.130)

and, similarly, for the second-year data, with x̄ = 32.9,

27.19 < β_2 < 40.42    (14.131)

Not surprisingly, these intervals are considerably tighter than the corresponding ones obtained for the smaller sample size n = 10.


We now wish to compare these intervals with ones obtained by invoking the approximate N(0, 1) distribution for means computed from large samples. For this, we need sample standard deviations, which we have not had any use for until now; these are obtained from the data sets S1 and S2 in Example 14.3 as s_1 = 23.17 and s_2 = 27.51. If we assume that these are reasonably close approximations to the true population standard deviations (which would be the case had we actually used a sample size of 100), then we obtain the approximate 95% confidence intervals as follows:

β_1 = 30.1 ± 1.96(23.17/√100) = 30.1 ± 4.54    (14.132)

β_2 = 32.9 ± 1.96(27.51/√100) = 32.9 ± 5.39    (14.133)

which, when written in the same form as in Eqs (14.130) and (14.131), yields

25.56 < β_1 < 34.64    (14.134)

27.51 < β_2 < 38.29    (14.135)

We see that the approximation is in fact quite good.

14.6 Bayesian Estimation

14.6.1 Background

In all our discussion until now, the unknown population parameters, θ, have been considered as fixed, deterministic constants whose values we have sought to determine solely on the basis of information contained in sample data drawn from the population in question. No a-priori knowledge or information about θ is considered or assumed. And the results obtained from the estimation techniques presented thus far have either been the best single point values for each parameter (point estimates), or else appropriate intervals that we expect to contain each parameter with a pre-specified probability, typically 0.95 (interval estimates).

There is an alternative, fundamentally different approach, with the following defining characteristics:

1. The unknown parameters are considered as random variables, Θ, with θ considered as a specific value of the random vector whose pdf is f(θ);

2. Instead of providing point estimates along with a probabilistic interval, the objective is to obtain the full pdf, from which all sorts of probabilistic statements can be made;


3. Any available prior information about the random vector Θ and its pdf can and should be used in conjunction with sample data in providing parameter estimates.

This approach is known as Bayesian Estimation. Its basis is the fundamental relationship between joint, conditional and marginal pdfs, which we may recall from Chapter 4 as:

f(x|y) = f(x, y)/f(y)    (14.136)

from which one obtains the following important result,

f(x, y) = f(x|y)f(y) = f(y|x)f(x)    (14.137)

that is used to "reverse" conditional probabilities, since upon rearrangement Eq (14.137) becomes:

f(y|x) = f(x|y)f(y)/f(x)    (14.138)

This expression is known as Bayes' Theorem; it is attributed to the Revd Thomas Bayes (1702–1761), a Presbyterian minister and something of an amateur mathematician in the original sense of the word "amateur." The theorem that bears his name appeared in his now-famous, posthumously-published paper; but the subject of that paper was in fact just a special case of the more general version later proved by Laplace (1749–1827).

14.6.2 Basic Concept

Consider a random sample X_1, X_2, . . . , X_n from a population with pdf f(x; θ) and unknown parameters θ; we know that the joint pdf is given by:

f(x_1, x_2, . . . , x_n|θ) = f(x_1; θ)f(x_2; θ) ··· f(x_n; θ)    (14.139)

This is the conditional pdf of the data conditioned on θ; for any given value of θ, this expression provides the probability of jointly observing the data {x_1, x_2, . . . , x_n} in the discrete case. For the continuous case, it is the density function to be used in computing the appropriate probabilities. (Recall that we earlier referred to this same expression as the likelihood function L(θ) in Eq (14.40).)

Now, if Θ is considered a random variable for which θ is just one possible realization, then in trying to determine θ, what we desire is the conditional probability of Θ given the data, i.e. f(θ|x_1, x_2, . . . , x_n), the "reverse" of Eq (14.139). This is obtained by invoking Bayes' Theorem:

f(θ|x_1, x_2, . . . , x_n) = f(x_1, x_2, . . . , x_n|θ)f(θ) / f(x_1, x_2, . . . , x_n)    (14.140)


where:

f(x_1, x_2, . . . , x_n|θ) is the "sampling distribution";
f(θ) is the prior distribution of θ; and
f(x_1, x_2, . . . , x_n) is the marginal distribution of the data.

f(θ) is the marginal distribution of θ without considering the data, a distribution defined a-priori, before acquiring data (and independent of the current data set). As a result of the convention of referring to f(θ) as the "prior" distribution for θ, f(θ|x_1, x_2, . . . , x_n) is referred to as the "posterior" (or a-posteriori) distribution of θ, because it is obtained after acquiring data (i.e. conditioned upon the observed data).

Now, because x_1, x_2, . . . , x_n constitutes an observed data set with known values, any function of such known quantities is itself known. As such, f(x_1, x_2, . . . , x_n), regardless of its actual functional form, is a known constant once the observation x_1, x_2, . . . , x_n is given. Thus, we may rewrite Eq (14.140) as

f(θ|x_1, x_2, . . . , x_n) = C f(θ) f(x_1, x_2, . . . , x_n|θ)    (14.141)

with the three terms identified, respectively, as the POSTERIOR, the PRIOR, and the SAMPLING distribution. Thus, through Eq (14.141), the posterior pdf of θ combines prior information about θ, available as f(θ), with information from sample data, available as f(x_1, x_2, . . . , x_n|θ) (more compactly, f(x|θ)).

14.6.3 Bayesian Estimation Results

The primary result of Bayesian estimation is f(θ|x), the posterior pdf for θ conditioned upon the observed data vector x; no point or interval estimates are given directly. However, since f(θ|x) is a full pdf, both point and interval estimates are easily obtained from it. For example, the mean, median, mode, or, for that matter, any reasonable quantile of f(θ|x) can be used as a point estimate; and any interval, q_{1-α/2} < θ < q_{α/2}, encompassing an area of probability (1 - α), can be used as an interval estimator. In particular:

1. The mean of the posterior pdf,

E[Θ|X_1, X_2, . . . , X_n] = ∫ θ f(θ|x_1, x_2, . . . , x_n) dθ    (14.142)

is called the Bayes estimator.

2. The mode of f(θ|x_1, x_2, . . . , x_n) is called the maximum a-posteriori (MAP) estimator.
The typical procedure for carrying out Bayesian estimation may now be summarized as follows. Given a random variable, X, with pdf f(x; θ):


1. Begin by specifying a prior distribution, f(θ), a summary of prior knowledge about the unknown parameters θ;

2. Obtain sample data in the form of a random sample X_1, X_2, . . . , X_n, and hence the sampling distribution (the joint pdf for these n independent random variables with identical pdfs f(x_i; θ));

3. From Eq (14.141), obtain the posterior pdf, f(θ|x_1, x_2, . . . , x_n); if needed, determine C such that f(θ|x_1, x_2, . . . , x_n) is a true pdf, i.e.

1/C = ∫ f(θ) f(x_1, x_2, . . . , x_n|θ) dθ    (14.143)

4. If point estimates are required, obtain these from the posterior distribution.

14.6.4 A Simple Illustration

We consider the problem of determining the Binomial/Bernoulli probability of success parameter, p.

Data
Upon performing the required experiment, suppose that the result (after n trials) is X_i, i = 1, 2, . . . , n, a random sample of Bernoulli variables for which, as usual, X_i = 1 for "success" and 0 for "failure." Suppose that the total number of successes is determined to be X (i.e. X = Σ_{i=1}^{n} X_i). We know that the sampling distribution of X is the binomial pdf:

f(x|θ) = \binom{n}{X} θ^X (1 - θ)^{n-X}    (14.144)

(We could also treat this as a random sample from a Bernoulli population; the resulting joint pdf will be the same as in Eq (14.144), up to the multiplicative constant \binom{n}{X}.)
The Prior Distribution
The signature feature of Bayesian estimation is the prior distribution and its usage. There are several ways to decide on the prior distribution for θ, the Binomial/Bernoulli parameter, and we will consider two.

CASE I: We know that θ can only take values between 0 and 1, restricting its range to the region [0,1]. If there is no additional a-priori information available about the parameter, then we could invoke the maximum entropy result of Chapter 10 to determine that the "best" prior distribution under these circumstances is the uniform distribution on the unit interval, i.e.:

f(θ) = 1;  0 < θ < 1    (14.145)


CASE II: In addition to the known range, there is also prior information, for example, from a similar process for which, on average, the probability of success is 0.4, with a variability captured by a variance of 1/25. Under these circumstances, again the maximum entropy results of Chapter 10 suggest a beta distribution, with parameters (α, β) determined from the prescribed mean and variance. From the expressions for the mean and variance of the Beta(α, β) random variable given in Chapter 9, we are able to solve the two equations simultaneously to obtain

α = 2;  β = 3    (14.146)

the prescribed prior pdf is therefore

f(θ) = 12θ(1 - θ)²    (14.147)

The Posterior Distribution and Point Estimates

We are now in a position to obtain posterior distributions for each case. For CASE I, the posterior distribution is obtained from Eq (14.141) as:

f(θ|x) = C · 1 · \binom{n}{x} θ^x (1 - θ)^{n-x}    (14.148)

where the specific observation X = x has been introduced. To complete the determination of the posterior pdf, we need to determine the constant C. There are several ways to do this: by integration, as noted earlier (see Eq (14.143)), or "by inspection," by which we mean that, as a function of θ, Eq (14.148) looks like a Beta pdf, a fact that can be exploited as follows. From the exponents of θ and of (1 - θ), this posterior pdf looks like the pdf of a Beta(α, β) random variable with α - 1 = x and β - 1 = n - x, implying that

α = x + 1;  β = n - x + 1    (14.149)

This being the case, the multiplying constants in Eq (14.148) must therefore be Γ(α + β)/[Γ(α)Γ(β)]. And because n and x are both integers in this problem, we are able to use the factorial representation of the Gamma function (see Eq (9.25)) to obtain the complete posterior pdf as:

f(θ|x) = [(n + 1)!/(x!(n - x)!)] θ^x (1 - θ)^{n-x}    (14.150)

It is left as an exercise to the reader to establish that, upon completing the integration in

C ∫₀¹ \binom{n}{x} θ^x (1 - θ)^{n-x} dθ = 1    (14.151)


and solving for C, the result is

C = n + 1    (14.152)

so that

C \binom{n}{x} = (n + 1)!/(x!(n - x)!)    (14.153)

as implied in Eq (14.150).

We may now choose to leave the result as the pdf in Eq (14.150) and use it to make probabilistic statements about the parameter; alternatively, we can determine point estimates from it. For example, the Bayes estimate, θ̂_B, defined as E[Θ|X], is obtained from Eq (14.150) as:

θ̂_B = (x + 1)/(n + 2)    (14.154)

where the result is obtained immediately by virtue of the posterior pdf being a Beta distribution with parameters given in Eq (14.149); or else (the hard way!) by computing the required expectation via direct integration. On the other hand, the MAP estimate, θ̂, is obtained by finding the maximum (mode) of the posterior pdf; from Eq (14.150) one obtains this as:

θ̂ = x/n    (14.155)

the same as the MLE.


For the sake of completeness, let us now suppose that after performing an actual series of experiments on a sample of size n = 10, one obtains 3 successes; this specific experimental result generates the following point estimates for the Binomial/Bernoulli probability of success:

θ̂_B = 4/12 = 0.33;  θ̂ = 3/10 = 0.3    (14.156)

Thus, using a uniform prior distribution for f(θ) produces a Bayes estimate of 1/3, compared to a MAP estimate of 0.3 (which coincides with the standard MLE estimate).

Note that in CASE I, the prior distribution is somewhat "uninformative" and non-subjective, in the sense that it showed no preference for any value of θ a-priori. Note that since x/n is known to be an unbiased estimate for θ, then θ̂_B in Eq (14.154) is biased. However, it can be shown (see Exercises 14.8, 14.34 and 14.33) that the variance of θ̂_B is always less than that of the unbiased MLE, θ̂ = x/n. Thus, the Bayes estimate may be biased, but it is more efficient.

CASE II is different. The possible values of θ are not assigned equal a-priori probability; the a-priori probability specified in Eq (14.147) definitely favors some values over others. We shall return shortly to the obvious issue

of "subjectivity" that this approach raises. For now, with the pdf given in Eq (14.147), the resulting posterior pdf is:

f(θ|x) = C · 12 · \binom{n}{x} θ^{x+1} (1 - θ)^{n-x+2}    (14.157)

It is straightforward to establish (see Exercise 14.33) that the final posterior pdf is given by

f(θ|x) = [(n + 4)!/((x + 1)!(n - x + 2)!)] θ^{x+1} (1 - θ)^{n-x+2}    (14.158)

so that the Bayes estimate and the MAP estimate in this case are given, respectively, as:

θ̂_B = (x + 2)/(n + 5)    (14.159)

and

θ̂ = (x + 1)/(n + 3)    (14.160)

Again, with the specific experimental outcome of 3 successes in n = 10 trials, we obtain the following as the CASE II estimates:

θ̂_B = 5/15 = 0.33;  θ̂ = 4/13 = 0.31    (14.161)
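Since both cases amount to combining a Beta(α, β) prior with a binomial likelihood, the posterior is Beta(α + x, β + n - x), and all four point estimates follow from its mean and mode. A sketch (the uniform prior of CASE I corresponds to Beta(1, 1)):

```python
def beta_binomial_estimates(alpha, beta, x, n):
    """Posterior is Beta(alpha + x, beta + n - x); return (Bayes, MAP) estimates."""
    a_post, b_post = alpha + x, beta + n - x
    bayes = a_post / (a_post + b_post)              # posterior mean, cf. Eqs (14.154), (14.159)
    map_est = (a_post - 1) / (a_post + b_post - 2)  # posterior mode, cf. Eqs (14.155), (14.160)
    return bayes, map_est

print(beta_binomial_estimates(1, 1, 3, 10))  # CASE I:  (0.333..., 0.3)
print(beta_binomial_estimates(2, 3, 3, 10))  # CASE II: (0.333..., 0.3077)
```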

It is important to note that, as different as the CASE I and CASE II conditions are, and as different as the criteria for determining the Bayes estimates and the MAP estimates are, the results are all quite similar. Still, as we noted in CASE I, so it is in this case that the estimates, θ̂_B and θ̂, are biased, but their variances can be shown to be smaller than that of the MLE estimate.

14.6.5 Discussion

The Bayesian "Controversy": Subjectivity


Bayesian estimation is considered controversial in many scientific circles. The primary reason is that, in contrast to "standard" (or "frequentist") estimation, which is based entirely on data alone, Bayesian estimation combines prior information with data to obtain parameter estimates. Since the prior distribution is mostly a subjective description of the variability in the unknown parameter vector θ before data is acquired, the argument against Bayesian estimation is that it introduces subjectivity into data analysis.

In many other circles, however, the very fact that Bayesian estimation provides a systematic methodology for incorporating prior knowledge into data analysis is what makes it quite attractive. Consider, for example, a study of the reliability of a new generation of power distribution networks. Such studies usually build on results from previous studies on earlier generations, which themselves were follow-ups to earlier studies. While each new generation of

networks has its own characteristics and properties, ignoring earlier studies of previous generations as if they did not exist is not considered good engineering practice. Whatever relevant prior information is available from previous studies should be incorporated into the current analysis. Many areas of theoretical and applied science (including engineering) advance predominantly by building on prior knowledge, not by ignoring available prior information.

Still, the possibility remains that the subjectivity introduced into data analysis by the choice of the prior distribution, f(θ), could dominate the objective information contained in the data. The counter-argument is that the Bayesian approach is actually completely transparent in how it distinctly separates out each component of the entire data analysis process into what is subjective and what is objective, making it possible to assess, in an objective manner, the influence of the prior information on the final result. It also allows room for adaptation, in light of additional objective information.
Recursive Bayesian Estimation
It is in the latter sense noted above that the Bayesian approach provides its most compelling advantage: recursive estimation. Consider a case that is all too common in the chemical industry, in which the value of a process variable, say the viscosity of a polymer material, is to be determined experimentally. The measured value of the process variable is subject to random variation by virtue of the measurement device characteristics, but also intrinsically. In particular, the true value of such a variable changes dynamically (i.e. from one time instant to the next, the value will change because of dynamic operating conditions). If the objective is to estimate the current value of such a variable, the "frequentist" approach, as discussed in the earlier parts of this chapter, is to obtain a random sample of size n from which the true value will be estimated. Unfortunately, because of dynamics, only a single value is obtainable at any point in time, t_k, say x(t_k); at the next sampling point, the observed value, x(t_{k+1}), is, technically speaking, not the same as the previous value, and, in any event, the two can hardly be considered as independent of each other. There is no realistic "frequentist" solution to this problem. However, by postulating a prior distribution, one can obtain a posterior distribution on the basis of a single data point; the resulting posterior distribution, which now incorporates the information contained in the just-acquired data, can then be used as the prior distribution for the next round. In this recursive strategy, the admittedly subjective prior employed as an "initial condition" is ultimately "washed out" of the system with the progressive addition of objective, but time-dependent, data. A discussion of this type of problem is included in the application case studies of Chapter 20.
Choosing Prior Distributions
The reader may have noticed that the choice of a Beta distribution as the prior distribution, f(θ), for the Binomial/Bernoulli probability of success, p, is


particularly appropriate. Not only is the Beta random variable conveniently scaled between 0 and 1 (just like p), the functional form of the Beta pdf is perfectly paired with the Binomial pdf, when viewed from the perspective of the unknown parameter, p. The two pdfs are repeated here for ease of comparison:

f(x|θ) = \binom{n}{X} θ^X (1 - θ)^{n-X}    (14.162)

f(θ) = [Γ(α + β)/(Γ(α)Γ(β))] θ^{α-1} (1 - θ)^{β-1}    (14.163)

where, even though the first is a function of the random variable, X, and the second is a function of the parameter, θ, the two pdfs are seen to have what is called a "conjugate" structure: multiplying one by the other results in a posterior pdf in which the conjugate structure is preserved. The Beta pdf is therefore said to provide a "conjugate prior" for the Binomial sampling distribution. The advantage of employing conjugate priors is therefore clear: it simplifies the computational work involved in determining the posterior distribution.

For the Poisson P(λ) random variable with unknown parameter θ = λ, the conjugate prior is the Gamma distribution. Arranged side-by-side, the nature of the conjugate structure becomes obvious:

f(x|θ) = e^{-θ} θ^x / x!;  x = 0, 1, 2, . . .    (14.164)

f(θ) = [1/(β^α Γ(α))] e^{-θ/β} θ^{α-1};  0 < θ < ∞    (14.165)

(This particular prior will be used in a case study in Chapter 20.)

There are a few more such conjugate "sampling and prior" distribution pairs, for example, the normal sampling distribution and the normal prior distribution; the exponential sampling distribution (with θ = 1/β) and the gamma prior distribution. Some of these are listed in Table 14.2.
We conclude by noting that while conjugate priors provide sampling distribution pairings that simplify analytical determination of posterior distributions, in fact, it is not necessary to seek conjugate priors for all Bayesian estimation problems. The most appropriate prior distribution should be selected
even if it does not form a conjugate pair with the sampling distribution.
Computational Issues
In more general cases, where any appropriate prior distribution is combined with the sampling distribution in question, determining the resulting
posterior distribution is not always a trivial matter because this exercise usually involves computing multi-dimensional integrals. Under such general conditions, it is not always possible to obtain explicit forms for the posterior


distributions. In many cases, the only option is to obtain the required posterior distributions (as well as point estimates) numerically. Until recently, the computational burden of numerically determining posterior distributions for practical problems constituted a considerable obstacle to the application of Bayesian techniques in estimation. With the introduction of Markov Chain Monte Carlo (MCMC)¹ techniques, however, this computational issue has essentially been resolved. There are now commercial software packages for carrying out such numerical computations quite efficiently.

14.7 Summary and Conclusions

Our study of statistical inference began in this chapter with estimation - the process by which unknown population parameters are determined from limited sample information - building directly on the foundation of sampling theory laid down in Chapter 13. We were primarily concerned with techniques for obtaining point estimates and how to quantify their precision, leading naturally to interval estimates. The method of moments technique might have appeared a bit ad-hoc (because it does not produce unique estimates), but it is quite straightforward, intuitive and quite useful in providing initial estimates that may be subsequently refined, if necessary. And, in any event, there are many cases where these estimates coincide precisely with the more systematic maximum likelihood estimates. On the other hand, most readers will probably admit to sensing something of a "much ado about nothing" air surrounding the method of maximum likelihood. For example, why go through all the calculus and algebra only to discover the obvious - that the sample mean is the MLE for the population mean? It is instructive, however, to keep in mind that such simple closed-form MLE solutions exist only in a handful of cases. No such solution exists for the gamma distribution parameters, for example, and definitely not for the Weibull distribution parameters. In a sense, therefore, it ought to be taken as a reassuring sign that in obvious cases, the maximum likelihood principle produces such intuitively obvious results; this should give the reader confidence that in cases where the results must be computed numerically, these results can also be trusted.
The nature and characteristics of the distribution of random samples - especially their variances - made it difficult to present any general results about interval estimates for anything beyond normal populations. Our brief discussion of how to obtain such interval estimates for non-Gaussian populations (when sample sizes are small) should be understood as illustrations
¹ Gilks, W.R., Richardson, S., and Spiegelhalter, D.J., Markov Chain Monte Carlo in Practice, Chapman & Hall/CRC, 1996.


of what is possible when Gaussian approximations are invalid; generalized discussions are practically impossible.
The discussion of criteria for selecting estimators at the beginning of the chapter may also have seemed somewhat obvious and superfluous, until we encountered Bayesian estimation in the final section of the chapter and the issues of bias and efficiency became important. If the reader had been wondering why anyone would ever consider using anything but an unbiased estimator, or questioned what practical implications the concept of "efficiency" could possibly have, Bayesian estimation put both issues in context simultaneously. Sometimes, especially when sample sizes are small, it may make more sense to opt for a biased estimator with a smaller variance. Such is the case with Bayesian estimation. Admittedly, the discussion of Bayesian estimation in this chapter was rather brief, but that does not make this estimation technique any less important. We have tried to make up for this by providing several exercises and a few illustrative application problems designed to expand on the brief coverage.
Finally, we note that many of the seeds of hypothesis testing have been sown already in this chapter; we shall see these fully developed in the next chapter, when we bring the discussion of statistical inference to a conclusion. For now, the key results of this chapter are summarized in Table 14.1, along with some information about Bayesian estimation in Table 14.2.

REVIEW QUESTIONS
1. The objective of this chapter is to provide answers to what sorts of questions?
2. What is estimation?
3. What are the two aspects of estimation discussed in this chapter?
4. What is an estimator?
5. What is a point estimate and how is it different from an estimator?
6. What is an interval estimator?
7. What is an interval estimate and how is it different from an interval estimator?
8. What is the mathematical definition of an unbiased estimator?
9. What makes unbiasedness an intuitively appealing criterion for selecting estimators?

10. Mathematically, what does it mean that one estimator is more efficient than another?
11. What is the mathematical definition of a consistent sequence of estimators?
12. What is the basic principle behind the method of moments technique for obtaining point estimates?
13. Are method of moments estimators unique?
14. What is the likelihood function and how is it differentiated from the joint pdf of a random sample?
15. What is the log-likelihood function? Why is it often used in place of the likelihood function in obtaining point estimates?
16. Are maximum likelihood estimates always unbiased?
17. What is the invariance property of maximum likelihood estimators?
18. What are the asymptotic properties of maximum likelihood estimators?
19. What is needed in order to quantify the precision of point estimates?
20. What are the two main components of an interval estimate?
21. What is the general procedure for determining interval estimates for the mean of a normal population with σ known?
22. What is the difference between interval estimates for the mean of a normal population when σ is known and when σ is unknown?
23. What probability distribution is used to obtain interval estimates for the variance of a normal population?
24. Why is the confidence interval around the point estimate of a normal population variance not symmetric?
25. How can interval estimates of the difference between two normal population means be used to assess the equality of these means?
26. How does one obtain interval estimates for parameters from other non-Gaussian populations when sample sizes are large and when they are small?
27. What are the distinguishing characteristics of Bayesian estimation?
28. What is a prior distribution, f(θ), and what role does it play in Bayesian estimation?

29. Apart from the prior distribution, what other types of probability distributions
are involved in Bayesian estimation, and how are they related?
30. What is the primary result of Bayesian estimation?
31. What is the Bayes estimator? What is the maximum à posteriori estimator?
32. What are some of the controversial aspects of Bayesian estimation?
33. What is recursive Bayesian estimation?
34. What is a conjugate prior distribution?
35. What are some of the computational issues involved in the practical application
of Bayesian estimation, and how have they been resolved?

EXERCISES
Section 14.1
14.1 Given a random sample X1, X2, ..., Xn from a Gaussian N(μ, σ²) population with unknown parameters, it is desired to use the sample mean, X̄, and the sample variance, S², to determine point estimates of the unknown parameters; X̄ ± 2√(S²/n) is to be used to determine an interval estimate. The following data set was obtained for this purpose:

{9.37, 8.86, 11.49, 9.57, 9.15, 9.10, 10.26, 9.87}

(i) In terms of the random sample, what is the estimator for μ; and in terms of the supplied data, what is the point estimate for μ?
(ii) What are the boundary estimators, U_L and U_R, such that (U_L, U_R) is an interval estimator for μ? What is the interval estimate?
14.2 Refer to Exercise 14.1.
(i) In terms of the random sample, what is the estimator for σ², and what is the point estimate?
(ii) Given the boundary estimators for σ² as:

U_L = \frac{(n-1)S^2}{C_L}; \quad U_R = \frac{(n-1)S^2}{C_R}

where C_L = 19.0 and C_R = 2.7, obtain an interval estimate for σ².
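As an aside on mechanics (an added sketch, not part of the exercises, assuming NumPy is available), the computations requested in Exercises 14.1 and 14.2 can be checked numerically:

import numpy as np

x = np.array([9.37, 8.86, 11.49, 9.57, 9.15, 9.10, 10.26, 9.87])
n, xbar, s2 = len(x), x.mean(), x.var(ddof=1)     # point estimates of mu and sigma^2

w = 2 * np.sqrt(s2 / n)                           # half-width of the interval in Exercise 14.1
print("interval for mu:", (xbar - w, xbar + w))

C_L, C_R = 19.0, 2.7                              # boundary constants given in Exercise 14.2
print("interval for sigma^2:", ((n - 1) * s2 / C_L, (n - 1) * s2 / C_R))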


14.3 Consider a random sample X1, X2, ..., Xn from a population with unknown mean, μ, and variance, σ², and the following estimators:

M_1 = \frac{1}{n}\sum_{i=1}^{n} X_i; \quad M_2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2

If the following relationship holds:

E(X^2) = Var(X) + [E(X)]^2

where Var(X) is the variance of the random variable, determine the estimator for Var(X) in terms of M_1 and M_2. Determine the point estimates, m_1, m_2 and s², respectively, for M_1, M_2 and Var(X) from the following sample data:

{10.09, 15.73, 15.04, 5.55, 18.04, 17.95, 12.55, 9.66}
14.4 Consider a random sample X1, X2, ..., Xn from a population with unknown mean, μ, and variance, σ²; define the following estimator:

\bar{X}_g = \left(\prod_{i=1}^{n} X_i\right)^{1/n}

(i) In terms of the supplied information, what is the estimator for A = ln X̄_g and for Θ = (X̄_g)ⁿ?
(ii) Determine point estimates for X̄_g, A, e^A and for X̄, the sample mean, from the following sample data:

{7.91, 5.92, 4.53, 33.26, 24.13, 5.42, 16.96, 3.93}
Section 14.2
14.5 Consider a random sample X1, X2, ..., Xn.
(i) If the sample is from a Lognormal population, i.e. X ~ L(α, β), so that

E(ln X) = α

and if the sample geometric mean is defined as

\bar{X}_g = \left(\prod_{i=1}^{n} X_i\right)^{1/n}

show that ln X̄_g, not X̄_g, is unbiased for α.
(ii) If the sample is from an exponential population defined in terms of η, the rate of occurrence of the underlying Poisson events, i.e., the pdf is given as:

f(x) = \eta e^{-\eta x}; \quad 0 < x < \infty

and if the sample mean is defined as

\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i

show that this estimator is unbiased for 1/η, but that the estimator 1/X̄ is not unbiased for η. (You do not have to compute the expectation of 1/X̄.)


14.6 Given a random sample X1, X2, ..., Xn from a general population with unknown mean, μ, and variance, σ², show that the sample variance,

S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2

where X̄ is the sample mean, is unbiased for the population variance σ², regardless of the underlying population.
14.7 Consider the estimator θ̂, with mean μ(θ̂) and variance σ²(θ̂), proposed as an estimator for the unknown population parameter, θ. If θ̂ is a biased estimator, i.e., E(θ̂) ≠ θ, define the bias B(θ) as:

B(\theta) = E(\hat{\theta}) - \theta \qquad (14.166)

and show that the mean squared estimation error, MSE, is given by

E\left[(\hat{\theta} - \theta)^2\right] = \sigma^2(\hat{\theta}) + (B(\theta))^2 \qquad (14.167)

(Consider decomposing the estimation error into two components as follows: (θ̂ - θ) = [θ̂ - μ(θ̂)] + [μ(θ̂) - θ].)
14.8 Given a random sample X1, X2, ..., Xn from a Bernoulli population with unknown parameter θ, it is known, first, that X = Σᵢ₌₁ⁿ Xᵢ is a binomial Bi(n, θ) random variable; secondly, that the estimator

\hat{\theta} = \frac{X}{n}

is unbiased for θ. Consider a second estimator defined as:

\hat{\theta}^* = \frac{X+1}{n+2}

(i) Show that θ̂* is biased for θ and determine the bias B(θ), as defined in Eq (14.166) in Exercise 14.7.
(ii) Let V and V* represent the variances of the estimators θ̂ and θ̂*, respectively. Show that

V^* = \left(\frac{n}{n+2}\right)^2 V

and hence establish that V* < V, so that the biased estimator θ̂* is more efficient than the unbiased estimator, θ̂, especially for small sample sizes.
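A small Monte Carlo sketch (an added illustration, assuming NumPy) of the bias-variance trade-off in Exercise 14.8:

import numpy as np

rng = np.random.default_rng(1)
theta, n, trials = 0.3, 10, 100_000
X = rng.binomial(n, theta, size=trials)        # repeated realizations of X = sum of the X_i

theta_hat = X / n                              # unbiased estimator
theta_star = (X + 1) / (n + 2)                 # biased, lower-variance estimator

for name, est in [("X/n", theta_hat), ("(X+1)/(n+2)", theta_star)]:
    print(name, "bias:", est.mean() - theta,
          "var:", est.var(), "MSE:", np.mean((est - theta) ** 2))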
14.9 Given a random sample X1, X2, ..., Xn from a population with mean μ and variance σ², define two statistics as follows:

\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i; \quad \bar{X}_\omega = \sum_{i=1}^{n}\omega_i X_i, \; \text{with} \; \sum_{i=1}^{n}\omega_i = 1

It was shown in the text that both statistics are unbiased for μ; now show that X̄, a special case of X̄_ω, is the more efficient estimator of μ.
14.10 Given a random sample X1, X2, ..., Xn from a general population with finite mean, μ, and finite variance, σ², show that the sample mean, X̄ = (Σᵢ₌₁ⁿ Xᵢ)/n, is consistent for μ regardless of the underlying population. (Hint: Invoke the Central Limit Theorem.)
Section 14.3
14.11 Given a random sample X1, X2, ..., Xn from a Poisson P(λ) population, obtain an estimate for the population parameter, λ, on the basis of the second moment estimator,

M_2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2

and show that this estimate, λ̂₂, is explicitly given by:

\hat{\lambda}_2 = \frac{1}{2}\left(\sqrt{4M_2 + 1} - 1\right)
14.12 Refer to Exercise 14.11 and consider the following sample of size n = 20 from a Poisson P(2.10) population.

3  0  1  2  3  2  4  2  1  2
4  1  2  3  2  1  3  5  0  2

Let λ̂₁ represent the estimate obtained from the first moment, and λ̂₂ the estimate obtained from the second moment, as in Exercise 14.11.
(i) Determine specific numerical values for λ̂₁ and λ̂₂ from the data and compare them.
(ii) Consider the following weighted estimate, a combination of these two estimates:

\hat{\lambda}_w = w\hat{\lambda}_1 + (1 - w)\hat{\lambda}_2

Determine the values of λ̂_w for w = 0.1, 0.2, 0.3, ..., 0.9. Plot the nine values as a function of w and compare these to the true value λ = 2.10.
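For concreteness, a brief computational sketch (an added example, assuming NumPy) of the estimates requested in Exercise 14.12:

import numpy as np

x = np.array([3, 0, 1, 2, 3, 2, 4, 2, 1, 2,
              4, 1, 2, 3, 2, 1, 3, 5, 0, 2])   # the n = 20 sample above
m1, m2 = x.mean(), np.mean(x ** 2)

lam1 = m1                                      # first-moment estimate
lam2 = 0.5 * (np.sqrt(4 * m2 + 1) - 1)         # second-moment estimate (Exercise 14.11)

for w in np.arange(0.1, 1.0, 0.1):
    print(f"w = {w:.1f}: weighted estimate = {w * lam1 + (1 - w) * lam2:.3f}")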
14.13 Given a random sample X1, X2, ..., Xn from a negative binomial NBi(k, p) population,
(i) Obtain method of moments estimates of θ₁ = k and θ₂ = p explicitly in terms of the first two moments, M_1 and M_2.
(ii) If k is specified (as it is in many cases), obtain two separate expressions for the method of moments estimate for p, one based on M_1 and the other on M_2.
(iii) When k is specified, obtain a maximum likelihood estimate (MLE) for p and compare it with the method of moments estimate obtained in (ii).
14.14 (i) In terms of the first two moments M_1 and M_2, obtain two separate method of moments estimators for the unknown parameter in the geometric G(p) distribution. Given the following data from a G(0.25) population, determine numerical values for the two different point estimates and compare them to the true population value.

1  6  2   2  9  14  2  1  2  1
1  1  5  14  2   2  6  3  1  1

(ii) Obtain the harmonic mean of the given data. As an estimate of the population parameter p = 0.25, how does this estimate compare with the two method of moments estimates for p?
14.15 In terms of the first two moments M_1 and M_2, obtain two separate method of moments estimators for the unknown parameter in the exponential E(β) distribution. From the following data sampled from an exponential E(4) population, determine numerical values of the two different point estimates and compare them to the true population value.

 6.99   2.84  0.41  3.75   2.16
 0.52   0.67  2.72  5.22  16.65
10.36   1.66  3.26  1.78   1.31
 5.75   0.12  6.51  4.05   1.52

14.16 On the basis of the first two moments M_1 and M_2, determine method of moments estimates for the two parameters in the Beta B(α, β) distribution.

14.17 Use the first two moments M_1 and M_2 to determine two separate estimators for the single Rayleigh R(b) distribution parameter.

14.18 On the basis of the first two moments M_1 and M_2, determine method of moments estimates for the two parameters in the Gamma Γ(α, β) distribution.
14.19 Show that the likelihood function for the binomial random variable given in Eq (14.66), i.e.,

L(p) = \binom{n}{X} p^X (1-p)^{n-X}

is maximized when p̂ = X/n, and hence establish the result stated in Eq (14.67).
14.20 Let X1, X2, ..., Xn be a random sample from a geometric population with unknown parameter θ. Obtain the maximum likelihood estimate (MLE), θ̂. Show that

E(\hat{\theta}) \neq \theta

so that θ̂ is in fact biased for θ, but

E\left(\frac{1}{\hat{\theta}}\right) = \frac{1}{\theta}

so that 1/θ̂ is unbiased for 1/θ.


14.21 For the general negative binomial random variable NBi(k, p) with both parameters unknown, determine maximum likelihood estimates given the random sample, (k_i, X_i); i = 1, 2, ..., n. When k is a fixed and known constant, so that the only unknown parameter is θ = p, show that under these circumstances, θ̂, the MLE for θ = p, is such that:

E(\hat{\theta}) \neq \theta

but

E\left(\frac{1}{\hat{\theta}}\right) = \frac{1}{\theta}

14.22 Given a random sample X1, X2, ..., Xn from a Gamma Γ(α, β) population,
(i) Determine the MLE for the parameter β when α is specified.
(ii) When α is not specified, there is no closed form solution for the maximum likelihood estimates of the two unknown parameters. However, show, without solving the simultaneous maximum likelihood equations, that the method of moments estimates are different from the maximum likelihood estimates.
Sections 14.4 and 14.5
14.23 The average of a sample of size n = 20 from a normal population with unknown mean and variance σ² = 5 was obtained as x̄ = 74.4; the sample variance was obtained as s² = 5.6.
(i) Determine an interval (x̄ - w, x̄ + w) such that the probability that the true value of μ lies in this interval is 0.90.
(ii) Repeat (i) when the desired probability is 0.95.
(iii) Repeat (i) and (ii) when the population variance is not given.
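A quick numerical sketch (an added example, assuming SciPy is available) of the z- and t-based interval computations involved in Exercise 14.23:

import numpy as np
from scipy import stats

n, xbar, sigma2, s2 = 20, 74.4, 5.0, 5.6

for conf in (0.90, 0.95):
    z = stats.norm.ppf(1 - (1 - conf) / 2)          # variance known: z-interval
    t = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)   # variance unknown: t-interval
    print(conf, "z:", (xbar - z * np.sqrt(sigma2 / n), xbar + z * np.sqrt(sigma2 / n)))
    print(conf, "t:", (xbar - t * np.sqrt(s2 / n), xbar + t * np.sqrt(s2 / n)))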
14.24 Refer to the data given in Table 1.2 in Chapter 1. Consider this a random sample of size n = 60 drawn from a Poisson population with unknown parameter λ. The sample average was obtained in that chapter as x̄ = 1.02 and is to be used as an estimate λ̂ of the unknown parameter. Determine the precision of this estimate by obtaining an interval such that

P(\bar{x} - w \leq \lambda \leq \bar{x} + w) = 0.95

Repeat for probability values 0.9 and 0.99.
14.25 Let X1, X2, ..., Xn be a random sample from a normal N(μ, σ²) population. With the sample variance S² defined in Eq (14.99) (where X̄ is the sample average), we know that the statistic:

C = \frac{(n-1)S^2}{\sigma^2}

is a chi-squared distributed random variable; more specifically, C ~ χ²(n - 1). Show that the estimator S² is unbiased for the population variance, σ², via the expected value of C.
14.26 Given the following interval estimates for the mean of a random sample from a normal population with unknown mean but known variance, σ², determine the implied confidence levels:

(i) X̄ ± 1.645σ/√n;  (ii) X̄ ± 2.575σ/√n;  (iii) X̄ ± 3.000σ/√n
14.27 Let X1, X2, ..., Xn be a random sample from a normal population with the distribution N(μ_X, σ²_X), which is independent of Y1, Y2, ..., Ym, another random sample from a different normal population with the distribution N(μ_Y, σ²_Y), where the sample sizes n and m need not be equal (i.e., n ≠ m).
(i) Obtain the pdf for the random variable D = X̄ - Ȳ.
(ii) If X̄ is the MLE for μ_X, and Ȳ is the MLE for μ_Y, show that D, defined as:

D = \bar{X} - \bar{Y}

is the MLE for μ_XY = μ_X - μ_Y.
14.28 Let X1, X2, ..., Xn be a random sample from a normal population with unknown mean and variance σ² = 5. The population mean is to be estimated with the sample mean such that the 95% confidence interval will be μ_L < μ < μ_R, an interval of total width w = μ_R - μ_L.
(i) Determine the sample size n required for w = 0.5.
(ii) If the population variance doubles to σ² = 10, what value of n will be required to maintain the same width for the 95% confidence interval?
(iii) If the population variance doubles but the sample size obtained in (i) is retained, what is the width of the 95% confidence interval?
14.29 A random sample of size n = 100 generated a sample mean, X̄ = 10.5, and a sample variance of s² = 1.5. Estimate the unknown mean with a 90% confidence interval, a 95% confidence interval, and a 99% confidence interval. Comment on how the width of the estimation intervals changes in relation to the desired confidence level, and why this is entirely reasonable.
14.30 An opinion poll based on a sample of 50 subjects estimated p, the proportion of the population in favor of the proposition, as 0.72.
(i) Estimate the true proportion, θ, with a 95% confidence interval. State any assumptions you may have to make in answering this question.
(ii) If the true population proportion is suspected to be θ = 0.8, and the estimate from an opinion poll is to be determined to within ±0.05 with 95% confidence, how many people, n, should be sampled?
(iii) If the proportion is to be estimated to within the same margin of ±0.05, but with 90% confidence, what is the value of n required? Comment on the effect that reducing the confidence level has on the sample size, n, required to achieve the desired precision.
14.31 The sample averages X̄ = 38.8 and Ȳ = 42.4 were obtained from random samples taken from two independent populations of respective sizes n_x = 120 and n_y = 90. The corresponding sample variances were obtained as s²_x = 20 and s²_y = 35. Determine a 95% confidence interval estimate for the difference μ_x - μ_y using the difference between the sample averages. Does this interval include zero?
14.32 Samples are to be taken from two different normal populations, one with variance σ₁² = 10, the other with a variance twice the magnitude of the first one. If the difference between the two population means is to be estimated to within ±2, with 95% confidence, determine the sample size required, assuming that n₁ = n₂ = n.
Section 14.6
14.33 In estimating an unknown binomial parameter, the posterior distribution arising from using a Beta B(2, 3) prior distribution was given in the text as Eq (14.157), i.e.,

f(\theta|x) = C \cdot 12\binom{n}{x}\theta^{x+1}(1-\theta)^{n-x+2}

(i) Show that in final form, with the constant C evaluated, this posterior distribution is:

f(\theta|x) = \frac{(n+4)!}{(x+1)!(n-x+2)!}\theta^{x+1}(1-\theta)^{n-x+2}

hence confirming Eq (14.158). (Hint: Exploit the structural similarity between this pdf and the Beta pdf.)
(ii) If x is the actual number of successes obtained in n trials, it is known that the estimate θ̂ = x/n is unbiased for θ. It was stated in the text that the Bayes and MAP estimates are, respectively,

\hat{\theta}_B = \frac{x+2}{n+5}; \quad \hat{\theta}^* = \frac{x+1}{n+3}

Show that these two estimates are both biased, but are both more efficient than θ̂. Which of the three is the most efficient?
14.34 Let X1, X2, ..., Xn be a random sample from a Poisson distribution with unknown parameter, λ.
(i) To estimate the unknown parameter with this sample, along with a Gamma Γ(a, b) prior distribution for λ, first show that the posterior distribution f(λ|x) is a Gamma Γ(a*, b*) distribution with

a^* = \sum_{i=1}^{n} X_i + a; \quad b^* = \frac{1}{n + \frac{1}{b}}

Hence show that the Bayes estimator, λ̂_B = E(λ|X), is a weighted sum of the sample mean, X̄, and the prior distribution mean, μ_p, i.e.,

\hat{\lambda}_B = w\bar{X} + (1 - w)\mu_p

with the weight w = nb/(nb + 1).
(ii) Now show that:

Var(\hat{\lambda}_B) < Var(\bar{X})

always, but especially when n is small.
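A minimal numerical check (an added sketch, assuming NumPy; the data are hypothetical) of the weighted-sum form of the Bayes estimator in Exercise 14.34:

import numpy as np

x = np.array([2, 0, 1, 3, 1])         # hypothetical small Poisson sample
a, b = 2, 1                            # Gamma(a, b) prior; prior mean mu_p = a*b
n = len(x)

a_star = x.sum() + a                   # posterior shape
b_star = 1.0 / (n + 1.0 / b)           # posterior scale
w = n * b / (n * b + 1)                # weight on the sample mean

print("posterior mean :", a_star * b_star)
print("weighted form  :", w * x.mean() + (1 - w) * a * b)   # identical, as the exercise shows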
14.35 The first 10 samples of the number of "inclusions" on 1-sq. meter glass sheets, shown below as the set I10, has been extracted from the full data set in Table 1.2.

I10 = {0, 1, 1, 1, 0, 0, 1, 0, 2, 2}

Consider this as a sample from a Poisson population with true population parameter λ = 1.
(i) Use the first five entries to obtain a maximum likelihood estimate of λ; compute the sample variance. Then refer to Exercise 14.34 and use the results there to obtain the Bayes estimate, again using the first five entries along with a prior distribution chosen as a Gamma Γ(2, 1). Obtain the variance of this Bayes estimate. Clearly the sample size is too small, but still use the Gaussian approximation to obtain approximate 95% confidence intervals around these estimates. Compare these two different estimates to the true value.
(ii) Repeat (i), this time using the entire 10 samples.


APPLICATION PROBLEMS
14.36 A cohort of 100 patients under the age of 35 years (the "Younger" group), and another cohort of the same size, but 35 years and older (the "Older" group), participated in a clinical study where each patient received five embryos in an in-vitro fertilization (IVF) treatment cycle. The result from the Assisted Reproductive Technologies clinic where the study took place is shown in the table below. The data shows x, the number of live births per delivered pregnancy, along with how many in each group had the pregnancy outcome of x.

x                     y_O                       y_Y
No. of live births    Total no. of older        Total no. of younger
in a delivered        patients (out of 100)     patients (out of 100)
pregnancy             with pregnancy outcome x  with pregnancy outcome x
0                     32                        8
1                     41                        25
2                     21                        35
3                     5                         23
4                     1                         8
5                     0                         1

On the postulate that these data represent random samples from the binomial Bi(n, θ_O) population for the "Older" group, and Bi(n, θ_Y) for the "Younger" group, obtain 95% confidence interval estimates of both parameters, θ_O and θ_Y.
Physiologically, these parameters represent the single embryo probability of success (i.e., resulting in a live birth at the end of the treatment cycle) for the patients in each group. Comment on whether or not the results of this clinical study indicate that these cohort groups have different IVF treatment success rates, on average.
14.37 The number of contaminant particles (flaws) found on each standard size silicon wafer produced at a certain manufacturing site is a random variable, X. In order to characterize this random variable, a random sample of 30 silicon wafers selected by a quality control engineer and examined for flaws resulted in the data shown in the table below, a record of the number of flaws found on each wafer.

4  1  2  3  2  1  2  4  0  1
3  0  0  2  3  0  3  2  1  2
3  4  1  1  2  2  5  3  1  1

Postulate an appropriate theoretical probability model for this random variable and estimate the unknown model parameter(s). Include an estimate of the precision of the estimated parameters. State any assumptions you may need to make in answering this question.
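One plausible line of attack (an added sketch, not the book's worked solution; it assumes SciPy is available): postulate a Poisson model, for which the MLE is the sample mean, with an approximate large-sample 95% interval λ̂ ± z_{0.025}√(λ̂/n):

import numpy as np
from scipy import stats

flaws = np.array([4, 1, 2, 3, 2, 1, 2, 4, 0, 1,
                  3, 0, 0, 2, 3, 0, 3, 2, 1, 2,
                  3, 4, 1, 1, 2, 2, 5, 3, 1, 1])   # the 30 wafer counts tabulated above
lam_hat = flaws.mean()                              # Poisson MLE
se = np.sqrt(lam_hat / len(flaws))                  # large-sample standard error
z = stats.norm.ppf(0.975)
print(f"lambda_hat = {lam_hat:.3f}; 95% CI = ({lam_hat - z*se:.3f}, {lam_hat + z*se:.3f})")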
14.38 The following data set, from a study by Lucas (1985)² (see also Exercises 12.3, 12.25 and 13.22), shows the number of accidents occurring per quarter (three months), over a 10-year period, at a DuPont company facility, separated into two periods: Period I is the first five-year period of the study; Period II, the second five-year period.

² Lucas, J. M., (1985). "Counted Data CUSUMs," Technometrics, 27, 129-144.

    Period I             Period II
5   5  10   8        3   4   2   0
4   5   7   3        1   3   2   2
2   8   6   9        7   7   1   4
5   6   5  10        1   2   2   1
6   3   3  10        4   4   4   4

(i) Consider that the entire data set constitutes a random sample of size 40 from a single Poisson population with unknown parameter. Estimate the parameter with a 95% confidence interval.
(ii) Now consider the data for each period as representing two different random samples of size 20 each, from two different Poisson populations, with unknown parameters λ₁ for Period I and λ₂ for Period II. Estimate these parameters with separate, approximate 95% confidence intervals. Compare these two interval estimates and comment on whether or not these two populations are indeed different. If the populations appear different, what do you think may have happened between Period I and Period II at the DuPont company facility that was the site of this study?
14.39 An exotic flu virus with a long incubation period reappears every year during the long flu season. Unfortunately, there is only a probability p that an infected patient will show symptoms within the first month; as such, the early symptomatic patients constitute only the "leading edge" of the infected members of the population. Assuming, for the sake of simplicity, that once infected, the total number of infected individuals does not increase (the virus is minimally contagious), and assuming that all symptomatic patients eventually come to the same hospital, the following data was obtained over a period of five years by an epidemiologist working with the local hospital doctors. N_E is the number of early symptomatic patients; N_T is the total number of infected patients treated that year. The unknown probability p is to be determined from this data in order to enable doctors to prepare for the virus's reappearance the following year.

Year    Early Symptomatics, N_E    Total Infected, N_T
1       5                          7
2       3                          8
3       3                          10
4       7                          7
5       2                          8

(i) Why is this a negative binomial phenomenon? Determine the values of the negative binomial parameter k and the random variable X from this data set.
(ii) Obtain an expression for the maximum likelihood estimate of p in terms of a general random sample of (k_i, X_i); i = 1, 2, ..., n. Why is it not possible to use the method of moments to estimate p in this case?
(iii) Determine from the data an actual estimate p̂. Use this estimate to generate a 7 × 7 table of probabilities f(x|k) for values of k = 1, 2, 3, 4, 5, 6, 7 and x = 0, 1, 2, 3, 4, 5, 6. Convert this table to a table of N_E values versus probabilities of observing N_T infected patients every "cycle," for each given N_E.
14.40 The data below, taken from Greenwood and Yule (1920)³, shows the frequency of accidents occurring, over a five-week period, to 647 women making high explosives during World War I.

Number of Accidents    Observed Frequency
0                      447
1                      132
2                      42
3                      21
4                      3
5+                     2

(i) If X is the random variable representing the number of accidents, determine the mean and the variance of this clearly Poisson-like random variable, and confirm that this is an over-dispersed Poisson variable for which the Poisson population parameter, rather than being constant, varies across the population.
(ii) For over-dispersed Poisson variables such as this, the more appropriate model is the negative binomial NBi(k, p). Determine the method of moments estimates for the unknown parameters k and p for this data set.
14.41 The table below shows the time in months between occurrences of safety violations for three operators, "A," "B," and "C," working in a toll manufacturing facility.

A  1.31  0.15  3.02  3.17  4.84  0.71  0.70  1.41  2.68  0.68
B  1.94  3.21  2.91  1.66  1.51  0.30  0.05  1.62  6.75  1.29
C  0.79  1.22  0.65  3.90  0.18  0.57  7.26  0.43  0.96  3.76

Postulate an appropriate probability model for the phenomenon in question, and treat the data set as three random samples of size n = 10 each, from the three different populations represented by each operator. Obtain precise 95% confidence interval estimates of the unknown population parameters. Interpret your results in terms of any differences that might exist between the safety performances of the three operators.
14.42 The data set in the table below is the time (in months) from receipt to publication (sometimes known as "time-to-publication") of 85 papers published in the January 2004 issue of a leading chemical engineering research journal.

³ Greenwood, M. and Yule, G. U. (1920). "An enquiry into the nature of frequency distributions representative of multiple happenings with particular reference of multiple attacks of disease or of repeated accidents." Journal of the Royal Statistical Society, 83, 255-279.

19.2  15.1   9.6   4.2   5.4
 9.0   5.3  12.9   4.2  15.2
17.2  12.0  17.3   7.8   8.0
 8.2   3.0   6.0   9.5  11.7
 4.5  18.5  24.3   3.9  17.2
13.5   5.8  21.3   8.7   4.0
20.7   6.8  19.3   5.9   3.8
 7.9  14.5   2.5   5.3   7.4
19.5   3.3   9.1   1.8   5.3
 8.8  11.1   8.1  10.1  10.6
18.7  16.4   9.8  10.0  15.2
 7.4   7.3  15.4  18.7  11.5
 9.7   7.4  15.7   5.6   5.9
13.7   7.3   8.2   3.3  20.1
 8.1   5.2   8.8   7.3  12.2
 8.4  10.2   7.2  11.3  12.0
10.8   3.1  12.8   2.9   8.8

Postulate an appropriate probability model for this random variable, justifying your choice clearly but succinctly. Consider this data as a random sample from the population in question and determine estimates of the unknown population parameters using whatever technique is most convenient. Plot, on the same graph, a theoretical model prediction against the data histogram and comment on your model fit to the data. In particular, compare the model prediction of the most "popular" time-to-publication, x*, with the corresponding data value; also compare the probability that a paper will take longer than x* months to publish with the proportion of papers from the data table that took longer than x* months to publish.
14.43 The data table below shows x, a count of the number of species, x = 1, 2, ..., 24, and the associated number of Malayan butterflies that have x number of species. In Fisher et al., (1943)⁴, where the data was first published and analyzed, it was proposed that the appropriate model for the phenomenon is the logarithmic series distribution (see Exercise 8.13), with the pdf:

f(x) = \frac{\alpha p^x}{x}; \quad 0 < p < 1; \; x = 1, 2, \ldots,

where

\alpha = \frac{-1}{\ln(1-p)}

⁴ Fisher, R. A., S. Corbet, and C. B. Williams. (1943). "The relation between the number of species and the number of individuals in a random sample of an animal population." Journal of Animal Ecology, 12, 42-58.

x (No. of species)      1    2    3    4    5    6    7    8
Observed Frequency    118   74   44   24   29   22   20   19

x (No. of species)      9   10   11   12   13   14   15   16
Observed Frequency     20   15   12   14    6   12    6    9

x (No. of species)     17   18   19   20   21   22   23   24
Observed Frequency      9    6   10   10   11    5    3    3

It can be shown that for this random variable and its pdf,

E(X) = \frac{\alpha p}{1-p}

Var(X) = \frac{\alpha p(1 - \alpha p)}{(1-p)^2}

Show that the MLE of the unknown population parameter p and the method of moments estimator based on the first moment coincide. Obtain an estimate of this population parameter for this data set. Compare the model prediction to the data.
14.44 The data in the table below, on the wall thickness (in ins) of cast aluminum cylinder heads used in aircraft engine cooling jackets, is from Mee (1990)⁵.

0.223  0.228  0.214  0.193  0.223  0.213  0.218  0.215  0.233
0.201  0.223  0.224  0.231  0.237  0.217  0.204  0.226  0.219

(i) Consider this as a random sample from a normal population with unknown parameters μ and σ². Determine 95% confidence interval estimates of both the mean and the variance of the wall thickness.
(ii) If a customer requires that these wall thicknesses be made to within the specifications 0.220 ± 0.030 ins, what is the probability that the manufacturer will meet these specifications? State any assumptions required in answering this question.
14.45 The intrinsic variations in the measured amount of pollution contained in water samples from rivers and streams in a mining region of West Central United States is known to be a normally distributed random variable with a fairly stable standard deviation of 5 milligrams of solids per liter. As an EPA inspector who wishes to test a random selection of n water samples in order to determine the mean daily rate of pollution to within ±1 milligram per liter with 95% confidence, how many water samples will you need to take? A selection of 6 randomly selected water samples returned a mean value of 40 milligrams per liter and what seems like an excessive variance of 30 (mg/liter)². Determine a 95% confidence interval around this estimate of σ² and comment on whether or not this value is excessive compared to the assumed population value.
⁵ Mee, R. W., (1990). "An improved procedure for screening based on a correlated, normally distributed variable," Technometrics, 32, 331-337.

14.46 The time (in weeks) between occurrences of minor safety incidents in a Research and Development laboratory is known to be a random variable X with an exponential distribution

f(x) = \frac{1}{\beta}e^{-x/\beta}; \quad 0 < x < \infty

where the characteristic parameter, β, is to be determined from a random sample, X1, X2, ..., Xn, using Bayesian principles.
(i) Replace 1/β with η in the pdf, and use the following gamma Γ(a, b) distribution as the prior distribution for η:

f(\eta) = C\eta^{a-1}e^{-\eta/b}

where C is the usual normalization constant, and a, b are known constants. Obtain the posterior distribution f(η|x1, x2, ..., xn) and, from this, obtain an expression for η̂_B, the posterior mean (i.e., the Bayes estimate of η); from this obtain β̂_B, the corresponding estimate of β. Also obtain η̂*, the MAP estimate, and from this obtain the corresponding estimate of β. Compare these two estimates of β.
(ii) The data set shown here was extracted from recent safety incident records of the laboratory discussed above. It shows the time (in weeks) between occurrences of the last 11 minor safety incidents.

1.31  0.15  3.02  3.17  4.84
0.71  0.70  1.41  2.68  0.68

Considering this data set as a specific realization of the random sample X1, X2, ..., X10, obtain the maximum likelihood estimate, β̂_ML, of the characteristic parameter β. Employing as a prior distribution for η = 1/β the Gamma distribution given in (i) above, with parameters a = 2, b = 1, obtain the Bayes estimate β̂_B, as well as the MAP estimate, β̂*. Compare these three estimates.
(iii) Plot on one graph a gamma Γ(2, 1) and a gamma Γ(3, 1) pdf; compare them. Now repeat part (ii) above, but this time use a prior distribution with parameters a = 3, b = 1. Compare the Bayes and MAP estimates obtained here with those obtained in (ii) and comment on the effect of the prior distributions on these estimates.
14.47 The number of contaminant particles (flaws) found on each standard size silicon wafer produced at a certain manufacturing site is a Poisson distributed random variable, X, with an unknown mean λ. To estimate this parameter, a sample of n wafers from a manufactured lot was examined and the number of contaminant particles found on each wafer recorded.
(i) Given that this data set constitutes a random sample X1, X2, ..., Xn, use a gamma Γ(α, β) distribution as a prior pdf for the unknown parameter, λ, and obtain λ̂_B, the Bayes estimate for λ (i.e., the posterior mean), and show that if λ̂_ML is the maximum likelihood estimate for λ, then,

\lim_{n \to \infty} \hat{\lambda}_B = \hat{\lambda}_{ML}

(ii) A sample of 30 silicon wafers was examined for flaws and the result (the number of flaws found on each wafer) is displayed in the table below.

3  1  2  3  2  1  2  2  0  1
4  0  0  2  0  0  0  2  1  2
3  4  1  1  2  2  5  3  1  1

Use a gamma Γ(2, 2) distribution as the prior distribution for λ and obtain, using this data set:
(a) the maximum likelihood estimate;
(b) the Bayes estimate; and
(c) Repeat (a) and (b) using only the first 10 data points in the first row of the data table. Comment on how an increase in the number of available data points affects parameter estimation in this particular case.
14.48 Consider the case in which θ_k, the true value of a process variable at time instant k, is measured as y_k, i.e.,

y_k = \theta_k + \epsilon_k \qquad (14.168)

where ε_k, the measurement noise, is usually considered random. The standard procedure for obtaining a good estimate of θ_k involves taking repeated measurements and averaging.
However, many circumstances arise in practice when such a strategy is infeasible, primarily because of significant process dynamics. Under these circumstances, the process variable changes significantly with time during the period of repeated sampling; the repeated measurements thus provide information about the process variable at different time instants and not true replicates of the desired measurement at a specific time instant, k. Decisions must therefore be made on the true value, θ_k, from the single available measurement y_k - a non-standard problem which may be solved using the Bayesian approach as follows:
(i) Theory: Consider y_k as a realization of the random variable, Y_k, possessing a normal N(θ_k, σ²) distribution with unknown mean θ_k and variance σ²; then consider that the (unknown) process dynamics can be approximated by the simple "random walk" model:

\theta_k = \theta_{k-1} + w_k \qquad (14.169)

where w_k, the process noise, is a sequence of independent realizations of the zero mean, Gaussian random variable, W, with variance v²; i.e., W ~ N(0, v²).
This process dynamic model is equivalent to declaring that θ_k, the unknown true mean value of Y_k, has a prior pdf N(θ̂_{k-1}, v²); the measurement equation above, Eq (14.168), implies that the sampling distribution for Y_k is given by:

f(y_k|\theta_k) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(y_k - \theta_k)^2}{2\sigma^2}\right\}

Combine this with the prior distribution and obtain an expression for the posterior distribution f(θ_k|y_k), and show that the result is a Gaussian pdf with mean θ̂_k and variance σ̃², given by:

\hat{\theta}_k = \gamma\theta_{k-1} + (1 - \gamma)y_k \qquad (14.170)

\gamma = \frac{\sigma^2}{\sigma^2 + v^2} \qquad (14.171)

and

\tilde{\sigma}^2 = \frac{1}{\frac{1}{\sigma^2} + \frac{1}{v^2}} \qquad (14.172)

Thus, by adopting the penultimate Bayes estimate θ̂_{k-1} as an estimate for θ_{k-1} (since it is also unknown), obtain the recursive formula:

\hat{\theta}_k = \gamma\hat{\theta}_{k-1} + (1 - \gamma)y_k \qquad (14.173)

for estimating the true value θ_k from the single data point y_k. This expression is recognizable to engineers as the popular discrete, (first order) exponential filter, for which γ is usually taken as a tuning parameter.
(ii) Application: Apply the result in part (i) above to filter the following raw data, representing 25 hourly measurements of a polymer product's solution viscosity (in scaled, coded units), using γ = 0.20, and the initial condition θ̂₁ = 20.00.

 k    y_k        k    y_k
 1   20.82      14   18.65
 2   20.92      15   21.48
 3   21.46      16   21.85
 4   22.15      17   22.34
 5   19.76      18   20.35
 6   21.91      19   20.32
 7   22.13      20   22.10
 8   24.26      21   20.69
 9   20.26      22   19.74
10   20.35      23   20.27
11   18.32      24   23.33
12   19.24      25   19.69
13   19.99

Compare a time sequence plot of the resulting filtered value, θ̂_k, with that of the raw data. Repeat with γ = 0.80 (and the same initial condition) and comment on which filter parameter value (0.20 or 0.80) provides estimates that are more representative of the dynamic behavior exhibited by the raw data.
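A compact sketch (an added example, assuming NumPy; the initial condition is treated here as the starting filtered value) of the recursive filter of Eq (14.173) applied to the viscosity data:

import numpy as np

y = np.array([20.82, 20.92, 21.46, 22.15, 19.76, 21.91, 22.13, 24.26, 20.26,
              20.35, 18.32, 19.24, 19.99, 18.65, 21.48, 21.85, 22.34, 20.35,
              20.32, 22.10, 20.69, 19.74, 20.27, 23.33, 19.69])

def exp_filter(y, gamma, theta0=20.00):
    theta = np.empty_like(y)
    prev = theta0
    for k, yk in enumerate(y):
        prev = gamma * prev + (1 - gamma) * yk   # Eq (14.173)
        theta[k] = prev
    return theta

light = exp_filter(y, gamma=0.20)   # small gamma: estimates track the raw data closely
heavy = exp_filter(y, gamma=0.80)   # large gamma: much smoother, more sluggish estimates

Plotting light and heavy against y reproduces the comparison the problem asks for.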
14.49 Padgett and Spurrier (1990)⁶ obtained the following data set for the breaking strengths (in GPa) of carbon fibers used in making composite materials.

1.4  3.7  3.0  1.4  1.0  2.8  4.9  3.7  1.8  1.6
3.2  1.6  0.8  5.6  1.7  1.6  2.0  1.2  1.1  1.7
2.2  1.2  5.1  2.5  1.2  3.5  2.2  1.7  1.3  4.4
1.8  0.4  3.7  2.5  0.9  1.6  2.8  4.7  2.0  1.8
1.6  1.1  2.0  1.6  2.1  1.9  2.9  2.8  2.1  3.7

It is known that this phenomenon is well-modeled by the Weibull W(ζ, β) distribution; the objective is to determine values for these unknown population parameters from this sample data, considered as a random sample with n = 50. However, obtaining Weibull population parameters from sample data is particularly difficult. The following is a method based on the cumulative distribution function, cdf, F(x).

⁶ Padgett, W.J. and J. D. Spurrier, (1990). "Shewhart-type charts for percentiles of strength distributions," J. of Quality Tech., 22, 283-288.

From the Weibull pdf (and from the derivation presented in Chapter 9), first show that the Weibull cdf is given by

F(x) = 1 - e^{-(x/\beta)^{\zeta}} \qquad (14.174)

and hence show that

\zeta(\ln x - \ln \beta) = \ln\left\{-\ln[1 - F(x)]\right\} \qquad (14.175)

Observe therefore that given F(x) and x, a plot of ln{-ln[1 - F(x)]} versus ln x should result in a straight line with slope ζ and intercept -ζ ln β, from which the appropriate values may be determined for the two unknown parameters.
Employ the outlined technique to determine from the supplied data your best estimate of the unknown parameters. Compare your results to the "true" values ζ = 2.5 and β = 2.0 used by Padgett and Spurrier in their analysis.
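A sketch of the outlined ln-ln regression technique (an added example, assuming NumPy; the median-rank plotting position used here for F(x) is one common choice, not necessarily the book's):

import numpy as np

strengths = np.sort(np.array([
    1.4, 3.7, 3.0, 1.4, 1.0, 2.8, 4.9, 3.7, 1.8, 1.6,
    3.2, 1.6, 0.8, 5.6, 1.7, 1.6, 2.0, 1.2, 1.1, 1.7,
    2.2, 1.2, 5.1, 2.5, 1.2, 3.5, 2.2, 1.7, 1.3, 4.4,
    1.8, 0.4, 3.7, 2.5, 0.9, 1.6, 2.8, 4.7, 2.0, 1.8,
    1.6, 1.1, 2.0, 1.6, 2.1, 1.9, 2.9, 2.8, 2.1, 3.7]))
n = len(strengths)
F = (np.arange(1, n + 1) - 0.3) / (n + 0.4)       # median-rank estimate of F(x)

slope, intercept = np.polyfit(np.log(strengths), np.log(-np.log(1 - F)), 1)
zeta_hat = slope                                   # shape estimate, from Eq (14.175)
beta_hat = np.exp(-intercept / slope)              # scale, since intercept = -zeta*ln(beta)
print(zeta_hat, beta_hat)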
14.50 Polygraphs, the so-called "lie-detector" machines based on physiological measurements such as blood pressure, respiration and perspiration, are used frequently in government agencies and other businesses where employees handle highly classified information. While polygraph test results are sometimes permitted in some state courts, they are not admissible in federal courts, in part because of potential errors and the implications of such errors on the fairness of the justice system. Since the basic premise of these machines is the measurement of human physiological variables, it is possible to evaluate the performance of polygraphs in somewhat the same manner as one would any other medical diagnostic machine. (See Phillips, Brett, and Beary, (1986)⁷ for one such study carried out by a group of physicians.)
The data shown below is a compilation of the result of an extensive study (similar to the Phillips et al. study) in which a group of volunteers were divided into two equal-numbered subgroups of "truth tellers" and "liars." The tests were repeated 56 times over a period of two weeks and the results tabulated as shown: X_A is the fraction of the "truth tellers" falsely identified as liars by a Type A polygraph machine (i.e., false positives); X_B is the set of corresponding results for the same subjects, under conditions as close to identical as possible, using a Type B polygraph machine. Conversely, Y_A is the fraction of "liars" misidentified as "truth-tellers" (i.e., false negatives) by the Type A machine, with Y_B as the corresponding results using the Type B machine.
Postulate a reasonable probability model for the random phenomenon in question, providing a brief but adequate justification for your choice. Estimate the model parameters for the four populations and discuss how well your model fits the data.

⁷ M. Phillips, A. Brett, and J. Beary, (1986). "Lie Detectors Can Make a Liar Out of You," Discover, June 1986, p. 7.

Polygraph Data
  XA      YA      XB      YB
0.128   0.161   0.161   0.064
0.264   0.117   0.286   0.036
0.422   0.067   0.269   0.214
0.374   0.158   0.380   0.361
0.240   0.105   0.498   0.243
0.223   0.036   0.328   0.235
0.281   0.210   0.159   0.024
0.316   0.378   0.391   0.114
0.341   0.283   0.154   0.067
0.397   0.166   0.216   0.265
0.037   0.212   0.479   0.378
0.097   0.318   0.049   0.004
0.112   0.144   0.377   0.043
0.216   0.281   0.327   0.271
0.265   0.238   0.563   0.173
0.225   0.043   0.169   0.040
0.253   0.200   0.541   0.410
0.211   0.299   0.338   0.031
0.301   0.106   0.438   0.131
0.469   0.161   0.242   0.023
0.410   0.151   0.461   0.159
0.454   0.200   0.694   0.265
0.278   0.129   0.439   0.013
0.236   0.222   0.194   0.190
0.118   0.245   0.379   0.030
0.109   0.308   0.368   0.069
0.035   0.019   0.426   0.127
0.269   0.146   0.597   0.144

Polygraph Data (continued)
  XA      YA      XB      YB
0.175   0.368   0.441   0.024
0.425   0.327   0.412   0.218
0.119   0.698   0.295   0.057
0.380   0.054   0.136   0.081
0.234   0.070   0.438   0.085
0.323   0.057   0.445   0.197
0.356   0.506   0.239   0.111
0.401   0.142   0.207   0.011
0.444   0.356   0.251   0.029
0.326   0.128   0.430   0.229
0.484   0.108   0.195   0.546
0.280   0.281   0.429   0.039
0.435   0.211   0.581   0.061
0.172   0.333   0.278   0.136
0.235   0.100   0.151   0.014
0.418   0.114   0.374   0.055
0.366   0.083   0.638   0.031
0.077   0.251   0.187   0.239
0.352   0.085   0.680   0.106
0.231   0.225   0.198   0.066
0.175   0.325   0.533   0.132
0.290   0.352   0.187   0.240
0.099   0.185   0.340   0.070
0.254   0.287   0.391   0.197
0.556   0.185   0.318   0.071
0.407   0.109   0.102   0.351
0.191   0.049   0.512   0.072
0.232   0.076   0.356   0.048

TABLE 14.1: Summary of estimation results

Population      Point Estimator            Expected      Variance   Conf. Interval Estimator,
Parameter       θ̂                          Value E(θ̂)   Var(θ̂)    (1 - α) × 100%
-------------------------------------------------------------------------------------------
Mean, μ         X̄ = (Σᵢ₌₁ⁿ Xᵢ)/n           μ             σ²/n       X̄ ± z_{α/2} σ/√n
  (n < 30)                                                          X̄ ± t_{α/2}(n-1) s/√n
Variance, σ²    S² = Σᵢ₌₁ⁿ(Xᵢ - X̄)²/(n-1)  σ²            N/A        (n-1)S²/χ²_{α/2}(n-1) < σ² < (n-1)S²/χ²_{1-α/2}(n-1)
Proportion, p   p̂ = X/n                    p             pq/n       p̂ ± z_{α/2} √(p̂q̂/n)
  (Binomial)


TABLE 14.2: Some population parameters and conjugate prior distributions appropriate for their Bayesian estimation

Population        Sampling Distribution                          Conjugate Prior
and Parameter     f(x|θ)                                         Distribution f(θ)
----------------------------------------------------------------------------------------
Binomial          C₁ θ^x (1-θ)^{n-x}                             Beta, B(α, β):
θ = p                                                            C₂ θ^{α-1}(1-θ)^{β-1}
Poisson           (θ^{Σᵢxᵢ} e^{-nθ}) / ∏ᵢ xᵢ!                    Gamma, Γ(α, β):
θ = λ                                                            C₂ θ^{α-1} e^{-θ/β}
Exponential       θⁿ e^{-θ Σᵢ xᵢ}                                Gamma, Γ(α, β):
θ = 1/β                                                          C₂ θ^{α-1} e^{-θ/β}
Gaussian          (σ√(2π))^{-n} exp{-Σᵢ(xᵢ-θ)²/(2σ²)}            Gaussian, N(μ₀, v²):
θ = μ                                                            (v√(2π))^{-1} exp{-(θ-μ₀)²/(2v²)}
Gaussian          (2πθ)^{-n/2} exp{-Σᵢ(xᵢ-μ)²/(2θ)}              Inverse Gamma, IG(α, β):
θ = σ²                                                           C θ^{-(α+1)} e^{-β/θ}

Chapter 15
Hypothesis Testing

15.1  Introduction ......................................................... 552
15.2  Basic Concepts ....................................................... 553
      15.2.1  Terminology and Definitions ................................. 554
              Statistical Hypothesis ....................................... 554
              Test Statistic, Critical Region and Significance Level ...... 556
              Potential Errors, Risks, and Power ........................... 557
              Sensitivity and Specificity .................................. 558
              The p-value .................................................. 559
      15.2.2  General Procedure ............................................ 560
15.3  Concerning Single Mean of a Normal Population ........................ 561
      15.3.1  σ Known; the z-test .......................................... 563
              Using MINITAB ................................................ 567
      15.3.2  σ Unknown; the t-test ........................................ 570
              Using MINITAB ................................................ 573
      15.3.3  Confidence Intervals and Hypothesis Tests .................... 573
15.4  Concerning Two Normal Population Means ............................... 576
      15.4.1  Population Standard Deviations Known ......................... 576
      15.4.2  Population Standard Deviations Unknown ....................... 578
              Equal standard deviations .................................... 578
              Using MINITAB ................................................ 580
              Unequal standard deviations .................................. 581
              Confidence Intervals and Two-Sample tests .................... 581
              An Illustrative Example: The Yield Improvement Problem ...... 582
      15.4.3  Paired Differences ........................................... 585
15.5  Determining β, Power, and Sample Size ................................ 589
      15.5.1  β and Power .................................................. 591
      15.5.2  Sample Size .................................................. 593
              Practical Considerations ..................................... 596
      15.5.3  β and Power for Lower-Tailed and Two-Sided Tests ............. 598
      15.5.4  General Power and Sample Size Considerations ................. 599
15.6  Concerning Variances of Normal Populations ........................... 600
      15.6.1  Single Variance .............................................. 601
      15.6.2  Two Variances ................................................ 603
15.7  Concerning Proportions ............................................... 606
      15.7.1  Single Population Proportion ................................. 607
              Large Sample Approximations .................................. 608
              Exact Tests .................................................. 609
      15.7.2  Two Population Proportions ................................... 610
15.8  Concerning Non-Gaussian Populations .................................. 612
      15.8.1  Large Sample Test for Means .................................. 613
      15.8.2  Small Sample Tests ........................................... 614
15.9  Likelihood Ratio Tests ............................................... 616
      15.9.1  General Principles ........................................... 616
      15.9.2  Special Cases ................................................ 618
              Normal Population; Known Variance ............................ 619
              Normal Population; Unknown Variance .......................... 620
      15.9.3  Asymptotic Distribution for -2 ln Λ .......................... 622
15.10 Discussion ........................................................... 623
15.11 Summary and Conclusions .............................................. 624
      REVIEW QUESTIONS ..................................................... 626
      EXERCISES ............................................................ 629
      APPLICATION PROBLEMS ................................................. 637

The great tragedy of science - the slaying of a beautiful hypothesis by an ugly fact.
T. H. Huxley (1825-1895)

Since turning our attention fully to Statistics in Part IV, our focus has been on characterizing the population completely, using finite-sized samples. The discussion that began with sampling in Chapter 13, providing the mathematical foundation for characterizing the variability in random samples, and which continued with estimation in Chapter 14, providing techniques for determining values for population parameters, concludes in this chapter with hypothesis testing. This final tier of the statistical inference edifice is concerned with making and testing "assertive" statements about the population. Such statements are often necessary to solve practical problems, or to answer questions of practical importance; and this chapter is devoted to presenting the principles, practice and mechanics of testing the validity of hypothesized statements regarding the distribution of populations. The chapter covers extensive ground, from traditional techniques applied to traditional "Gaussian" problems, to non-Gaussian problems and some non-traditional techniques; it ends with a brief but frank discussion of persistent criticisms of hypothesis tests and some practical recommendations for handling such criticisms.

15.1 Introduction

We begin our discussion by returning to the first problem presented in Chapter 1 concerning yields from two chemical processes; we wish to use it to illustrate the central issues with hypothesis testing. Recall that the problem requires that we decide which process, the challenger, A, or the incumbent, B, should be chosen for commercial operation. The decision is to be based on economically driven comparisons that translate to answering the following mathematical questions about the yields Y_A and Y_B:
1. Is Y_A ≥ 74.5 and Y_B ≥ 74.5, consistently?
2. Is Y_A > Y_B?
3. If yes, is Y_A - Y_B > 2?


To deal with the problem systematically, inherent random variability compels us to start by characterizing the populations fully with pdfs, which are then used to answer these questions. This requires that we postulate an appropriate probability model and determine values for the unknown parameters from sample data.
Here is what we know thus far (from Chapters 1 and 12, and from the various examples in Chapter 14): we have plotted histograms of the data and postulated that these are samples from Gaussian-distributed populations; we have computed sample averages, ȳ_A, ȳ_B, and sample standard deviations, s_A, s_B; and in the various Chapter 14 examples, we have obtained point and interval estimates for the population means μ_A, μ_B, and the population standard deviations σ_A, σ_B.
But by themselves, these results are not quite sufficient to answer the questions posed above. To answer the questions, consider the following statements and the implications of being able to confirm or refute them:
1. Y_A is a random variable characterized by a normal population with mean value 75.5 and standard deviation 1.5, i.e., Y_A ~ N(75.5, 1.5²); similarly, Y_B ~ N(72.5, 2.5²); as a consequence,
2. The random variables, Y_A and Y_B, are not from the same distribution because μ_A ≠ μ_B and σ²_A ≠ σ²_B; in particular, μ_A > μ_B;
3. Furthermore, μ_A - μ_B > 2.
This is a collection of assertions about these two populations - statements which, if confirmed, will enable us to answer the questions raised. For example, Statement #1 will allow us to answer Question 1 by making it possible to compute the probabilities P(Y_A ≥ 74.5) and P(Y_B ≥ 74.5); Statement #2 will allow us to answer Question 2, and Statement #3, Question 3. How practical problems are formulated as statements of this type, and how such statements are confirmed or refuted, all fall under the formal subject matter of hypothesis testing. In general, the validity of such statements (or other assumptions about the population from which the sample data were obtained) is checked by:
1. Proposing an appropriate statistical hypothesis about the problem at hand; and
2. Testing this hypothesis against the evidence contained in the data.

15.2 Basic Concepts

15.2.1 Terminology and Definitions

Before launching into a discussion of the principles and mechanics of hypothesis testing, it is important to first introduce some terminology and definitions.
Statistical Hypothesis
A statistical hypothesis is a statement (an assertion or postulate) about
the distribution of one or more populations. (Theoretically, the statistical
hypothesis is a statement regarding one or more postulated distributions for
the random variable X distributions for which the statement is presumed to
be true. A simple hypothesis species a single distribution for X; a composite
hypothesis species more than one distribution for X.)
Modern hypothesis testing involves two hypotheses:
1. The null hypothesis, H0, which represents the primary, "status quo" hypothesis that we are predisposed to believe as true (a plausible explanation of the observation) unless there is evidence in the data to indicate otherwise - in which case, it will be rejected in favor of a postulated alternative.
2. The alternative hypothesis, Ha, the carefully defined complement to H0 that we are willing to consider in replacement if H0 is rejected.
For example, the portion of Statement #1 above concerning Y_A may be formulated more formally as:

H0: μ_A = 75.5
Ha: μ_A ≠ 75.5 \qquad (15.1)

The implication here is that we are willing to entertain the fact that the true value of μ_A, the mean value of the yield obtainable from process A, is 75.5; that any deviation of the sample data average from this value is due to purely random variability and is not significant (i.e., that this postulate "explains" the observed data). The alternative is that any observed difference between the sample average and 75.5 is "real" and not just due to random variability; that the alternative provides a better explanation of the data. Observe that this alternative makes no distinction between values that are less than 75.5 or greater; so long as there is evidence that the observed sample average is different from 75.5 (whether greater than or less than), H0 is to be rejected in favor of this Ha. Under these circumstances, since the alternative admits of values of μ_A that can be less than 75.5 or greater than 75.5, it is called a two-sided hypothesis.


It is also possible to formulate the problem such that the alternative actually "chooses sides," for example:

H0: μ_A = 75.5
Ha: μ_A < 75.5 \qquad (15.2)

In this case, when the evidence in the data does not support H0, the only other option is that μ_A < 75.5. Similarly, if the hypotheses are formulated instead as:

H0: μ_A = 75.5
Ha: μ_A > 75.5 \qquad (15.3)

the alternative, if the equality conjectured by the null hypothesis fails, is that the mean must then be greater. These are one-sided hypotheses, for obvious reasons.
A test of a statistical hypothesis is a procedure for deciding when to reject H0. The conclusion of a hypothesis test is either a decision to reject H0 in favor of Ha, or else to fail to reject H0. Strictly speaking, one never actually "accepts" a hypothesis; one just fails to reject it.
As one might expect, the conclusion drawn from a hypothesis test is shaped
by how Ha is framed in contrast to H0 . How to formulate the Ha appropriately
is best illustrated with an example.
Example 15.1: HYPOTHESES FORMULATION FOR COMPARING ENGINEERING TRAINING PROGRAMS
As part of an industrial training program for chemical engineers in their
junior year, some trainees are instructed by Method A, and some by
Method B. If random samples of size 10 each are taken from large groups
of trainees instructed by each of these two techniques, and each trainee's
score on an appropriate achievement test is shown below, formulate a
null hypothesis H0, and an appropriate alternative Ha, to use in testing
the claim that Method B is more effective.
Method A:  71  75  65  69  73  66  68  71  74  68
Method B:  72  77  84  78  69  70  77  73  65  75

Solution:
We do return to this example later to provide a solution to the problem
posed; for now, we address only the issue of formulating the hypotheses
to be tested.
Let μA represent the true mean score for engineers trained by
Method A, and μB, the true mean score for those trained by the other
method. The status quo postulate is to presume that there is no difference between the two methods; that any observed difference is due
to pure chance alone. The key now is to inquire: if there is evidence in
the data that contradicts this status quo postulate, what end result are
we interested in testing this evidence against? Since the claim we are
interested in confirming or refuting is that Method B is more effective,
then the proper formulation of the hypotheses to be tested is as follows:
H0 : μA = μB
Ha : μA < μB     (15.4)

By formulating the problem in this fashion, any evidence that contradicts the null hypothesis will cause us to reject it in favor of something
that is actually relevant to the problem at hand.
Note that in this case specifying Ha as μA ≠ μB does not help us
answer the question posed; by the same token, neither does specifying
Ha as μA > μB, because if it is true that μA < μB, then the evidence in
the data will not support the alternative, a circumstance which, by
default, will manifest as a misleading lack of evidence to reject H0.

Thus, in formulating statistical hypotheses, it is customary to state H0
as the "no difference," "nothing-interesting-is-happening" hypothesis; the alternative, Ha, is then selected to answer the question of interest when there
is evidence in the data to contradict the null hypothesis. (See Section 15.10
below for additional discussion about this and other related issues.)
A classic illustration of these principles is the US legal system in which
a defendant is considered innocent until proven guilty. In this case, the null
hypothesis is that this defendant is no different from any other innocent individual; after evidence has been presented to the jury by the prosecution, the
verdict is handed down either that the defendant is "guilty" (i.e., rejecting the
null hypothesis) or the defendant is "not guilty" (i.e., failing to reject the null
hypothesis). Note that the defendant is not said to be "innocent"; instead,
the defendant is pronounced "not guilty," which is tantamount to a decision
not to reject the null hypothesis.
Because hypotheses are statements about populations, and, as with estimation, hypothesis tests are based on finite-sized sample data, such tests are
subject to random variability and are therefore only meaningful in a probabilistic sense. This leads us to the next set of definitions and terminology.
Test Statistic, Critical Region and Significance Level
To test a hypothesis, H0, about a population parameter, θ, for a random
variable, X, against an alternative, Ha, a random sample, X1, X2, . . . , Xn,
is acquired, from which an estimator for θ, say U(X1, X2, . . . , Xn), is then
obtained. (Recall that U is a random variable whose specific value will vary
from sample to sample.)
A test statistic, QT(U, θ), is an appropriate function of the parameter θ
and its estimator, U, that will be used to determine whether or not to reject
H0. (What "appropriate" means will be clarified shortly.) A critical region (or
rejection region), RC, is a region representing the numerical values of the test
statistic (QT > q, or QT < q, or both) that will trigger the rejection of H0;
i.e., if QT ∈ RC, H0 will be rejected. Strictly speaking, the critical region is

for the random variable, X; but since the random sample from X is usually
converted to a test statistic, there is a corresponding mapping of this region
by QT(·); it is therefore acceptable to refer to the critical region in terms of
the test statistic.

FIGURE 15.1: A distribution for the null hypothesis, H0, in terms of the test statistic,
QT, where the shaded rejection region, QT > q, indicates a significance level, α.
Now, because the estimator U is a random variable, the test statistic will
itself also be a random variable, with the following serious implication: there is
a non-zero probability that QT ∈ RC even when H0 is true. This unavoidable
consequence of random variability forces us to design the hypothesis test such
that H0 is rejected only if it is "highly unlikely" for QT ∈ RC when H0 is true.
How unlikely is "highly unlikely"? This is quantified by specifying a value α
such that

P(QT ∈ RC | H0 true) ≤ α     (15.5)

with the implication that the probability of rejecting H0, when it is in fact
true, is never greater than α. This quantity, often set in advance as a small
value (typically, 0.1, 0.05, or 0.01), is called the significance level of the test.
Thus, the significance level of a test is the upper bound on the probability of
rejecting H0 when it is true; it determines the boundaries of the critical region
RC.
These concepts are illustrated in Fig 15.1 and lead directly to the consideration of the potential errors to which hypothesis tests are susceptible, the
associated risks, and the sensitivity of a test in leading to the correct decision.

Potential Errors, Risks, and Power


Hypothesis tests are susceptible to two types of errors:

TABLE 15.1: Hypothesis test decisions and risks

                          DECISION
TRUTH       Fail to Reject H0          Reject H0
H0 True     Correct Decision           Type I Error
            Probability: (1 - α)       Risk: α
Ha True     Type II Error              Correct Decision
            Risk: β                    Probability: (1 - β)

1. TYPE I error: the error of rejecting H0 when it is in fact true. This is
the legal equivalent of convicting an innocent defendant.

2. TYPE II error: the error of failing to reject H0 when it is false, the legal
equivalent of letting a guilty defendant go scot-free.
Of course a hypothesis test can also result in the correct decision in two
ways: rejecting the null hypothesis when it is false, or failing to reject the null
hypothesis when it is true.
From the definition of the critical region, RC, and the significance level, α,
the probability of committing a Type I error is α; i.e.,

P(QT ∈ RC | H0 true) = α     (15.6)

It is therefore called the α-risk. The probability of correctly refraining from
rejecting H0 when it is true will be (1 - α).
By the same token, it is possible to compute the probability of committing
a Type II error. It is customary to refer to this value as β, i.e.,

P(QT ∉ RC | H0 false) = β     (15.7)

so that the probability of committing a Type II error is called the β-risk. The
probability of correctly rejecting a null hypothesis that is false is therefore
(1 - β).
It is important now to note that the two correct decisions and the probabilities associated with each one are fundamentally different. Primarily because
H0 is the status quo hypothesis, correctly rejecting a null hypothesis, H0,
that is false is of greater interest because such an outcome indicates that the
test has detected the occurrence of something significant. Thus, (1 - β), the
probability of correctly rejecting the false null hypothesis when the alternative
hypothesis is true, is known as the power of the test. It provides a measure of
the sensitivity of the test. These concepts are summarized in Table 15.1 and
also in Fig 15.2.
Sensitivity and Specificity
Because their results are binary decisions (reject H0 or fail to reject it),
hypothesis tests belong in the category of binary classification tests; and the

effectiveness of such tests is characterized in terms of sensitivity and specificity. The sensitivity of a test is the percentage of "true positives" (in this
case, H0 deserving of rejection) that it correctly classifies as such. The specificity is the percentage of "true negatives" (H0 that should not be rejected)
that is correctly classified as such. Sensitivity therefore measures the ability to identify true positives correctly; specificity, the ability to identify true
negatives correctly.

FIGURE 15.2: Overlapping distributions for the null hypothesis, H0 (with mean μ0),
and alternative hypothesis, Ha (with mean μa), showing Type I and Type II error risks
α, β, along with qc, the boundary of the critical region of the test statistic, QT.
These performance measures are related to the risks and errors discussed
previously. If the percentages are expressed as probabilities, then sensitivity
is (1 - β), and specificity, (1 - α). The fraction of "false positives" (H0 that
should not be rejected but is) is α; the fraction of "false negatives" (H0 that
should be rejected but is not) is β. As we show later, for a fixed sample size,
improving one measure can only be achieved at the expense of the other, i.e.,
improvements in specificity must be traded off for a commensurate loss of
sensitivity, and vice versa.
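To make this trade-off concrete, the following short Python sketch (an illustration added here, not part of the original discussion; the specific values of μ0, μa, σ, n and the α = 0.05 cutoff are arbitrary assumptions) computes α, β, sensitivity, and specificity for the overlapping normal sampling distributions pictured in Fig 15.2, using the scipy library:

# Illustrative sketch (assumed numbers): alpha- and beta-risks for an
# upper-tailed test whose statistic (the sample mean) is normally
# distributed under both H0 and Ha.
from math import sqrt
from scipy.stats import norm

mu0, mua = 75.5, 76.5       # assumed means under H0 and under Ha
sigma, n = 1.5, 50          # assumed known sigma and sample size
se = sigma / sqrt(n)        # standard error of the sample mean

qc = norm.ppf(0.95, loc=mu0, scale=se)   # cutoff q_c chosen so alpha = 0.05

alpha = norm.sf(qc, loc=mu0, scale=se)   # P(Q > q_c | H0 true): Type I risk
beta = norm.cdf(qc, loc=mua, scale=se)   # P(Q < q_c | Ha true): Type II risk
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")
print(f"sensitivity (power) = {1 - beta:.3f}, specificity = {1 - alpha:.3f}")

Moving the cutoff qc to the right lowers α but raises β, which is precisely the sensitivity/specificity trade-off described above.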
The p-value
Rather than fix the significance level, α, ahead of time, suppose it is free to
vary. For any given value of α, let the corresponding critical/rejection region
be represented as RC(α). As discussed above, H0 is rejected whenever the
test statistic, QT, is such that QT ∈ RC(α). For example, from Fig 15.1, the
region RC(α) is the set of all values of QT that exceed the specific value q.
Observe that as α decreases, the size of the set RC(α) also decreases, and
vice versa. The smallest value of α for which the specific value of the test
statistic QT(x1, x2, . . . , xn) (determined from the data set x1, x2, . . . , xn) falls
in the critical region (i.e., QT(x1, x2, . . . , xn) ∈ RC(α)) is known as the p-value
associated with this data set (and the resulting test statistic). Technically,

therefore, the p-value is the smallest significance level at which H0 will be
rejected given the observed data.
This somewhat technical definition of the p-value is sometimes easier to
understand as follows: given specific observations x1, x2, . . . , xn and the corresponding test statistic QT(x1, x2, . . . , xn) computed from them to yield the
specific value q, the p-value associated with the observations and the corresponding test statistic is defined by the following probability statement:

p = P[QT(x1, x2, . . . , xn; θ) ≥ q | H0]     (15.8)

In words, this is the probability of obtaining the specific test statistic value, q,
or something more extreme, if the null hypothesis is true. Note that p, being
a function of a statistic, is itself a statistic, a subtle point that is often easy
to miss; the implication is that p is itself subject to purely random variability.
Knowing the p-value therefore allows us to carry out hypothesis tests at
any significance level, without restriction to pre-specified values. In general,
a low value of p indicates that, given the evidence in the data, the null hypothesis, H0, is highly unlikely to be true. This follows from Eq (15.8). H0 is
then rejected at the significance level, p, which is why the p-value is sometimes
referred to as the "observed" significance level: observed from the sample data,
as opposed to being fixed, a priori, at some pre-specified value, α.
Nevertheless, in many applications (especially in scientific publications),
there is an enduring traditional preference for employing fixed significance
levels (usually α = 0.05). In this case, the p-value is used to make decisions
as follows: if p < α, H0 will be rejected at the significance level α; if p > α,
we fail to reject H0 at the same significance level α.
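As a simple computational illustration of this decision rule (a sketch added here, not from the text; the observed statistic value q is an arbitrary assumption), the tail probability in Eq (15.8) can be evaluated directly whenever the null distribution of QT is known, e.g., standard normal:

# Sketch: p-value as the null-distribution tail area beyond the observed
# statistic value q (Eq 15.8), for a standard normal test statistic.
from scipy.stats import norm

q = 1.72                            # observed test statistic value (assumed)
p_upper = norm.sf(q)                # upper-tailed: P(Q >= q | H0)
p_two_sided = 2 * norm.sf(abs(q))   # two-sided: P(|Q| >= |q| | H0)

alpha = 0.05
print(f"p (two-sided) = {p_two_sided:.3f}")
print("reject H0" if p_two_sided < alpha else "fail to reject H0")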

15.2.2 General Procedure

The general procedure for carrying out modern hypothesis tests is as follows:
1. Define H0, the hypothesis to be tested, and pair it with the alternative
Ha, formulated appropriately to answer the question at hand;
2. Obtain sample data, and from it, the test statistic relevant to the problem at hand;
3. Make a decision about H0 as follows: Either
(a) Specify the significance level, α, at which the test is to be performed, and hence determine the critical region (equivalently, the
critical value of the test statistic) that will trigger rejection; then
(b) Evaluate the specific test statistic value in relation to the critical
region and reject, or fail to reject, H0 accordingly;
or else,


(a) Compute the p-value corresponding to the test statistic, and


(b) Reject, or fail to reject, H0 accordingly on this basis.
How this general procedure is applied depends on the specific problem at hand:
the nature of the random variable, hence the underlying postulated population
itself; what is known or unknown about the population; the particular population parameter that is the subject of the test; and the nature of the question to
be answered. The remainder of this chapter is devoted to presenting the principles and mechanics of the various hypothesis tests commonly encountered in
practice, some of which are so popular that they have acquired recognizable
names (for example the z-test, t-test, χ²-test, F-test, etc.). By taking time to
provide the principles along with the mechanics, our objective is to supply
the reader with the sort of information that should help to prevent the surprisingly common mistake of misapplying some of these tests. The chapter
closes with a brief discussion of some criticisms and potential shortcomings of
classical hypothesis testing.

15.3 Concerning Single Mean of a Normal Population

Let us return to the illustrative statements made earlier in this chapter
regarding the yields from two competing chemical processes. In particular, let
us recall the first half of the statement about the yield of process A, that
YA ~ N(75.5, 1.5²). Suppose that we are first interested in testing the validity
of this statement by inquiring whether or not the true mean of the process
yield is 75.5. The starting point for this exercise is to state the null hypothesis,
which in this case is:

H0 : μA = 75.5     (15.9)

since 75.5 is the specific postulated value for the unknown population mean
μA. Next, we must attach an appropriate alternative hypothesis. The original statement is a categorical one that YA comes from the distribution
N(75.5, 1.5²), with the hope of being able to use this statement to distinguish the YA distribution from the YB distribution. (How this latter task is
accomplished is discussed later.) Thus, the only alternative we are concerned
about, should H0 prove false, is that the true mean is not equal to 75.5; we
do not care if the true mean is less than, or greater than, the postulated value.
In this case, the appropriate Ha is therefore:

Ha : μA ≠ 75.5     (15.10)

Next, we need to gather evidence in the form of sample data from process
A. Such data, with n = 50, was presented in Chapter 1 (and employed in the
examples of Chapter 14), from which we have obtained a sample average,

ȳA = 75.52. And now, the question to be answered by the hypothesis test is


as follows: is the observed difference between the postulated true population
mean, μA = 75.5, and the sample average computed from sample process data,
ȳA = 75.52, due purely to random variation, or does it indicate a real (and
significant) difference between postulate and data? From Chapters 13 and 14,
we now know that answering this question requires a sampling distribution
that describes the variability intrinsic to samples. In this specific case, we
know that for a sample average X̄ obtained from a random sample of size n
from a N(μ, σ²) distribution, the statistic

Z = (X̄ - μ)/(σ/√n)     (15.11)

has the standard normal distribution, provided that σ is known. This immediately suggests, within the context of hypothesis testing, that the following
test statistic:

Z = (ȳA - 75.5)/(1.5/√n)     (15.12)

may be used to test the validity of the hypothesis, for any sample average
computed from any sample data set of size n. This is because we can use
Z and its pdf to determine the critical/rejection region. In particular, by
specifying a significance level α = 0.05, the rejection region is determined as
the values z such that:

RC = {z | z < -z0.025; z > z0.025}     (15.13)

(because this is a two-sided test). From the cumulative probability characteristics of the standard normal distribution, we obtain (using computer programs
such as MINITAB) z0.025 = 1.96 as the value of the standard normal variate
for which P(Z > z0.025) = 0.025, i.e.,

RC = {z | z < -1.96; z > 1.96}; or |z| > 1.96     (15.14)

The implication: if the specific value computed for Z from any sample data
set exceeds 1.96 in absolute value, H0 will be rejected.
In the specific case of ȳA = 75.52 and n = 50, we obtain a specific value
for this test statistic as z = 0.094. And now, because this value z = 0.094 does
not lie in the critical/rejection region defined in Eq (15.14), we conclude that
there is no evidence to reject H0 in favor of the alternative. The data does
not contradict the hypothesis.
Alternatively, we could compute the p-value associated with this test statistic (for example, using the cumulative probability feature of MINITAB):

P(z > 0.094 or z < -0.094) = P(|z| > 0.094) = 0.925     (15.15)

implying that if H0 is true, the probability of observing, by pure chance alone,
the sample average data actually observed, or something more extreme, is
very high at 0.925. Thus, there is no evidence in this data set to justify rejecting H0. From a different perspective, note that this p-value is nowhere close
to being lower than the prescribed significance level, α = 0.05; we therefore
fail to reject the null hypothesis at this significance level.
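For readers who wish to reproduce this computation outside of a statistical package, the following minimal Python sketch (an addition for illustration, assuming the scipy library; the numbers are those quoted above) evaluates Eqs (15.12) and (15.15):

# z-score (Eq 15.12) and two-sided p-value (Eq 15.15) for the process A data.
from math import sqrt
from scipy.stats import norm

ybar, mu0, sigma, n = 75.52, 75.5, 1.5, 50
z = (ybar - mu0) / (sigma / sqrt(n))   # z = 0.094
p = 2 * norm.sf(abs(z))                # P(|Z| > 0.094) = 0.925
print(f"z = {z:.3f}, p = {p:.3f}")     # |z| < 1.96 and p > 0.05: fail to reject H0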
The ideas illustrated by this example can now be generalized. As with
previous discussions in Chapter 14, we organize the material according to the
status of the population standard deviation, σ, because whether it is known
or not determines what sampling distribution, and hence what test statistic, is
appropriate.

15.3.1 σ Known; the z-test

Problem: The random variable, X, possesses a distribution, N(μ, σ²), with
unknown value, μ, but known σ; a random sample, X1, X2, . . . , Xn, is drawn
from this normal population, from which a sample average, X̄, can be computed; a specific value, μ0, is hypothesized for the true population parameter;
and it is desired to test whether the sample indeed came from such a population.
The Hypotheses: In testing such a hypothesis concerning a single mean
of a normal population with known standard deviation, the null hypothesis is typically:

H0 : μ = μ0     (15.16)

where μ0 is the specific value postulated for the population mean (e.g., 75.5
used in the previous illustration). There are three possible alternative hypotheses:

Ha : μ < μ0     (15.17)

for the lower-tailed, one-sided (or one-tailed) alternative hypothesis; or

Ha : μ > μ0     (15.18)

for the upper-tailed, one-sided (or one-tailed) alternative hypothesis; or, finally,
as illustrated above,

Ha : μ ≠ μ0     (15.19)

for the two-sided (or two-tailed) alternative.
Assumptions: The underlying distribution in question is Gaussian, with
known standard deviation, σ, implying that the sampling distribution of X̄ is
also Gaussian, with mean, μ0, and variance, σ²/n, if H0 is true. Hence, the
random variable Z = (X̄ - μ0)/(σ/√n) has a standard normal distribution,
N(0, 1).
Test statistic: The appropriate test statistic is therefore

Z = (X̄ - μ0)/(σ/√n)     (15.20)

FIGURE 15.3: The standard normal variate z = -zα with tail area probability α. The
shaded portion is the rejection region for a lower-tailed test, Ha : μ < μ0.

The specific value obtained for a particular sample data average, x̄, is sometimes called the z-score of the sample data.
Critical/Rejection Regions:
(i) For lower-tailed tests (with Ha : μ < μ0), reject H0 in favor of Ha if:

z < -zα     (15.21)

where zα is the value of the standard normal variate, z, with a tail area
probability of α; i.e., P(z > zα) = α. By symmetry, P(z < -zα) = P(z >
zα) = α, as shown in Fig 15.3. The rationale is that if μ = μ0 is true, then
it is highly unlikely that z will be less than -zα by pure chance alone; it is
more likely that μ is systematically less than μ0 if z is less than -zα.
(ii) For upper-tailed tests (with Ha : μ > μ0), reject H0 in favor of Ha if
(see Fig 15.4):

z > zα     (15.22)

(iii) For two-sided tests (with Ha : μ ≠ μ0), reject H0 in favor of Ha if:

z < -zα/2 or z > zα/2     (15.23)

for the same reasons as above, because if H0 is true, then

P(z < -zα/2 or z > zα/2) = α/2 + α/2 = α     (15.24)

as illustrated in Fig 15.5.


Tests of this type are known as z-tests because of the test statistic (and
sampling distribution) upon which the test is based. Therefore,

FIGURE 15.4: The standard normal variate z = zα with tail area probability α. The
shaded portion is the rejection region for an upper-tailed test, Ha : μ > μ0.

FIGURE 15.5: Symmetric standard normal variates z = -zα/2 and z = zα/2 with
identical tail area probabilities α/2. The shaded portions show the rejection regions for
a two-sided test, Ha : μ ≠ μ0.

TABLE 15.2: Summary of H0 rejection conditions for the one-sample z-test

                   For general α      For α = 0.05
Testing Against    Reject H0 if:      Reject H0 if:
Ha : μ < μ0        z < -zα            z < -1.65
Ha : μ > μ0        z > zα             z > 1.65
Ha : μ ≠ μ0        z < -zα/2          z < -1.96
                   or z > zα/2        or z > 1.96

The one-sample z-test is a hypothesis test concerning the mean of
a normal population where the population standard deviation, σ,
is specified.

The key facts about the z-test for testing H0 : μ = μ0 are summarized in
Table 15.2.
The following two examples illustrate the application of the z-test.
Example 15.2: CHARACTERIZING YIELD FROM PROCESS B
Formulate and test (at the significance level of α = 0.05) the hypothesis
implied by the second half of the statement given at the beginning of this
chapter about the mean yield of process B, i.e., that YB ~ N(72.5, 2.5²).
Use the data given in Chapter 1 and analyzed previously in various
Chapter 14 examples.
Solution:
In this case, as with the YA illustration used to start this section, the
hypotheses to be tested are:

H0 : μB = 72.5
Ha : μB ≠ 72.5     (15.25)

a two-sided test. From the supplied data, we obtain ȳB = 72.47; and
since the population standard deviation, σB, is given as 2.5, the specific
value, z, of the appropriate test statistic, Z (the z-score), from Eq
(15.20), is:

z = (72.47 - 72.50)/(2.5/√50) = -0.084     (15.26)

For this two-sided test, the critical value to the right, zα/2, for α = 0.05,
is:

z0.025 = 1.96     (15.27)

so that the critical/rejection region, RC, is z > 1.96 to the right, in conjunction with z < -1.96 to the left, by symmetry (recall Eq (15.14)).

And now, because the specific value z = -0.084 does not lie in the critical/rejection region, we find no evidence to reject H0 in favor of the alternative. We conclude therefore that YB is very likely well-characterized
by the postulated distribution.
We could also compute the p-value associated with this test statistic:

P(z < -0.084 or z > 0.084) = P(|z| > 0.084) = 0.933     (15.28)

with the following implication: if H0 is true, the probability of observing,
by pure chance alone, the actually observed sample average, ȳB = 72.47,
or something more extreme (further away from the hypothesized mean
of 72.50), is 0.933. Thus, there is no evidence to support rejecting H0.
Furthermore, since this p-value is much higher than the prescribed significance level, α = 0.05, we cannot reject the null hypothesis at this
significance level.

Using MINITAB
It is instructive to walk through the typical procedure for carrying out such
z-tests using computer software, in this case, MINITAB. From the MINITAB
drop down menu, the sequence Stat > Basic Statistics > 1-Sample Z
opens a dialog box that allows the user to carry out the analysis either using data already stored in MINITAB worksheet columns or from summarized
data. Since we already have summarized data, upon selecting the "Summarized data" option, one enters 50 into the "Sample size:" dialog box, 72.47
into the "Mean:" box, and 2.5 into the "Standard deviation:" box; and upon
selecting the "Perform hypothesis test" option, one enters 72.5 for the "Hypothesized mean." The OPTIONS button allows the user to select the confidence
level (the default is 95.0) and the "Alternative" for Ha, with the 3 available options displayed as "less than," "not equal," and "greater than." The
MINITAB results are displayed as follows:
One-Sample Z
Test of mu = 72.5 vs not = 72.5
The assumed standard deviation = 2.5

 N    Mean  SE Mean       95% CI           Z      P
50  72.470    0.354  (71.777, 73.163)  -0.08  0.932
This output links hypothesis testing directly with estimation (as we anticipated in Chapter 14, and as we discuss further below) as follows: "SE
Mean" is the standard error of the mean (σ/√n) from which the 95% confidence interval (shown in the MINITAB output as "95% CI") is obtained
as (71.777, 73.163). Observe that the hypothesized mean, 72.5, is contained
within this interval, with the implication that, since, at the 95% confidence
level, the estimated average encompasses the hypothesized mean, we have no
reason to reject H0 at the significance level of 0.05. The z statistic computed
by MINITAB is precisely what we had obtained in the example; the same is
true of the p-value.
The results of this example (and the ones obtained earlier for YA) may now
be used to answer the first question raised at the beginning of this chapter
(and in Chapter 1) regarding whether or not YA and YB consistently exceed
74.5.
The random variable, YA, has now been completely characterized by the
Gaussian distribution, N(75.5, 1.5²), and YB by N(72.5, 2.5²). From these
probability distributions, we are able to compute the following probabilities
(using MINITAB):

P(YA > 74.5) = 1 - P(YA < 74.5) = 0.748     (15.29)
P(YB > 74.5) = 1 - P(YB < 74.5) = 0.212     (15.30)

The sequence for calculating cumulative probabilities is as follows: Calc >
Prob Dist > Normal, which opens a dialog box for entering the desired parameters: (i) from the choices "Probability density," "Cumulative Probability,"
and "Inverse Cumulative Probability," one selects the second one; "Mean" is
specified as 75.5 for the YA distribution, "Standard deviation" is specified
as 1.5; and upon entering the input constant as 74.5, MINITAB returns the
following results:

Cumulative Distribution Function
Normal with mean = 75.5 and standard deviation = 1.5
    x  P(X<=x)
 74.5  0.252493

from which the required probability is obtained as 1 - 0.252 = 0.748. Repeating
the procedure for YB, with "Mean" specified as 72.5 and "Standard deviation"
as 2.5, produces the result shown in Eq (15.30).
The implication of these results is that process A yields will exceed 74.5%
around three-quarters of the time, whereas with the incumbent process B,
yields exceeding 74.5% will occur only about one-fifth of the time. If profitability
is related to yields that exceed 74.5% consistently, then process A will be
roughly 3.5 times more profitable than the incumbent process B.
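The same tail probabilities are easily reproduced in a scripting environment; the brief Python sketch below (an added illustration using scipy's normal survival function, not part of the original MINITAB-based discussion) evaluates Eqs (15.29) and (15.30):

# Tail probabilities of the two fitted yield distributions (Eqs 15.29-15.30).
from scipy.stats import norm

p_A = norm.sf(74.5, loc=75.5, scale=1.5)   # P(YA > 74.5) = 1 - 0.252 = 0.748
p_B = norm.sf(74.5, loc=72.5, scale=2.5)   # P(YB > 74.5) = 0.212
print(f"P(YA > 74.5) = {p_A:.3f}, P(YB > 74.5) = {p_B:.3f}")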
This next example illustrates how, in solving practical problems, intuitive reasoning without the objectivity of a formal hypothesis test can be
misleading.
Example 15.3: CHARACTERIZING FAST-ACTING RAT
POISON
The scientists at the ACME rat poison laboratories, who have been
working non-stop to develop a new fast-acting formulation that will
break the "thousand-second" barrier, appear to be on the verge of a
breakthrough. Their target is a product that will kill rats within 1000
secs, on average, with a standard deviation of 125 secs. Experimental
tests were conducted in an affiliated toxicology laboratory in which pellets made with a newly developed formulation were administered
to 64 rats (selected at random from an essentially identical population). The results showed an average acting time, x̄ = 1028 secs. The
ACME scientists, anxious to declare a breakthrough, were preparing
to approach management immediately to argue that the observed excess 28 secs, when compared to the stipulated standard deviation of
125 secs, is small and insignificant. The group statistician, in an
attempt to present an objective, statistically-sound argument, recommended instead that a hypothesis test should first be carried out to
rule out the possibility that the mean acting time is still greater than
1000 secs. Assuming that the acting time measurements are normally
distributed, carry out an appropriate hypothesis test and, at the significance level of α = 0.05, make an informed recommendation regarding
the tested rat poison's acting time.
Solution:
For this problem, the null and alternative hypotheses are:

H0 : μ = 1000
Ha : μ > 1000     (15.31)

The alternative has been chosen this way because the concern is that
the acting time may still be greater than 1000 secs. As a result of the
normality assumption, and the fact that σ is specified as 125, the required test is the z-test, where the specific z-score, from Eq (15.20), in
this case is:

z = (1028 - 1000)/(125/√64) = 1.792     (15.32)

The critical value, zα, for α = 0.05 for this upper-tailed one-sided test
is:

z0.05 = 1.65     (15.33)

obtained from MINITAB using the inverse cumulative probability feature for the standard normal probability distribution with tail area probability 0.05, i.e.,

P(Z > 1.65) = 0.05     (15.34)
Thus, the rejection region, RC, is z > 1.65. And now, because z = 1.792
falls into the rejection region, the decision is to reject the null hypothesis at the 5% level. Alternatively, the p-value associated with this test
statistic can be obtained (also from MINITAB, using the cumulative
probability feature) as:

P(z > 1.792) = 0.037     (15.35)

implying that if H0 is true, the probability of observing, by pure chance
alone, the actually observed sample average, 1028 secs, or something
higher, is so small that we are inclined to believe that H0 is unlikely to
be true. Observe that this p-value is lower than the specified significance
level of α = 0.05.

Thus, from these equivalent perspectives, the conclusion is that the
experimental evidence does not support the ACME scientists' premature declaration of a breakthrough; the observed excess 28 secs, in fact,
appears to be significant at the α = 0.05 significance level.

Using the procedure illustrated previously, the MINITAB results for this
problem are displayed as follows:

One-Sample Z
Test of mu = 1000 vs > 1000
The assumed standard deviation = 125

 N    Mean  SE Mean  95% Lower Bound     Z      P
64  1028.0     15.6           1002.3  1.79  0.037
Observe that the z- and p-values agree with what we had obtained earlier;
furthermore, the additional entries, "SE Mean," for the standard error of
the mean, 15.6, and the 95% lower bound on the estimate for the mean,
1002.3, link this hypothesis test to interval estimation. This connection will
be explored more fully later in this section; for now, we note simply that the
95% lower bound on the estimate for the mean, 1002.3, lies entirely to the right
of the hypothesized mean value of 1000. The implication is that, at the 95%
confidence level, it is more likely that the true mean is higher than the value
hypothesized; we are therefore more inclined to reject the null hypothesis in
favor of the alternative, at the significance level 0.05.
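For completeness, the same upper-tailed z-test can be scripted directly; the sketch below (an added Python illustration assuming scipy; numbers as given in the example) reproduces Eqs (15.32) and (15.35):

# Upper-tailed z-test for the rat poison acting time (Example 15.3).
from math import sqrt
from scipy.stats import norm

xbar, mu0, sigma, n = 1028, 1000, 125, 64
z = (xbar - mu0) / (sigma / sqrt(n))   # z = 1.792 (Eq 15.32)
p = norm.sf(z)                         # P(Z > 1.792) = 0.037 (Eq 15.35)
print(f"z = {z:.3f}, p = {p:.3f}")     # z > 1.65 and p < 0.05: reject H0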

15.3.2 σ Unknown; the t-test

When the population standard deviation, σ, is unknown, the sample standard deviation, s, will have to be substituted for it. In this case, one of two
things can happen:
1. If the sample size is sufficiently large (for example, n > 30), s is usually
considered to be a good enough approximation to σ, that the z-test can
be applied, treating s as equal to σ.
2. When the sample size is small, substituting s for σ changes the test
statistic and the corresponding test, as we now discuss.
For small sample sizes, when S is substituted for σ, the appropriate test
statistic becomes

T = (X̄ - μ0)/(S/√n)     (15.36)

which, from our discussion of sampling distributions, is known to possess a
Student's t-distribution, with ν = n - 1 degrees of freedom. This is the small
sample size equivalent of Eq (15.20).
Once more, because of the test statistic, and the sampling distribution
upon which the test is based, this test is known as a t-test. Therefore,

TABLE 15.3: Summary of H0 rejection conditions for the one-sample t-test

                   For general α
Testing Against    Reject H0 if:
Ha : μ < μ0        t < -tα(ν)
Ha : μ > μ0        t > tα(ν)
Ha : μ ≠ μ0        t < -tα/2(ν) or
                   t > tα/2(ν)
(ν = n - 1)

The one-sample t-test is a hypothesis test concerning the mean of
a normal population when the population standard deviation, σ,
is unknown, and the sample size is small.

The t-test is therefore the same as the z-test but with the sample standard
deviation, s, used in place of the unknown σ; it uses the t-distribution (with the
appropriate degrees of freedom) in place of the standard normal distribution
of the z-test. The relevant facts about the t-test for testing H0 : μ = μ0 are
summarized in Table 15.3, the equivalent of Table 15.2 shown earlier. The
specific test statistic, t, is determined by introducing sample data into Eq
(15.36). Unlike the z-test, even after specifying α, we are unable to determine the
specific critical/rejection region because these values depend on the degrees
of freedom (i.e., the sample size). The following example illustrates how to
conduct a one-sample t-test.
Example 15.4: HYPOTHESES TESTING REGARDING ENGINEERING TRAINING PROGRAMS
Assume that the test results shown in Example 15.1 are random samples from normal populations. (1) At a significance level of α = 0.05,
test the hypothesis that the mean score for trainees using method A
is μA = 75, versus the alternative that it is less than 75. (2) Also, at
the same significance level, test the hypothesis that the mean score for
trainees using method B is μB = 75, versus the alternative that it is not.
Solution:
(1) The first thing to note is that the population standard deviations are
not specified; and since the sample size of 10 for each data set is small,
the appropriate test is a one-sample t-test. The null and alternative
hypotheses for the first problem are:

H0 : μA = 75.0
Ha : μA < 75.0     (15.37)

The sample average is obtained from the supplied data as x̄A = 69.0,

with a sample standard deviation, sA = 4.85; the specific T statistic
value is thus obtained as:

t = (69.0 - 75.0)/(4.85/√10) = -3.91     (15.38)

Because this is a lower-tailed, one-sided test, the critical value, -t0.05(9),
is obtained as -1.833 (using MINITAB's inverse cumulative probability
feature, for the t-distribution with 9 degrees of freedom). The rejection
region, RC, is therefore t < -1.833. Observe that the specific t-value
for this test lies well within this rejection region; we therefore reject the
null hypothesis in favor of the alternative, at the significance level 0.05.
Of course, we could also compute the p-value associated with this
particular test statistic; and from the t-distribution with 9 degrees of
freedom we obtain,

P(T(9) < -3.91) = 0.002     (15.39)

using MINITAB's cumulative probability feature. The implication here
is that the probability of observing a difference as large, or larger, between the postulated mean (75) and actual sample average (69), if H0
is true, is so very low (0.002) that it is more likely that the alternative
is true; that the sample average is more likely to have come from a
distribution whose mean is less than 75. Equivalently, since this p-value
is less than the significance level 0.05, we reject H0 at this significance
level.
(2) The hypotheses to be tested in this case are:

H0 : μB = 75.0
Ha : μB ≠ 75.0     (15.40)

From the supplied data, the sample average and standard deviation are
obtained respectively as x̄B = 74.0, and sB = 5.40, so that the specific
value for the T statistic is:

t = (74.0 - 75.0)/(5.40/√10) = -0.59     (15.41)

Since this is a two-tailed test, the critical values, t0.025(9) and its mirror
image -t0.025(9), are obtained from MINITAB as 2.26 and -2.26, implying that the critical/rejection region, RC, in this case is t < -2.26
or t > 2.26. But the specific value for the t-statistic (-0.59) does not
lie in this region; we therefore do not reject H0 at the significance level
0.05.
The associated p-value, obtained from a t-distribution with 9 degrees
of freedom, is:

P(t(9) < -0.59 or t(9) > 0.59) = P(|t(9)| > 0.59) = 0.572     (15.42)

with the implication that we do not reject the null hypothesis, either
on the basis of the p-value, or else at the 0.05 significance level, since
p = 0.572 is larger than 0.05.

Thus, observe that with these two t-tests, we have established, at a significance
level of 0.05, that the mean score obtained by trainees using method A is less
than 75 while the mean score for trainees using method B is essentially equal
to 75. We can, of course, infer from here that this means that method B must
be more effective. But there are more direct methods for carrying out tests to
compare two means directly, which will be considered shortly.
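These one-sample t-tests can also be scripted; the following Python sketch (an added illustration; scipy is assumed, and the alternative keyword of scipy's built-in test requires scipy 1.6 or later) reproduces part (1) from the quoted summary statistics:

# One-sample t-test for Method A (Example 15.4, part 1), from summary stats.
from math import sqrt
from scipy.stats import t as t_dist

n, xbar, s, mu0 = 10, 69.0, 4.85, 75.0
t = (xbar - mu0) / (s / sqrt(n))    # t = -3.91 (Eq 15.38)
p = t_dist.cdf(t, df=n - 1)         # P(T(9) < -3.91) = 0.002 (Eq 15.39)
print(f"t = {t:.2f}, p = {p:.3f}")
# With the raw scores in an array, scipy.stats.ttest_1samp(scores, 75,
# alternative='less') would return the same t and p directly.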
Using MINITAB
MINITAB can be used to carry out these t-tests directly (without having to
compute, by ourselves, first the test statistic and then the critical region, etc.).
After entering the data into separate columns, "Method A" and "Method B,"
in a MINITAB worksheet, for the first problem, the sequence Stat > Basic
Statistics > 1-Sample t from the MINITAB drop down menu opens a dialog box where one selects the column containing the data (Method A);
and upon selecting the "Perform hypothesis test" option, one enters the appropriate value for the "Hypothesized mean" (75) and with the OPTIONS
button one selects the desired "Alternative" for Ha (less than) along with the
default confidence level (95.0).
MINITAB provides three self-explanatory graphical options: "Histogram
of data"; "Individual value plot"; and "Boxplot of data." Our discussion in
Chapter 12 about graphical plots for small sample data sets recommends that,
with n = 10 in this case, the box plot is more reasonable than the histogram
for this example.
The resulting MINITAB outputs are displayed as follows:

One-Sample T: Method A
Test of mu = 75 vs < 75

                                    95% Upper
Variable  N   Mean  StDev  SE Mean      Bound      T      P
Method A 10  69.00   4.85     1.53      71.81  -3.91  0.002

The box-plot along with the 95% confidence interval estimate and the
hypothesized mean H0 = 75 are shown in Fig 15.6. The conclusion to reject
the null hypothesis in favor of the alternative is clear.
In dealing with the second problem regarding Method B, we follow the
same procedure, selecting data in the "Method B" column, but this time, the
"Alternative" is selected as "not equal." The MINITAB results are displayed
as follows:

One-Sample T: Method B
Test of mu = 75 vs not = 75

Variable  N   Mean  StDev  SE Mean      95% CI           T      P
Method B 10  74.00   5.40     1.71  (70.14, 77.86)   -0.59  0.572

The box-plot along with the 95% confidence interval for the mean and the
hypothesized mean H0 = 75 are shown in Fig 15.7.

FIGURE 15.6: Box plot for "Method A" scores including the null hypothesis mean,
H0 : μ = 75, shown along with the sample average, x̄, and the 95% confidence interval
based on the t-distribution with 9 degrees of freedom. Note how the upper bound of the
95% confidence interval lies to the left of, and does not touch, the postulated H0 value.

FIGURE 15.7: Box plot for "Method B" scores including the null hypothesis mean,
H0 : μ = 75, shown along with the sample average, x̄, and the 95% confidence interval
based on the t-distribution with 9 degrees of freedom. Note how the 95% confidence
interval includes the postulated H0 value.

15.3.3 Confidence Intervals and Hypothesis Tests

Interval estimation techniques discussed in Chapter 14 produced estimates
for the parameter θ in the form of an interval, (uL < θ < uR), that is expected
to contain the unknown parameter with probability (1 - α); it is therefore
known as the (1 - α) × 100% confidence interval.
Now observe, first for a two-tailed test, from the definition of the critical/rejection region, RC, given above, that at the significance level, α, RC
is precisely complementary to the (1 - α) × 100% confidence interval for the
estimated parameter. The implication therefore is as follows: if the postulated
population parameter (say θ0) falls outside the (1 - α) × 100% confidence
interval estimated from sample data (i.e., the postulated value is higher than
the upper bound to the right, or lower than the lower bound to the left), this
triggers the rejection of H0, that θ = θ0, at the significance level of α, in
favor of the alternative Ha, that θ ≠ θ0. Conversely, if the postulated θ0 falls
within the (1 - α) × 100% confidence interval, we will fail to reject H0. This
is illustrated in Example 15.2 for the mean yield of process B. The 95% confidence interval was obtained as (70.74, 74.20), which fully encompasses the
hypothesized mean value of 72.5; hence we do not reject H0 at the 0.05 significance level. Similarly, in part 2 of Example 15.4, the 95% confidence interval
on the average method B score was obtained as (70.14, 77.86); and with the
hypothesized mean, 75, lying entirely in this interval (as shown graphically in
Fig 15.7), once again, we find no evidence to reject H0 at the 0.05 significance
level.
For an upper-tailed test (with Ha defined as Ha : θ > θ0), it is the lower
bound of the (1 - α) × 100% confidence interval that is now of interest. Observe
that if the hypothesized value, θ0, is to the left of this lower bound (i.e., it
is lower than the lowest value of the (1 - α) × 100% confidence interval),
the implication is twofold: (i) the computed estimate falls in the rejection
region; and, equivalently, (ii) the value estimated from data is larger than the
hypothesized value, both of which support the rejection of H0 in favor of
Ha, at the significance level of α. This is illustrated in Example 15.3, where the
lower bound of the estimated "acting time" for the rat poison was obtained
(from MINITAB) as 1002.3 secs, whereas the postulated mean is 1000. H0 is
therefore rejected at the 0.05 significance level in favor of Ha, that the mean
value is higher. On the other hand, if the hypothesized value, θ0, is to the
right of this lower bound, there will be no support for rejecting H0 at the 0.05
significance level.
The reverse is true for the lower-tailed test with Ha : θ < θ0. The upper
bound of the (1 - α) × 100% confidence interval is of interest; and if the
hypothesized value, θ0, is to the right of this upper bound (i.e., it is larger than
the largest value of the (1 - α) × 100% confidence interval), this hypothesized
value would have fallen into the rejection region. Because this indicates that
the value estimated from data is smaller than the hypothesized value, the
evidence supports the rejection of H0 in favor of Ha, at the 0.05 significance

level. Again, this is illustrated in part 1 of Example 15.4. The upper bound
of the 95% confidence interval on the average method A score was obtained
as 71.81, which is lower than the postulated average of 75, thereby triggering
the rejection of H0 in favor of Ha, at the 0.05 significance level (see Fig 15.6).
Conversely, when the hypothesized value, θ0, is to the left of this upper bound,
we will fail to reject H0 at the 0.05 significance level.

15.4 Concerning Two Normal Population Means

The problem of interest involves two distinct and mutually independent
normal populations, with respective unknown means μ1 and μ2. In general,
we are interested in making inferences about the difference between these two
means, i.e.,

μ1 - μ2 = δ     (15.43)

The typical starting point is the null hypothesis,

H0 : μ1 - μ2 = δ0     (15.44)

where the difference between the two population means is postulated as some
value δ0, and the hypothesis is to be tested against the usual triplet of possible
alternatives:

Lower-tailed Ha : μ1 - μ2 < δ0     (15.45)
Upper-tailed Ha : μ1 - μ2 > δ0     (15.46)
Two-tailed Ha : μ1 - μ2 ≠ δ0     (15.47)

In particular, specifying δ0 = 0 constitutes a test of equality of the two means;
but δ0 does not necessarily have to be zero, allowing us to test the difference
against any arbitrary postulated value.
As with tests of single population means, this test will be based on the
difference between two random sample means, X̄1 from population 1, and X̄2
from population 2. These tests are therefore known as two-sample tests;
and, as usual, the specific test to be employed for any problem depends on
what additional information is available about each population's standard
deviation.

15.4.1 Population Standard Deviations Known

When the population standard deviations, σ1 and σ2, are known, we recall
(from the discussion in Chapter 14 on interval estimation of the difference of

two normal population means) that the test statistic:

Z = [(X̄1 - X̄2) - δ0] / √(σ1²/n1 + σ2²/n2) ~ N(0, 1)     (15.48)

where n1 and n2 are the sizes of the samples drawn from populations 1 and
2, respectively. This fact arises from the result established in Chapter 14 for
the sampling distribution of D̄ = X̄1 - X̄2 as N(δ, v²), with δ as defined in
Eq (15.43), and

v² = σ1²/n1 + σ2²/n2     (15.49)

Tests based on this statistic are known as two-sample z-tests, and as with
previous tests, the specific results for testing H0 : μ1 - μ2 = δ0 are summarized
in Table 15.4.

TABLE 15.4: Summary of H0 rejection conditions for the two-sample z-test

                      For general α     For α = 0.05
Testing Against       Reject H0 if:     Reject H0 if:
Ha : μ1 - μ2 < δ0     z < -zα           z < -1.65
Ha : μ1 - μ2 > δ0     z > zα            z > 1.65
Ha : μ1 - μ2 ≠ δ0     z < -zα/2 or      z < -1.96 or
                      z > zα/2          z > 1.96
Let us illustrate the application of this test with the following example.
Example 15.5: COMPARISON OF SPECIALTY AUXILIARY
BACKUP LAB BATTERY LIFETIMES
A company that manufactures specialty batteries used as auxiliary backups for sensitive laboratory equipment in need of constant power supplies claims that its new prototype, brand A, has a longer lifetime (under constant use) than the industry-leading brand B, and at the same
cost. Using accepted industry protocol, a series of tests carried out in
an independent laboratory produced the following results: For brand A:
sample size, n1 = 40; average lifetime, x̄1 = 647 hrs; with a population
standard deviation given as σ1 = 27 hrs. The corresponding results for
brand B are n2 = 40; x̄2 = 638; σ2 = 31. Determine, at the 5% level, if
there is a significant difference between the observed mean lifetimes.
Solution:
Observe that in this case, δ0 = 0, i.e., the null hypothesis is that the two
means are equal; the alternative is that μ1 > μ2, so that the hypotheses
are formulated as:

H0 : μ1 - μ2 = 0
Ha : μ1 - μ2 > 0     (15.50)

The specific test statistic obtained from the experimental data is:

z = [(647 - 638) - 0] / √(27²/40 + 31²/40) = 1.38     (15.51)

For this one-tailed test, the critical value, z0.05, is 1.65; and now, since
the computed z-score is not greater than 1.65, we cannot reject the
null hypothesis. There is therefore insufficient evidence to support the
rejection of H0 in favor of Ha, at the 5% significance level.
Alternatively, we could compute the p-value and obtain:

p = P(Z > 1.38) = 1 - P(Z < 1.38) = 1 - 0.916 = 0.084     (15.52)

Once again, since this p-value is greater than 0.05, we cannot reject H0
in favor of Ha, at the 5% significance level. (However, observe that at
the 0.1 significance level, we will reject H0 in favor of Ha, since the
p-value is less than 0.1.)
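The arithmetic of this two-sample z-test is compactly scripted as follows (an added Python sketch, assuming scipy; numbers as quoted in the example):

# Two-sample z-test for the battery lifetimes (Example 15.5).
from math import sqrt
from scipy.stats import norm

x1, sigma1, n1 = 647, 27, 40   # brand A
x2, sigma2, n2 = 638, 31, 40   # brand B
z = ((x1 - x2) - 0) / sqrt(sigma1**2 / n1 + sigma2**2 / n2)  # Eq (15.48)
p = norm.sf(z)   # upper-tailed p-value, about 0.083 (0.084 in Eq 15.52,
                 # after first rounding z to 1.38)
print(f"z = {z:.2f}, p = {p:.3f}")  # p > 0.05: fail to reject H0 at the 5% level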

15.4.2 Population Standard Deviations Unknown

In most practical cases, it is rare that the two population standard deviations are known. Under these circumstances, we are able to identify three
distinct cases requiring different approaches:
1. σ1 and σ2 unknown; large sample sizes n1 and n2;
2. Small sample sizes; σ1 and σ2 unknown, but equal (i.e., σ1 = σ2);
3. Small sample sizes; σ1 and σ2 unknown, and unequal (i.e., σ1 ≠ σ2).
As usual, under the first set of conditions, the sample standard deviations,
s1 and s2, are considered to be sufficiently good approximations to the respective unknown population parameters; they are then used in place of σ1
and σ2 in carrying out the two-sample z-test as outlined above. Nothing more
need be said about this case. We will concentrate on the remaining two cases
where the sample sizes are considered to be small.
Equal standard deviations
When the two population standard deviations are considered as equal, the
test statistic:

T = [(X̄1 - X̄2) - δ0] / √(Sp²/n1 + Sp²/n2) ~ t(ν)     (15.53)

i.e., its sampling distribution is a t-distribution with ν degrees of freedom,
with

ν = n1 + n2 - 2     (15.54)

TABLE 15.5: Summary of H0 rejection conditions for the two-sample t-test

                      For general α
Testing Against       Reject H0 if:
Ha : μ1 - μ2 < δ0     t < -tα(ν)
Ha : μ1 - μ2 > δ0     t > tα(ν)
Ha : μ1 - μ2 ≠ δ0     t < -tα/2(ν) or
                      t > tα/2(ν)
(ν = n1 + n2 - 2)

Here, Sp is the pooled sample standard deviation, obtained as the positive
square root of the pooled sample variance, a weighted average of the two
sample variances:

Sp² = [(n1 - 1)S1² + (n2 - 1)S2²] / (n1 + n2 - 2)     (15.55)

a reasonable estimate of the (equal) population variances based on the two
sample variances.
From this test statistic and its sampling distribution, one can now carry
out the two-sample t-test, and, once more, the specific results for testing
H0 : μ1 - μ2 = δ0 against various alternatives are summarized in Table 15.5.
The following example illustrates these results.
The following example illustrates these results.
Example 15.6: HYPOTHESES TEST COMPARING EFFECTIVENESS OF ENGINEERING TRAINING PROGRAMS
Revisit the problem in Example 15.1 and this time, at the 5% signicance level, test the claim that Method B is more eective. Assume that
the scores shown in Example 1 come from normal populations with potentially dierent means, but equal variances.
Solution:
In this case, because the sample size is small for each data set, the appropriate test is a two-sample t-test, with equal variance; the hypotheses
to be tested are:
H0 : A B = 0
Ha : A B < 0

(15.56)

Care must be taken in ensuring that Ha is specied properly. Since the


claim is that Method B is more eective, if the dierence in the means
is specied in H0 as shown (with A rst), then the appropriate Ha
is as we have specied. (We are perfectly at liberty to formulate H0
dierently, with B rst, in which case the alternative hypothesis must
change to Ha : B A > 0.)

580

Random Phenomena
From the sample data, we obtain all the quantities required for comB = 74.0; the
puting the test statistic: the sample means, x
A = 69.0, x
sample standard deviations, sA = 4.85, sB = 5.40; so that the estimated
pooled standard deviation is obtained as:
sp = 5.13
with = 18. To test the observed dierence (d = 69.0 74.0 = 5.0)
against a hypothesized dierence of 0 = 0 (i.e., equality of the means),
we obtain the t-statistic as:
t = 2.18,
which is compared to the critical value for a t-distribution with 18 degrees of freedom,
t0.05 (18) = 1.73.
And since t < t0.05 (18), we reject the null hypothesis in favor of the
alternative, and conclude that, at the 5% signicance level, the evidence
in the data supports the claim that Method B is more eective.
Note also that the associated p-value, obtained from a t distribution
with 18 degrees of freedom, is:
P (t(18) < 2.18) = 0.021

(15.57)

which, by virtue of being less than 0.05 recommends rejection of H0 in


favor of Ha , at the 5% signicance level, as we already concluded above.
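The pooled two-sample t-test just carried out by hand is also available directly from summary statistics in scipy (an added Python sketch; the alternative keyword assumes scipy 1.6 or later, and alternative='less' matches Ha as formulated in Eq (15.56)):

# Pooled two-sample t-test (Example 15.6) from summary statistics.
from scipy.stats import ttest_ind_from_stats

res = ttest_ind_from_stats(
    mean1=69.0, std1=4.85, nobs1=10,   # Method A
    mean2=74.0, std2=5.40, nobs2=10,   # Method B
    equal_var=True,                    # pooled variance, Eq (15.55)
    alternative='less')                # Ha: mu_A - mu_B < 0
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.3f}")  # t = -2.18, p = 0.021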

Using MINITAB
This just-concluded example illustrates the mechanics of how to conduct
a two-sample t-test manually; once the mechanics are understood, however,
it is recommended to use computer programs such as MINITAB.
As noted before, once the data sets have been entered into separate
columns "Method A" and "Method B" in a MINITAB worksheet (as was the
case in the previous Example 15.4), the required sequence from the MINITAB
drop down menu is: Stat > Basic Statistics > 2-Sample t, which opens
a dialog box with self-explanatory options. Once the locations of the relevant
data are identified, the "Assume equal variance" box is selected in this case,
and with the OPTIONS button, one selects the "Alternative" for Ha ("less
than," if the hypotheses are set up as we have done above), along with the
default confidence level (95.0); one enters the value for the hypothesized difference, δ0, in the "Test difference" box (0 in this case). The resulting MINITAB
outputs for this problem are displayed as follows:

Two-Sample T-Test and CI: Method A, Method B
Two-sample T for Method A vs Method B

          N   Mean  StDev  SE Mean
Method A 10  69.00   4.85      1.5
Method B 10  74.00   5.40      1.7

Difference = mu (Method A) - mu (Method B)
Estimate for difference: -5.00
95% upper bound for difference: -1.02
T-Test of difference = 0 (vs <): T-Value = -2.18 P-Value = 0.021 DF = 18
Both use Pooled StDev = 5.1316
Unequal standard deviations
When σ1 ≠ σ2, things become a bit more complicated, and a detailed
discussion lies outside the intended scope of this book. Suffice it to say that
under these circumstances, the universally recommended test statistic is T*,
defined as:

T* = [(X̄1 - X̄2) - δ0] / √(S1²/n1 + S2²/n2)     (15.58)

which appears deceptively like Eq (15.53), with the very important difference
that S1 and S2 have been reinstated individually in place of the pooled Sp.
Of course, this expression is also reminiscent of the Z statistic in Eq (15.48),
with S1 and S2 introduced in place of the population variances. However,
unlike the other single variable cases where such a substitution transforms
the standard normal sampling distribution to the t-distribution with the appropriate degrees of freedom, unfortunately, this time, this test statistic only
has an approximate (not exact) t-distribution; and the degrees of freedom, ν*,
accompanying this approximate t-distribution is defined by:

ν* = n*12 - 2     (15.59)

with n*12 defined by the formidable-looking expression

n*12 = (S1²/n1 + S2²/n2)² / [(S1²/n1)²/(n1 + 1) + (S2²/n2)²/(n2 + 1)]     (15.60)

rounded to the nearest integer.


Under these conditions, the specific results for carrying out two-sample t-tests for testing H0 : μ1 - μ2 = δ0 against various alternatives are summarized
in Table 15.5, but with t* in place of the corresponding t-values, and using ν*
given above in Eqs (15.59) and (15.60) for the degrees of freedom.
Although it is possible to carry out such two-sample t-tests manually by
computing the required quantities on our own, it is highly recommended that
such tests be carried out using computer programs such as MINITAB.
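As a guide to what such programs compute, the sketch below (an added Python illustration) implements the degrees-of-freedom formula of Eqs (15.59) and (15.60). Note, as an aside, that many packages instead use the closely related Welch-Satterthwaite formula, which has (n - 1) rather than (n + 1) in the denominator terms and no subtraction of 2, so software-reported DF values may differ slightly from this formula's:

# Degrees of freedom for the unequal-variance t-test, per Eqs (15.59)-(15.60).
def approx_dof(s1, n1, s2, n2):
    """nu* = n*12 - 2, with n*12 from Eq (15.60), rounded to an integer."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    n12 = (v1 + v2) ** 2 / (v1**2 / (n1 + 1) + v2**2 / (n2 + 1))
    return round(n12) - 2

# Purely for illustration: the Example 15.6 summaries, had we NOT assumed
# equal variances (sA = 4.85, sB = 5.40, n = 10 each).
print(approx_dof(4.85, 10, 5.40, 10))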
Confidence Intervals and Two-Sample Tests
The relationship between confidence intervals for the difference between
two normal population means and the two-sample tests discussed above perfectly mirrors the earlier discussion concerning single means of a normal population. For the two-sided test, a (1 - α) × 100% confidence interval estimate for
the difference between the two means that does not contain the hypothesized
difference corresponds to a hypothesis test in which H0 is rejected, at the significance level of α, in favor of the alternative that the computed difference is
not equal to the hypothesized difference. Note that with a test of equality (in
which case δ0, the hypothesized difference, is 0), rejection of H0 is tantamount
to the (1 - α) × 100% confidence interval for the difference not containing 0.
On the contrary, an estimated (1 - α) × 100% confidence interval that contains
the hypothesized difference is equivalent to a two-sample test that must fail
to reject H0.
The corresponding arguments for the upper-tailed and lower-tailed tests
follow precisely as presented earlier. For an upper-tailed test (Ha : δ > δ0), a
lower bound of the (1 - α) × 100% confidence interval estimate of the difference,
δ, that is larger than the hypothesized difference, δ0, corresponds to a two-sample test in which H0 is rejected in favor of Ha, at the significance level
of α. Conversely, a lower bound of the confidence interval estimate of the
difference, δ, that is smaller than the hypothesized difference, δ0, corresponds
to a test that will not reject H0. The reverse is the case for the lower-tailed
test (Ha : δ < δ0): when the upper bound of the (1 - α) × 100% confidence
interval estimate of δ is smaller than δ0, H0 is rejected in favor of Ha. When
the upper bound of the (1 - α) × 100% confidence interval estimate of δ is
larger than δ0, H0 is not rejected.
An Illustrative Example: The Yield Improvement Problem
The solution to the yield improvement problem first posed in Chapter 1,
and revisited at the beginning of this chapter, will finally be completed in
this illustrative example. In addition, the example also illustrates the use of
MINITAB to carry out a two-sample t-test when population variances are
not equal.
The following questions remain to be resolved: Is YA > YB, and if so, is
YA − YB > 2? Having already confirmed that the random variables, YA and
YB, can be characterized reasonably well with Gaussian distributions,
N(μA, σA²) and N(μB, σB²), respectively, the supplied data may then be
considered as being from normal distributions with unequal population
variances. We will answer these two questions by carrying out appropriate
two-sample t-tests.
Although the answer to the first of the two questions requires testing for
the equality of μA and μB against the alternative that μA > μB, let us begin
by first testing against μA ≠ μB; this establishes that the two distribution
means are different. Later we will test against the alternative that μA > μB,
and thereby go beyond the mere existence of a difference between the
population means to establish which is larger. Finally, we proceed even one
step further to establish not only which one is larger, but that it is larger by
a value that exceeds a certain postulated value (in this case 2).
For the first test of basic equality, the hypothesized difference is clearly
δ0 = 0, so that:

$$H_0: \mu_A - \mu_B = 0$$
$$H_a: \mu_A - \mu_B \neq 0 \qquad (15.61)$$

The procedure for using MINITAB is as follows: upon entering the data
into separate YA and YB columns in a MINITAB worksheet, the required
sequence from the MINITAB drop-down menu is: Stat > Basic Statistics
> 2-Sample t. In the opened dialog box, one simply selects the "Samples in
different columns" option and identifies the columns corresponding to each
data set, but this time, the "Assume equal variance" box must not be
selected. With the OPTIONS button, one selects the Alternative for Ha as
"not equal," along with the default confidence level (95.0); in the "Test
difference" box, one enters the value for the hypothesized difference, δ0;
0 in this case. The resulting MINITAB outputs for this problem are
displayed as follows:
Two-Sample T-Test and CI: YA, YB

Two-sample T for YA vs YB

      N   Mean  StDev  SE Mean
YA   50  75.52   1.43     0.20
YB   50  72.47   2.76     0.39

Difference = mu (YA) - mu (YB)
Estimate for difference: 3.047
95% CI for difference: (2.169, 3.924)
T-Test of difference = 0 (vs not =): T-Value = 6.92  P-Value = 0.000  DF = 73
Several points are worth noting here:
1. The most important is the p-value, which is virtually zero; the implication
is that at the 0.05 significance level, we must reject the null hypothesis
in favor of the alternative: the two population means are in fact
different, i.e., the observed difference between the population means is
not zero. Note also that the t-statistic value is 6.92, a truly extreme
value for a distribution that is symmetrical about the value 0, and for
which the density value, f(t), essentially vanishes (i.e., f(t) ≈ 0) for
values of the t variate exceeding 4. The p-value is obtained as
P(|T| > 6.92).
2. The estimated sample difference is 3.047, with a 95% confidence interval,
(2.169, 3.924); since this interval does not contain the hypothesized
difference δ0 = 0, the implication is that the test will reject H0, as indeed
we have concluded in point #1 above;
3. Finally, even though there were 50 data entries each for YA and YB, the
degrees of freedom associated with this test is obtained as ν = 73. (See
the expressions in Eqs (15.59) and (15.60) above.)


This first test has therefore established that the means of the YA and YB
populations are different, at the 5% significance level. Next, we wish to test
which of these two different means is larger. To do this, the hypotheses to be
tested are:

$$H_0: \mu_A - \mu_B = 0$$
$$H_a: \mu_A - \mu_B > 0 \qquad (15.62)$$

The resulting outputs from MINITAB are identical to what is shown above
for the first test, except that the "95% CI for difference" line is replaced
with "95% lower bound for difference: 2.313," and "T-Test of difference = 0
(vs not =)" is replaced with "T-Test of difference = 0 (vs >)." The t-value,
p-value, and DF remain the same.
Again, with a p-value that is virtually zero, the conclusion is that, at the
5% significance level, the null hypothesis must be rejected in favor of the
alternative, which, this time, is specifically that μA is greater than μB. Note
that the value 2.313, computed from the data as the 95% lower bound for the
difference, is considerably higher than the hypothesized value of 0; i.e., the
hypothesized δ0 = 0 lies well to the left of this lower bound for the difference.
This is consistent with rejecting the null hypothesis in favor of the
alternative, at the 5% significance level.
With the final test, we wish to sharpen the postulated difference a bit
further. This time, we assert that μA is not only greater than μB; the former
is in fact greater than the latter by a value that exceeds 2. The hypotheses
are set up in this case as follows:

$$H_0: \mu_A - \mu_B = 2$$
$$H_a: \mu_A - \mu_B > 2 \qquad (15.63)$$

This time, in the MINITAB options, the new hypothesized difference is
indicated as 2 in the "Test difference" box. The MINITAB results are
displayed as follows:

Two-Sample T-Test and CI: YA, YB

Two-sample T for YA vs YB

      N   Mean  StDev  SE Mean
YA   50  75.52   1.43     0.20
YB   50  72.47   2.76     0.39

Difference = mu (YA) - mu (YB)
Estimate for difference: 3.047
95% lower bound for difference: 2.313
T-Test of difference = 2 (vs >): T-Value = 2.38  P-Value = 0.010  DF = 73
Note that the t-value is now 2.38 (reflecting the new hypothesized value of
δ0 = 2), with the immediate consequence that the p-value is now 0.010; not
surprisingly, everything else remains the same as in the first test. Thus, at
the 0.05 significance level, we reject the null hypothesis in favor of the
alternative. Note also that the 95% lower bound for the difference is larger
than the hypothesized difference of 2.
The conclusion is therefore that, with 95% confidence (or alternatively, at
a significance level of 0.05), the mean yield obtainable from the challenger
process A is at least 2 points larger than that obtainable from the incumbent
process B.

15.4.3 Paired Differences

A subtle but important variation on the theme of inference concerning
two normal population means arises when the data naturally occur in pairs,
as with the data shown in Table 15.6. This is a record of the "before" and
"after" weights (in pounds) of twenty patients enrolled in a clinically
supervised 10-week weight-loss program. Several important characteristics
set this problem apart from the general two-sample problem:
1. For each patient, the random variable "Weight" naturally occurs as an
ordered pair of random variables (X, Y), with X as the "Before" weight,
and Y as the "After" weight;
2. As a result, it is highly unlikely that the two entries per patient will be
totally independent, i.e., the random sample, X1, X2, ..., Xn, will likely
not be independent of Y1, Y2, ..., Yn;
3. In addition, the sample sizes for each random sample, X1, X2, ..., Xn,
and Y1, Y2, ..., Yn, by definition, will be identical;
4. Finally, it is quite possible that the patient-to-patient variability in each
random variable X or Y (i.e., the variability within each group) may
be much larger than the difference between the groups that we seek to
detect.
These circumstances call for a different approach, especially in light of item
#2 above, which invalidates one of the most crucial assumptions underlying
the two-sample tests: independence of the random samples.
The analysis for this class of problems proceeds as follows: Let (Xi, Yi);
i = 1, 2, ..., n, be an ordered pair of random samples, where X1, X2, ..., Xn
is from a normal population with mean, μX, and variance, σX²; and
Y1, Y2, ..., Yn, a random sample from a normal population with mean, μY,
and variance, σY². Define the difference, D, as:

$$D_i = X_i - Y_i \qquad (15.64)$$

then Di, i = 1, 2, ..., n, constitutes a random sample of differences with mean
value,

$$\mu_D = \mu_X - \mu_Y \qquad (15.65)$$


TABLE 15.6: "Before" and "After" weights for patients on a supervised
weight-loss program

Patient #          1    2    3    4    5    6    7    8    9   10
Before Wt (lbs)  272  319  253  325  236  233  300  260  268  276
After Wt (lbs)   263  313  251  312  227  227  290  251  262  263

Patient #         11   12   13   14   15   16   17   18   19   20
Before Wt (lbs)  215  245  248  364  301  203  197  217  210  223
After Wt (lbs)   206  235  237  350  288  195  193  216  202  214

The quantities required for the hypothesis test are: the sample average,

$$\bar{D} = \frac{\sum_{i=1}^{n} D_i}{n}, \qquad (15.66)$$

(which is unbiased for μD), and the sample variance of the differences,

$$S_D^2 = \frac{\sum_{i=1}^{n} (D_i - \bar{D})^2}{n - 1}. \qquad (15.67)$$

Under these circumstances, the null hypothesis is defined as

$$H_0: \mu_D = \delta_0 \qquad (15.68)$$

when μD, the mean difference between the paired observations, is postulated
as some value δ0. This hypothesis, as usual, is to be tested against the
possible alternatives:

$$\text{Lower-tailed:} \quad H_a: \mu_D < \delta_0 \qquad (15.69)$$
$$\text{Upper-tailed:} \quad H_a: \mu_D > \delta_0 \qquad (15.70)$$
$$\text{Two-tailed:} \quad H_a: \mu_D \neq \delta_0 \qquad (15.71)$$

The appropriate test statistic is

$$T = \frac{\bar{D} - \delta_0}{S_D/\sqrt{n}}; \qquad (15.72)$$

it possesses a t(n − 1) distribution. When used to carry out what is generally
known as the "paired t-test," the results are similar to those obtained for
earlier tests, with the specific rejection conditions summarized in Table 15.7.
The next two examples illustrate the importance of distinguishing between a
paired t-test and a general two-sample test.
Example 15.6: WEIGHT-LOSS DATA ANALYSIS: PART 1
By treating the weight-loss patient data in Table 15.6 as "before" and
"after" ordered pairs, determine, at the 5% level, whether or not the
weight-loss program has been effective in assisting patients to lose weight.


TABLE 15.7: Summary of H0 rejection conditions for the paired t-test

  Testing Against     Reject H0 if (for general α):
  Ha: μD < δ0         t < −t_α(ν)
  Ha: μD > δ0         t > t_α(ν)
  Ha: μD ≠ δ0         t < −t_{α/2}(ν) or t > t_{α/2}(ν)

  (ν = n − 1)

Solution:
This problem requires determining whether the mean difference between
the "before" and "after" weights for the 20 patients is significantly
different from zero. The null and alternative hypotheses are:

$$H_0: \mu_D = 0$$
$$H_a: \mu_D \neq 0 \qquad (15.73)$$

We can compute the twenty before-minus-after weight differences,
obtain the sample average and sample standard deviation of these
differences, and then compute the t-statistic from Eq (15.72) for δ0 = 0.
How this t-statistic compares against the critical value of t0.025(19) will
determine whether or not to reject the null hypothesis.
We can also use MINITAB directly. After entering the data into two
columns "Before WT" and "After WT," the sequence: Stat > Basic
Statistics > Paired t opens the usual analysis dialog box: as with
other hypothesis tests, data columns are identified, and with the
OPTIONS button, the Alternative for Ha is selected as "not equal,"
along with 0 for the "Test mean" value, with the default confidence
level (95.0). The resulting MINITAB outputs for this problem are
displayed as follows:
Paired T-Test and CI: Before WT, After WT

Paired T for Before WT - After WT

             N    Mean  StDev  SE Mean
Before WT   20   258.2   45.2     10.1
After WT    20   249.9   43.3      9.7
Difference  20   8.400  3.662    0.819

95% CI for mean difference: (6.686, 10.114)
T-Test of mean difference = 0 (vs not = 0): T-Value = 10.26  P-Value = 0.000
The mean difference (i.e., the average weight loss per patient) is 8.4 lbs,
and the 95% confidence interval, (6.686, 10.114), does not contain 0; also,
the p-value is 0 (to three decimal places).

FIGURE 15.8: Box plot of differences between the "Before" and "After" weights,
including a 95% confidence interval for the mean difference, and the hypothesized
H0 point, δ0 = 0
The implication is therefore that, at the significance level of 0.05, we reject
the null hypothesis and conclude that the weight-loss program was effective.
The average weight loss of 8.4 lbs is therefore significantly different from
zero, at the 5% significance level.
A box plot of the differences between the "before" and "after" weights is
shown in Fig 15.8, which displays graphically that the null hypothesis should
be rejected in favor of the alternative. Note how far the hypothesized value
of 0 is from the 95% confidence interval for the mean weight difference.
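
The same paired analysis can be reproduced outside MINITAB; the following is a
minimal sketch in Python using scipy, with the "before" and "after" columns
entered from Table 15.6. Any small discrepancies from the MINITAB output above
would reflect only data-entry or rounding differences.

    # Sketch: paired t-test of Eq (15.72) on the Table 15.6 data
    import numpy as np
    from scipy import stats

    before = np.array([272, 319, 253, 325, 236, 233, 300, 260, 268, 276,
                       215, 245, 248, 364, 301, 203, 197, 217, 210, 223])
    after  = np.array([263, 313, 251, 312, 227, 227, 290, 251, 262, 263,
                       206, 235, 237, 350, 288, 195, 193, 216, 202, 214])

    d = before - after                     # the paired differences, Eq (15.64)
    print(d.mean(), d.std(ddof=1))         # sample mean and SD of differences

    t_stat, p_value = stats.ttest_rel(before, after)   # two-sided by default
    print(f"T = {t_stat:.2f}, p = {p_value:.4f}")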

The next example illustrates the consequences of wrongly employing a
two-sample t-test for this natural paired t-test problem.
Example 15.7: WEIGHT-LOSS DATA ANALYSIS: PART 2:
TWO-SAMPLE T-TEST
Revisit the problem in Example 15.6, but this time treat the "before"
and "after" weight data in Table 15.6 as if they were independent
samples from two different normal populations; carry out a 2-sample
t-test and, at the 5% level, determine whether or not the two sample
means are different.
Solution:
First let us be very clear: this is not the right thing to do; but if a
2-sample t-test is carried out on this data set with the hypotheses as:

$$H_0: \mu_{before} - \mu_{after} = 0$$
$$H_a: \mu_{before} - \mu_{after} \neq 0 \qquad (15.74)$$


MINITAB produces the following result:

Two-Sample T-Test and CI: Before WT, After WT

Two-sample T for Before WT vs After WT

            N   Mean  StDev  SE Mean
Before WT  20  258.2   45.2     10.1
After WT   20  249.9   43.3      9.7

Difference = mu (Before WT) - mu (After WT)
Estimate for difference: 8.4
95% CI for difference: (-20.0, 36.8)
T-Test of difference = 0 (vs not =): T-Value = 0.60  P-Value = 0.552  DF = 38
Both use Pooled StDev = 44.2957
With a t-value of 0.60 and a p-value of 0.552, this analysis indicates
that there is no evidence to support rejecting the null hypothesis at the
significance level of 0.05. The estimated difference of the means is 8.4
(the same as the mean of the differences obtained in Example 15.6);
but because of the large pooled standard deviation, the 95% confidence
interval is (-20.0, 36.8), which includes 0. As a result, the null hypothesis
cannot be rejected at the 5% significance level in favor of the alternative.
This, of course, will be the wrong decision (as the previous example has
shown) and should serve as a warning against using the two-sample
t-test improperly for paired data.

It is important to understand the sources of the failure in this last
example. First, a box plot of the two data sets, shown in Fig 15.9, graphically
illustrates why the two-sample t-test is entirely unable to detect the very
real, and very significant, difference between the "before" and "after"
weights. The variability within the samples is so high that it swamps out the
difference between each pair, which is actually significant. But the most
important reason is illustrated in Fig 15.10, which shows a plot of the
"before" and "after" weights for each patient versus patient number, from
which it is absolutely clear that the two sets of weights are almost perfectly
correlated. Paired data are often not independent. Observe from the data
(and from this graph) that, without exception, every single "before" weight
is higher than the corresponding "after" weight. The issue is therefore not
whether there is a weight loss; it is a question of how much. For this group
of patients, however, this difference cannot be detected in the midst of the
large amount of variability within each group ("before" or "after").
These are the primary reasons that the two-sample t-test failed miserably
in identifying a differential that is quite significant. (As an exercise, the
reader should obtain a scatter plot of the "before" weight versus the "after"
weight to provide further graphical evidence of just how correlated the two
weights are.)



FIGURE 15.9: Box plot of the "Before" and "After" weights, including individual
data means. Notice the wide range of each data set


FIGURE 15.10: A plot of the "Before" and "After" weights for each patient. Note
how one data sequence is almost perfectly correlated with the other; in addition,
note the relatively large variability intrinsic in each data set compared to the
difference between each point

15.5 Determining β, Power, and Sample Size

Determining β, the Type II error risk, and hence (1 − β), the power of any
hypothesis test, depends on whether the test is one- or two-sided. The same
is also true of the complementary problem: the determination of the
experimental sample size required to achieve a certain pre-specified power.
We begin our discussion of such issues with the one-sided test, specifically
the upper-tailed test, with the null hypothesis as in Eq (15.16) and the
alternative in Eq (15.18). The results for the lower-tailed and two-sided
tests, which follow similarly, will be given without detailed derivations.

15.5.1 β and Power

To determine β (and hence power) for the upper-tailed test, it is not
sufficient merely to state that μ > μ0; instead, one must specify a particular
value for the alternative mean, say μa, so that:

$$H_a: \mu = \mu_a > \mu_0 \qquad (15.75)$$

is the alternative hypothesis. The Type II error risk is therefore the
probability of failing to reject the null hypothesis when, in truth, the data
came from the alternative distribution with mean μa (where, for the
upper-tailed test, μa > μ0).
The difference between this alternative and the postulated null hypothesis
distribution mean,

$$\delta = \mu_a - \mu_0 \qquad (15.76)$$

is the margin by which the null hypothesis is falsified in comparison to the
alternative. As one might expect, the magnitude of δ will be a factor in how
easy or difficult it is for the test to detect, amidst all the variability in the
data, a difference between H0 and Ha, and therefore correctly reject H0 when
it is false. (Equivalently, the magnitude of δ will also factor into the risk of
incorrectly failing to reject H0 in favor of a true Ha.)
As shown earlier, if H0 is true, then the distribution of the sample mean,
X̄, is N(μ0, σ²/n), so that the test statistic, Z, in Eq (15.20), possesses a
standard normal distribution; i.e.,

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1) \qquad (15.77)$$

However, if Ha is true, then in fact the more appropriate distribution for X̄
is N(μa, σ²/n). And now, because E(X̄) = μa under these circumstances, not
μ0 as postulated, the most important implication is that the distributional
characteristics of the computed Z statistic, instead of following the standard



FIGURE 15.11: Null and alternative hypothesis distributions for an upper-tailed
test based on n = 25 observations, with population standard deviation σ = 4, where
the true alternative mean, μa, exceeds the hypothesized one by δ = 2.0. The figure
shows a z-shift of (δ√n)/σ = 2.5; and, with reference to H0, the critical value
z0.05 = 1.65. The area under the H0 curve to the right of the point z = 1.65 is
α = 0.05, the significance level; the area under the dashed Ha curve to the left of
the point z = 1.65 is β

normal distribution, will be:

$$Z \sim N\left(\frac{\delta\sqrt{n}}{\sigma},\ 1\right) \qquad (15.78)$$

i.e., the standard normal distribution shifted to the right (for this
upper-tailed test) by a factor of (δ√n)/σ. Thus, as a result of a true
differential, δ, between the alternative and null hypothesized means, the
standardized alternative distribution will show a z-shift

$$z_{shift} = \frac{\delta\sqrt{n}}{\sigma} \qquad (15.79)$$

For example, for a test based on 25 observations, with population standard
deviation σ = 4, where the true alternative mean, μa, exceeds the
hypothesized one by δ = 2.0, the mean value of the standardized alternative
distribution, following Eq (15.78), will be 2.5, and the two distributions will
be as shown in Fig 15.11, with the alternative hypothesis distribution shown
with the dashed line.
In terms of the standard normal variate, z, under H0, the shifted variate
under the alternative hypothesis, Ha, is:

$$z^* = z - \frac{\delta\sqrt{n}}{\sigma} \qquad (15.80)$$


And now, to compute β, we recall that, by definition,

$$\beta = P(z < z_\alpha | H_a) \qquad (15.81)$$

which, by virtue of the z-shift, translates to:

$$\beta = P\left(z < z_\alpha - \frac{\delta\sqrt{n}}{\sigma}\right), \qquad (15.82)$$

from where we obtain the expression for the power of the test as:

$$(1 - \beta) = 1 - P\left(z < z_\alpha - \frac{\delta\sqrt{n}}{\sigma}\right) \qquad (15.83)$$
Thus, for the illustrative example test given above, based on 25
observations, with σ = 4 and μa − μ0 = δ = 2.0, the β-risk and power are
obtained as

$$\beta = P(z < 1.65 - 2.5) = 0.198$$
$$\text{Power} = (1 - \beta) = 0.802 \qquad (15.84)$$

as shown in Fig 15.12.
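
This arithmetic is easily verified numerically; the following is a short Python
check of Eqs (15.82)-(15.84) under the stated test characteristics.

    # Sketch: beta-risk and power for the upper-tailed z-test of Fig 15.11
    from math import sqrt
    from scipy.stats import norm

    alpha, sigma, delta, n = 0.05, 4.0, 2.0, 25
    z_alpha = norm.ppf(1 - alpha)         # 1.645 (the text rounds to 1.65)
    z_shift = delta * sqrt(n) / sigma     # 2.5
    beta = norm.cdf(z_alpha - z_shift)    # Eq (15.82): ~0.196 (0.198 with 1.65)
    print(f"beta = {beta:.3f}, power = {1 - beta:.3f}")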

15.5.2 Sample Size

In the same way in which zα was defined earlier, let zβ be the standard
normal variate such that

$$P(z > z_\beta) = \beta \qquad (15.85)$$

so that, by symmetry,

$$P(z < -z_\beta) = \beta \qquad (15.86)$$

Then, from Eqs (15.82) and (15.86) we obtain:

$$-z_\beta = z_\alpha - \frac{\delta\sqrt{n}}{\sigma} \qquad (15.87)$$

which rearranges to give the important expression,

$$z_\alpha + z_\beta = \frac{\delta\sqrt{n}}{\sigma} \qquad (15.88)$$

which relates the α- and β-risk variates to the three hypothesis test
characteristics: δ, the hypothesized mean shift to be detected by the test (the
"signal"); σ, the population standard deviation, a measure of the variability
inherent in the data (the "noise"); and finally, n, the number of experimental
observations to be used to carry out the hypothesis test (the sample size).
(Note that these three terms comprise what we earlier referred to as the
z-shift, the precise amount by which the standardized Ha distribution has
been shifted away from the H0 distribution: see Fig 15.11.)



FIGURE 15.12: β and power values for the hypothesis test of Fig 15.11, with Ha:
N(2.5, 1). Top: β; Bottom: Power = (1 − β)


This relationship, fundamental to power and sample size analyses, can
also be derived in terms of the unscaled critical value, x̄C, which marks the
boundary of the rejection region for the unscaled sample mean.
Observe that, by definition of the significance level, α, the critical value,
and the Z statistic,

$$z_\alpha = \frac{\bar{x}_C - \mu_0}{\sigma/\sqrt{n}} \qquad (15.89)$$

so that:

$$\bar{x}_C = \frac{z_\alpha \sigma}{\sqrt{n}} + \mu_0 \qquad (15.90)$$

By definition of β, under Ha,

$$\beta = P\left(z < \frac{\bar{x}_C - \mu_a}{\sigma/\sqrt{n}}\right) \qquad (15.91)$$

and from the definition of the zβ variate in Eq (15.86), we obtain:

$$-z_\beta = \frac{\bar{x}_C - \mu_a}{\sigma/\sqrt{n}} \qquad (15.92)$$

and upon substituting Eq (15.90) in for x̄C, and recalling that μa − μ0 = δ,
Eq (15.92) immediately reduces to

$$-z_\beta = z_\alpha - \frac{\delta\sqrt{n}}{\sigma}, \;\text{ or}$$
$$z_\alpha + z_\beta = \frac{\delta\sqrt{n}}{\sigma} \qquad (15.93)$$

as obtained earlier from the standardized distributions.


Several important characteristics of hypothesis tests are embedded in this
important expression that are worth drawing out explicitly; but first, a
general statement regarding z-variates and risks. Observe that any tail area,
α, decreases as |zα| increases; similarly, the tail area, β, increases as |zβ|
decreases. We may thus note the following about Eq (15.93):
1. The equation shows that for any particular hypothesis test with fixed
characteristics δ, σ, and n, there is a conservation of the sum of the α-
and β-risk variates; if zα increases, zβ must decrease by a commensurate
amount, and vice versa.
2. Consequently, if, in order to reduce the α-risk, zα is increased, zβ will
decrease commensurately to maintain the left-hand side sum constant,
with the result that the β-risk must automatically increase. The reverse
is also true: increasing zβ for the purpose of reducing the β-risk will
result in zα decreasing to match the increase in zβ, so that the α-risk
will then increase. Therefore, for a fixed set of test characteristics, the
associated Type I and Type II risks are such that a reduction in one risk
will result in an increase in the other in mutual fashion.


3. The only way to reduce both risks simultaneously (which will require
increasing the total sum of the risk variates) is by increasing the z-shift.
This is achievable most directly by increasing n, the sample size, since
neither σ, the population standard deviation, nor δ, the hypothesized
mean shift to be detected by the test, is usually under the direct control
of the experimenter.

This last point leads directly to the issue of determining how many
experimental samples are required to attain a certain power, given basic test
characteristics. This question is answered by solving Eq (15.88) explicitly for
n to obtain:

$$n = \left[\frac{(z_\alpha + z_\beta)\sigma}{\delta}\right]^2 \qquad (15.94)$$

Thus, by specifying the desired α- and β-risks along with the test
characteristics, δ, the hypothesized mean shift to be detected by the test, and
σ, the population standard deviation, one can use Eq (15.94) to determine
the sample size required to achieve the desired risk levels. In particular, it
is customary to specify the risks as α = 0.05 and β = 0.10, in which case
zα = z0.05 = 1.645 and zβ = z0.10 = 1.28. Eq (15.94) then reduces to:

$$n = \left(\frac{2.925\sigma}{\delta}\right)^2 \qquad (15.95)$$

from which, given δ and σ, one can determine n.
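
A direct computational version of Eq (15.94) is sketched below in Python; it
rounds the result up to the next integer, as a required sample size must.

    # Sketch: sample size for a one-tailed z-test, from Eq (15.94)
    import math
    from scipy.stats import norm

    def sample_size(alpha, beta, delta, sigma):
        """Smallest n giving the specified alpha- and beta-risks."""
        z_a, z_b = norm.ppf(1 - alpha), norm.ppf(1 - beta)
        return math.ceil(((z_a + z_b) * sigma / delta) ** 2)

    print(sample_size(0.05, 0.10, delta=2.0, sigma=4.0))   # 35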


Example 15.8: SAMPLE SIZE REQUIRED TO IMPROVE
POWER OF HYPOTHESIS TEST
The upper-tailed hypothesis test illustrated in Fig 15.11 was shown in
Eq (15.84) to have a power of 0.802 (equivalent to a β-risk of 0.198). It
is based on a sample size of n = 25 observations, population standard
deviation σ = 4, and where the true alternative mean μa exceeds the
hypothesized one by δ = 2.0. Determine the sample size required to
improve the power from 0.802 to the customary 0.9.
Solution:
Upon substituting σ = 4 and δ = 2 into Eq (15.95), we immediately
obtain n = 34.2, which should be rounded up to the nearest integer to
yield 35. This is the required sample size, an increase of 10 additional
observations. To compute the actual power obtained with n = 35 (since
it is technically different from the precise, but impractical, n = 34.2
obtained from Eq (15.95)), we introduce n = 35 into Eq (15.94) and
obtain the corresponding zβ as 1.308; from here we may obtain β from
MINITAB's cumulative probability feature as β = 0.095, and hence

$$\text{Power} = 1 - \beta = 0.905 \qquad (15.96)$$

is the actual power.


Practical Considerations

In practice, prior to performing the actual hypothesis test, no one knows
whether or not Ha is true compared to H0, much less the precise amount by
which μa will exceed the postulated μ0 if Ha turns out to be true. The
implication, therefore, is that δ is never known in an objective fashion
a priori. In determining the power of a hypothesis test, δ is treated not as
known, but as a design parameter: the minimum difference we would like to
detect, if such a difference exists. Thus, δ is to be considered properly as the
magnitude of the smallest difference we wish to detect with the hypothesis
test.
In a somewhat related vein, the population standard deviation, σ, is
rarely known a priori in many practical cases. Under these circumstances,
it has often been recommended to use educated guesses, or results from
prior experiments under similar circumstances, to provide pragmatic
surrogates for σ. We strongly recommend an alternative approach: casting
the problem in terms of the signal-to-noise ratio (SNR):

$$S_N = \frac{\delta}{\sigma} \qquad (15.97)$$

a ratio of the magnitude of the "signal" (the difference in the means to be
detected) to the intrinsic "noise" (the population standard deviation) in the
midst of which the signal is to be detected. In this case, the general equation
(15.94) and the more specific Eq (15.95) become:

$$n = \left(\frac{z_\alpha + z_\beta}{S_N}\right)^2; \quad n = \left(\frac{2.925}{S_N}\right)^2 \qquad (15.98)$$

Without necessarily knowing either δ or σ independently, the experimenter
then makes a sample-size decision by designing for a test to handle a design
SNR.
Example 15.9: SAMPLE SIZE TABLE FOR VARIOUS
SIGNAL-TO-NOISE RATIOS: POWER OF 0.9
Obtain a table of sample sizes required to achieve a power of 0.9 for
various signal-to-noise ratios from 0.3 to 1.5.
Solution:
Table 15.8 is generated from Eq (15.98) for the indicated values of the
signal-to-noise ratio, where n+ is the value of the computed n rounded
up to the nearest integer. As expected, as the signal-to-noise ratio
improves, the sample size required to achieve a power of 0.9 reduces;
fewer data points are required to detect signals that are large relative
to the standard deviation.


TABLE 15.8: Sample size n required to achieve a power of 0.9 for various
values of the signal-to-noise ratio, SN

SN    0.3    0.4    0.5    0.6    0.7    0.8    0.9   1.0   1.2   1.5
n   95.06  53.47  34.22  23.77  17.46  13.37  10.56  8.56  5.94  3.80
n+     96     54     35     24     18     14     11     9     6     4

Note in particular that for the example considered in Fig 15.11 and
Example 15.8, SN = 2/4 = 0.5; from Table 15.8, the required sample size,
35, is precisely as obtained in Example 15.8.
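
The entries of Table 15.8 can be regenerated with a few lines of Python from
Eq (15.98), using the value 2.925 appropriate for α = 0.05 and β = 0.10.

    # Sketch: sample sizes for various signal-to-noise ratios, Eq (15.98)
    import math

    for snr in (0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.2, 1.5):
        n = (2.925 / snr) ** 2
        print(f"SNR = {snr:.1f}: n = {n:6.2f}, n+ = {math.ceil(n)}")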

15.5.3 β and Power for Lower-Tailed and Two-Sided Tests

For the sake of clarity, the preceding discussion was specifically restricted
to the upper-tailed test. Now that we have presented and illustrated the
essential concepts, it is relatively straightforward to extend them to other
types of tests without having to repeat the details.
First, because the sampling distribution for the test statistic employed for
these hypothesis tests is symmetric, it is easy to see that with the lower-tailed
alternative

$$H_a: \mu = \mu_a < \mu_0 \qquad (15.99)$$

this time, with

$$\delta = \mu_0 - \mu_a, \qquad (15.100)$$

the β risk is obtained as:

$$\beta = P\left(z > -z_\alpha + \frac{\delta\sqrt{n}}{\sigma}\right) \qquad (15.101)$$

the equivalent of Eq (15.82), from where the power is obtained as (1 − β).


Again, because of symmetry, it is easy to see that the expression for determining sample size is precisely the same as derived earlier for the upper tailed
test: i.e.,
2

(z + z )
n=

All other results therefore follow.


For the two-tailed test, things are somewhat different, of course, but the
same principles apply. The β risk is determined from:

$$\beta = P\left(z < z_{\alpha/2} - \frac{\delta\sqrt{n}}{\sigma}\right) - P\left(z < -z_{\alpha/2} - \frac{\delta\sqrt{n}}{\sigma}\right) \qquad (15.102)$$

because of the two-sided rejection region. Unfortunately, as a result of the
additional term in this equation, there is no closed-form solution for n that
is the equivalent of Eq (15.94). When $P\left(z < -z_{\alpha/2} - \frac{\delta\sqrt{n}}{\sigma}\right) \approx 0$, the
approximation,

$$n \approx \left[\frac{(z_{\alpha/2} + z_\beta)\sigma}{\delta}\right]^2 \qquad (15.103)$$

is usually good enough. Of course, given the test characteristics, computer
programs can solve for n precisely in Eq (15.102) without the need to resort
to the approximation shown here.

15.5.4 General Power and Sample Size Considerations

For general power and sample size considerations, it is typical to start by
specifying α and σ; as a result, in either Eq (15.94) for one-tailed tests, or
Eq (15.103) for the two-sided test, this leaves three parameters to be
determined: δ, n, and zβ (equivalently, the power). By specifying any two, a
value for the third unspecified parameter that is consistent with the given
information can be computed from these equations.
In MINITAB, the sequence required for carrying out this procedure is:
Stat > Power and Sample Size, which produces a drop-down menu containing
a collection of hypothesis tests (and experimental designs; see later). Upon
selecting the hypothesis test of interest, a dialog box opens, with the
instruction to "Specify values for any two of the following," with three
appropriately labeled spaces for "Sample size(s)," "Difference(s)," and
"Power value(s)." The Options button is used to specify the alternative
hypothesis and the α-risk value. The value of the unspecified third parameter
is then computed by MINITAB.
The following example illustrates this procedure.
Example 15.10: POWER AND SAMPLE SIZE
DETERMINATION USING MINITAB
Use MINITAB to compute power and sample size for an upper-tailed,
one-sample z-test, with σ = 4, designed to detect a difference of 2, at
the significance level of α = 0.05: (1) if n = 25, determine the resulting
power; (2) when the power is desired to be 0.9, determine the required
sample size; (3) with a sample size of n = 35, determine the minimum
difference that can be detected with a power of 0.9.
Solution:
(1) Upon entering the given parameters into the appropriate boxes in
the MINITAB dialog box, and upon choosing the appropriate
alternative hypothesis, the MINITAB result is shown below:
Power and Sample Size

1-Sample Z Test
Testing mean = null (versus > null)
Calculating power for mean = null + difference
Alpha = 0.05  Assumed standard deviation = 4

             Sample
Difference     Size     Power
         2       25  0.803765
This computed power value is what we had obtained earlier.
(2) When the power is specified and the sample size removed, the
MINITAB result is:

Power and Sample Size

1-Sample Z Test
Testing mean = null (versus > null)
Calculating power for mean = null + difference
Alpha = 0.05  Assumed standard deviation = 4

             Sample  Target
Difference     Size   Power  Actual Power
         2       35     0.9      0.905440

This is exactly the same sample size value and the same actual power
value we had obtained earlier.
(3) With n specified as 35 and the difference unspecified, the
MINITAB result is:

Power and Sample Size

1-Sample Z Test
Testing mean = null (versus > null)
Calculating power for mean = null + difference
Alpha = 0.05  Assumed standard deviation = 4

Sample  Target
  Size   Power  Difference
    35     0.9     1.97861

The implication is that any difference greater than 1.98 can be detected
at the desired power. A difference of 2.0 is therefore detectable at a
power that is at least 0.9.
These results are all consistent with what we had obtained earlier.

15.6 Concerning Variances of Normal Populations

The discussions up until now have focused exclusively on hypothesis tests
concerning the means of normal populations. But if we recall, for example,
the earlier statements made regarding, say, the yield of process A, that
YA ~ N(75.5, 1.5²), we see that in this statement is a companion assertion
about the associated variance. To confirm or refute this statement completely
requires testing the validity of the assertion about the variance also.
There are two classes of tests concerning variances of normal populations:
the first concerns testing the variance obtained from a sample against a
postulated population variance (as is the case here with YA); the second
concerns testing two (independent) normal populations for equality of their
variances. We shall now deal with each case.

TABLE 15.9: Summary of H0 rejection conditions for the χ²-test

  Testing Against     Reject H0 if (for general α):
  Ha: σ² < σ0²        c² < χ²_{1−α}(n − 1)
  Ha: σ² > σ0²        c² > χ²_α(n − 1)
  Ha: σ² ≠ σ0²        c² < χ²_{1−α/2}(n − 1) or c² > χ²_{α/2}(n − 1)

15.6.1 Single Variance

When the variance of a sample is to be tested against a postulated value,
σ0², the null hypothesis is:

$$H_0: \sigma^2 = \sigma_0^2 \qquad (15.104)$$

Under the assumption that the sample in question came from a normal
population, the test statistic:

$$C^2 = \frac{(n-1)S^2}{\sigma_0^2} \qquad (15.105)$$

has a χ²(n − 1) distribution if H0 is true. As a result, this test is known as
a "chi-squared test"; the rejection criteria for the usual triplet of alternatives
are shown in Table 15.9. The reader should note the lack of symmetry in the
boundaries of these rejection regions when compared with the symmetric
boundaries for the corresponding z- and t-tests. This, of course, is a
consequence of the asymmetry of the χ²(n − 1) distribution. For example,
for one-sided tests based on 10 samples from a normal distribution, the null
hypothesis distribution for C² is shown in Fig 15.13.
FIGURE 15.13: Rejection regions for one-sided tests of a single variance of a
normal population, at a significance level of α = 0.05, based on n = 10 samples.
The distribution is χ²(9); Top: for Ha: σ² > σ0², indicating rejection of H0 if
c² > χ²_α(9) = 16.9; Bottom: for Ha: σ² < σ0², indicating rejection of H0 if
c² < χ²_{1−α}(9) = 3.33

The next example is used to illustrate a two-sided test.

Example 15.11: VARIANCE OF PROCESS A YIELD
Formulate and test an appropriate hypothesis, at the significance level
of 0.05, regarding the variance of the yield obtainable from process A,
implied by the assertion that the sample data presented in Chapter 1


for YA is from a normal population with the distribution N(75.5, 1.5²).


Solution:
The hypothesis to be tested is that σ²_A = 1.5², against the alternative
that it is not; i.e.:

$$H_0: \sigma_A^2 = 1.5^2$$
$$H_a: \sigma_A^2 \neq 1.5^2 \qquad (15.106)$$

The sample variance computed from the supplied data is s²_A = 2.05, so
that the specific value for the χ² test statistic is:

$$c^2 = \frac{49 \times 2.05}{2.25} = 44.63 \qquad (15.107)$$

The rejection region for this two-sided test, with α = 0.05, is shown
in Fig 15.14, for a χ²(49) distribution. The boundaries of the rejection
region are obtained from the usual cumulative probabilities; the left
boundary is obtained by finding χ²_{1−α/2} such that

$$P(c^2 > \chi^2_{1-\alpha/2}(49)) = 0.975 \;\text{ or }\; P(c^2 < \chi^2_{1-\alpha/2}(49)) = 0.025; \;\text{ i.e., }\; \chi^2_{1-\alpha/2} = 31.6 \qquad (15.108)$$

and the right boundary from:

$$P(c^2 > \chi^2_{\alpha/2}(49)) = 0.025 \;\text{ or }\; P(c^2 < \chi^2_{\alpha/2}(49)) = 0.975; \;\text{ i.e., }\; \chi^2_{\alpha/2} = 70.2 \qquad (15.109)$$

Since the value for c² above does not fall into this rejection region, we
do not reject the null hypothesis.
As before, MINITAB could be used directly to carry out this test.
The self-explanatory procedure follows along the same lines as those
discussed extensively above.
The conclusion: at the 5% significance level, we cannot reject the
null hypothesis concerning σ²_A.

FIGURE 15.14: Rejection regions for the two-sided test concerning the variance of
the process A yield data, H0: σ²_A = 1.5², based on n = 50 samples, at a significance
level of α = 0.05. The distribution is χ²(49), with the rejection region shaded;
because the test statistic, c² = 44.63, falls outside of the rejection region, we do not
reject H0.
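
The boundary values and test statistic in this example are easily verified; the
following is a short Python sketch using the chi-squared distribution in scipy,
assuming the sample variance s² = 2.05 quoted above.

    # Sketch: two-sided chi-squared test of a single variance, Eq (15.105)
    from scipy.stats import chi2

    n, s2, sigma0_sq = 50, 2.05, 1.5**2
    c2 = (n - 1) * s2 / sigma0_sq            # ~44.6
    lo = chi2.ppf(0.025, df=n - 1)           # left boundary, ~31.6
    hi = chi2.ppf(0.975, df=n - 1)           # right boundary, ~70.2
    reject = (c2 < lo) or (c2 > hi)
    print(f"c2 = {c2:.2f}, reject H0: {reject}")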

15.6.2 Two Variances

When two variances from mutually independent normal populations are
to be compared, the null hypothesis is:

$$H_0: \sigma_1^2 = \sigma_2^2 \qquad (15.110)$$

If the samples (of sizes n1 and n2, respectively) come from independent
normal distributions, then the test statistic:

$$F = \frac{S_1^2}{S_2^2} \qquad (15.111)$$



has an F(ν1, ν2) distribution, where ν1 = (n1 − 1) and ν2 = (n2 − 1), if
H0 is true. Such tests are therefore known as F-tests. As with other tests,
the rejection regions are determined from the F-distribution with the
appropriate degrees-of-freedom pairs, on the basis of the desired significance
level, α. These are shown in Table 15.10.
It is often helpful in carrying out F-tests to recall the following property
of the F-distribution:

$$F_{1-\alpha}(\nu_1, \nu_2) = \frac{1}{F_\alpha(\nu_2, \nu_1)} \qquad (15.112)$$

an easy enough relationship to prove directly from the definition of the
F-statistic in Eq (15.111).

TABLE 15.10: Summary of H0 rejection conditions for the F-test

  Testing Against     Reject H0 if (for general α):
  Ha: σ1² < σ2²       f < F_{1−α}(ν1, ν2)
  Ha: σ1² > σ2²       f > F_α(ν1, ν2)
  Ha: σ1² ≠ σ2²       f < F_{1−α/2}(ν1, ν2) or f > F_{α/2}(ν1, ν2)


This relationship makes it possible to reduce the number of entries in
old-fashioned F-tables. As we have repeatedly advocated in this chapter,
however, it is most advisable to use computer programs for carrying out
such tests.
Example 15.12: COMPARING VARIANCES OF YIELDS
FROM PROCESSES A AND B
From the data supplied in Chapter 1 on the yields obtained from the
two chemical processes A and B, test a hypothesis on the potential
equality of these variances, at the 5% significance level.
Solution:
The hypothesis to be tested is that σ²_A = σ²_B, against the alternative
that it is not; i.e.:

$$H_0: \sigma_A^2 = \sigma_B^2$$
$$H_a: \sigma_A^2 \neq \sigma_B^2 \qquad (15.113)$$

From the supplied data, we obtain s²_A = 2.05 and s²_B = 7.62, so that
the specific value for the F-test statistic is obtained as:

$$f = \frac{2.05}{7.62} = 0.27 \qquad (15.114)$$

The rejection region for this two-sided F-test, with α = 0.05, is shown
in Fig 15.15, for an F(49, 49) distribution, with boundaries at f = 0.567
to the left and 1.76 to the right, obtained as usual from cumulative
probabilities. (Note that the value of f at one boundary is the reciprocal
of the value at the other boundary.) Since the specific test value, 0.27,
falls in the left side of the rejection region, we must therefore reject the
null hypothesis in favor of the alternative that these two variances are
unequal.
The self-explanatory procedure for carrying out the test in MINITAB
generates results that include a p-value of 0.000, in agreement with the
conclusion above to reject the null hypothesis at the 5% significance
level.
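
A numerical check of this F-test is sketched below in Python, assuming the
sample variances quoted in the example.

    # Sketch: two-sided F-test for equality of two variances, Eq (15.111)
    from scipy.stats import f

    nA = nB = 50
    f_stat = 2.05 / 7.62                     # ~0.27
    lo = f.ppf(0.025, nA - 1, nB - 1)        # ~0.567
    hi = f.ppf(0.975, nA - 1, nB - 1)        # ~1.76 (the reciprocal of lo)
    p_value = 2 * f.cdf(f_stat, nA - 1, nB - 1)   # two-sided, since f < 1
    print(f"f = {f_stat:.3f}, reject H0: {f_stat < lo or f_stat > hi}, "
          f"p = {p_value:.5f}")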

The F-test is particularly useful for ascertaining whether or not the
assumption of equality of variances is valid before performing a two-sample
t-test. If the null hypothesis regarding the equality assumption is rejected,
then one must not use the "equal variance" option of the test. If one is
unable to reject the null hypothesis, one may proceed to use the "equal
variance" option. As discussed in subsequent chapters, the F-test is also at
the heart of ANOVA (ANalysis Of VAriance), a methodology that is central
to much of statistical design of experiments and the systematic analysis of
the resulting data, statistical tests involving several means, and even
regression analysis.
Finally, we note that the F-test is quite sensitive to the normality
assumption: if this assumption is invalid, the test results will be unreliable.



FIGURE 15.15: Rejection regions for the two-sided test of the equality of the
variances of the process A and process B yield data, i.e., H0: σ²_A = σ²_B, at a
significance level of α = 0.05, based on n = 50 samples each. The distribution is
F(49, 49), with the rejection region shaded; since the test statistic, f = 0.27, falls
within the rejection region to the left, we reject H0 in favor of Ha.

Note that the assumption of normality is not about the mean of the data,
but about the raw data set itself. One must therefore be careful to ensure
that this normality assumption is reasonable before carrying out an F-test.
If the data are from non-normal distributions, most computer programs
provide alternatives (based on non-parametric methods).

15.7 Concerning Proportions

As noted at the beginning of this chapter, a statistical hypothesis, in the
most fundamental sense, is an assertion or statement about one or more
populations; and the hypothesis test provides an objective means of
ascertaining the truth or falsity of such a statement. So far, our discussions
have centered essentially around normal populations, because a vast majority
of practical problems are of this form, or can be safely approximated as such.
However, not all problems of practical importance involve sampling from
normal populations; the next section will broach this topic from a more
general perspective. For now, we want to consider first a particularly
important class of problems involving sampling from a non-Gaussian
population: hypotheses concerning proportions.
The general theoretical characteristics of problems of this kind were stud-


ied extensively in Chapter 8. Out of a total number of n samples examined
for a particular attribute, X is the total number of (discrete) observations
sharing the attribute in question; X/n is therefore the observed sample
proportion sharing the attribute. Theoretically, the random variable, X, is
known to follow the binomial distribution, characterized by the parameter p,
the theoretical population proportion sharing the attribute (also known as
the "probability of success"). Statements about such proportions are
therefore statistical hypotheses concerning samples from binomial
populations. Market/opinion surveys (such as the example used to open
Chapter 14), where the proportion preferring a certain brand is of interest,
and manufacturing processes, where the concern is the proportion of
defective products, provide the prototypical examples of problems of this
nature. Hypotheses about the probability of successful embryo implantation
in in-vitro fertilization (discussed in Chapter 7), or any other such binomial
process probability, also fall into this category.
We deal first with hypotheses concerning single population proportions,
and then hypotheses concerning two proportions. The underlying principles
remain the same as with other tests: find the appropriate test statistic and
its sampling distribution, and, given a specific significance level, use these
to make probabilistic statements that will allow the determination of the
appropriate rejection region.

15.7.1 Single Population Proportion

The problem of interest involves testing a hypothesis concerning a single
binomial population proportion, p, given a sample of n items from which one
observes X "successes" (the same as the detection of the attribute in
question); the null hypothesis is:

$$H_0: p = p_0 \qquad (15.115)$$

with p0 as the specific value postulated for the population proportion. The
usual three possible alternative hypotheses are:

$$H_a: p < p_0 \qquad (15.116)$$
$$H_a: p > p_0 \qquad (15.117)$$

and

$$H_a: p \neq p_0 \qquad (15.118)$$
To determine an appropriate test statistic and its sampling distribution,
we need to recall several characteristics of the binomial random variable
from Chapter 8. First, the estimator, P̂, defined as:

$$\hat{P} = \frac{X}{n}, \qquad (15.119)$$

the mean number of "successes," is unbiased for the binomial population
parameter; the mean of the sampling distribution for P̂ is therefore p.


TABLE 15.11: Summary of H0 rejection conditions for the single-proportion
z-test

  Testing Against    Reject H0 if (general α):       For α = 0.05:
  Ha: p < p0         z < −z_α                        z < −1.65
  Ha: p > p0         z > z_α                         z > 1.65
  Ha: p ≠ p0         z < −z_{α/2} or z > z_{α/2}     z < −1.96 or z > 1.96

Next, the variance of P̂ is σ²_X/n², where

$$\sigma_X^2 = npq = np(1 - p) \qquad (15.120)$$

is the variance of the binomial random variable, X. Hence,

$$\sigma_{\hat{P}}^2 = \frac{p(1 - p)}{n} \qquad (15.121)$$

Large Sample Approximations

From the Central Limit Theorem we know that, in the limit as n → ∞, the
sampling distribution of the mean of any population (including the binomial)
tends to the normal distribution. The implication is that the statistic, Z,
defined as:

$$Z = \frac{\frac{X}{n} - p}{\sqrt{p(1-p)/n}} \qquad (15.122)$$

has an approximate standard normal, N(0, 1), distribution for large n.
The test statistic for carrying out the hypothesis test in Eq (15.115)
versus any of the three alternatives is therefore:

$$Z = \frac{\hat{P} - p_0}{\sqrt{p_0(1-p_0)/n}} \qquad (15.123)$$

a test statistic with precisely the same properties as those used for the
standard z-test. The rejection conditions are identical to those shown in
Table 15.2, which, when modified appropriately for the one-proportion test,
is as shown in Table 15.11.
Since this test is predicated upon the sample being sufficiently large, it
is important to ensure that this is indeed the case. A generally agreed-upon
objective criterion for ascertaining the validity of this approximation is that
the interval

$$I_0 = p_0 \pm 3\sqrt{p_0(1-p_0)/n} \qquad (15.124)$$

does not include 0 or 1. The next example illustrates these concepts.


Example 15.13: EXAM TYPE PREFERENCE OF
UNDERGRADUATE CHEMICAL ENGINEERING STUDENTS
In the opening sections of Chapter 14, we reported the result of an
opinion poll of 100 undergraduate chemical engineering students in
the United States: 75 of the students prefer "closed-book" exams to
"opened-book" ones. At the 5% significance level, test the hypothesis
that the true proportion preferring "closed-book" exams is in fact 0.8,
against the alternative that it is not.
Solution:
If the sample size is confirmed to be large enough, then this is a
single-proportion test which employs the z-statistic. The interval
p0 ± 3√[p0(1 − p0)/n] in this case is 0.8 ± 0.12, or (0.68, 0.92), which
does not include 0 or 1; the sample size is therefore considered to be
sufficiently large.
The hypothesis to be tested is therefore the two-sided

$$H_0: p = 0.8$$
$$H_a: p \neq 0.8; \qquad (15.125)$$

the z-statistic in this case is:

$$z = \frac{0.75 - 0.8}{\sqrt{(0.8 \times 0.2)/100}} = -1.25 \qquad (15.126)$$

Since this value does not lie in the two-sided rejection region for
α = 0.05, we do not reject the null hypothesis.

MINITAB could be used to tackle this example problem directly. The
self-explanatory sequence (when one chooses the "use test and interval
based on normal distribution" option) produces the following result:

Test and CI for One Proportion

Test of p = 0.8 vs p not = 0.8

Sample   X    N  Sample p        95% CI              Z-Value  P-Value
1       75  100  0.750000  (0.665131, 0.834869)        -1.25    0.211

Using the normal approximation.

As with similar tests discussed earlier, we see here that the 95% confidence
interval for the parameter, p, contains the postulated p0 = 0.8; the associated
p-value for the test (an unfortunate and unavoidable notational clumsiness
that we trust will not confuse the reader unduly¹) is 0.211, so that we do not
reject H0 at the 5% significance level.
¹The latter "p" of the p-value should not be confused with the binomial
probability-of-success parameter.


Exact Tests

Even though it is customary to invoke the normal approximation in
dealing with tests for single proportions, this is in fact not necessary. The
reason is quite simple: if X ~ Bi(n, p), then P̂ = X/n has a (scaled) binomial
distribution. This fact can be used to compute the probability of observing
any particular value of P̂, providing the means for determining the
boundaries of the various rejection regions (given desired tail area
probabilities), just as with the standard normal distribution, or any other
standardized test distribution. Computer programs such as MINITAB
provide options for obtaining exact p-values for the single-proportion test
that are based on exact binomial distributions.
When MINITAB is used to carry out the test in Example 15.13 above,
this time without invoking the normal approximation option, the result is as
follows:

Test and CI for One Proportion

Test of p = 0.8 vs p not = 0.8

                                                 Exact
Sample   X    N  Sample p        95% CI        P-Value
1       75  100  0.750000  (0.653448, 0.831220)  0.260

The 95% confidence interval, which is now based on a binomial
distribution, not a normal approximation, is slightly different; the p-value is
also now slightly different, but the conclusion remains the same.
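
The same exact test is available in Python; the sketch below assumes scipy
version 1.7 or later, which provides scipy.stats.binomtest and an exact
(Clopper-Pearson) confidence interval.

    # Sketch: exact binomial test of H0: p = 0.8 from Example 15.13
    from scipy.stats import binomtest

    result = binomtest(k=75, n=100, p=0.8, alternative='two-sided')
    print(f"exact p-value = {result.pvalue:.3f}")        # ~0.260
    print(result.proportion_ci(confidence_level=0.95))   # exact 95% CI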

15.7.2 Two Population Proportions

In comparing two population proportions, p1 and p2, as with the
two-sample tests of means from normal populations, the null hypothesis is:

$$H_0: p_1 - p_2 = \delta_0 \qquad (15.127)$$

where P̂1 = X1/n1 and P̂2 = X2/n2 are, respectively, the random
proportions of "successes" obtained from population 1 and population 2,
based on samples of respective sizes n1 and n2. For example, P̂1 could be the
fraction of defective chips in a sample of n1 chips manufactured at one
facility whose true proportion of defectives is p1, while P̂2 is the defective
fraction contained in a sample from a different facility. The difference
between the two population proportions is postulated as some value δ0 that
need not be zero.
As usual, the hypothesis is to be tested against the possible alternatives:

$$\text{Lower-tailed:} \quad H_a: p_1 - p_2 < \delta_0 \qquad (15.128)$$
$$\text{Upper-tailed:} \quad H_a: p_1 - p_2 > \delta_0 \qquad (15.129)$$
$$\text{Two-tailed:} \quad H_a: p_1 - p_2 \neq \delta_0 \qquad (15.130)$$

As before, δ0 = 0 constitutes a test of equality of the two proportions.


To obtain an appropriate test statistic and its sampling distribution, we
begin by defining:

$$D = \hat{P}_1 - \hat{P}_2 \qquad (15.131)$$

We know in general that

$$E(D) = \mu_D = p_1 - p_2 \qquad (15.132)$$

$$\sigma_D = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} \qquad (15.133)$$

But now, if the sample sizes n1 and n2 are large, then it can be shown that

$$D \sim N(\mu_D, \sigma_D^2) \qquad (15.134)$$

again allowing us to invoke the normal approximation (for large sample
sizes). This immediately implies that the following is an appropriate test
statistic to use for this two-proportion test:

$$Z = \frac{(\hat{P}_1 - \hat{P}_2) - \delta_0}{\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}} \sim N(0, 1) \qquad (15.135)$$

Since population values, p1 and p2, are seldom available in practice, it is
customary to substitute the sample estimates,

$$\hat{p}_1 = \frac{x_1}{n_1}; \;\text{ and }\; \hat{p}_2 = \frac{x_2}{n_2} \qquad (15.136)$$

Finally, since this test statistic possesses a standard normal distribution, the
rejection regions are precisely the same as those in Table 15.4.
In the special case when δ0 = 0, which is equivalent to a test of equality of
the proportions, the most important consequence is that if the null hypothesis
is true, then p1 = p2 = p, which is then estimated by the pooled proportion:

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2} \qquad (15.137)$$

As a result, the standard deviation of the difference in proportions, σ_D,
becomes:

$$\sigma_D = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} = \sqrt{pq\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \qquad (15.138)$$

so that the test statistic in Eq (15.135) is modified to

$$Z = \frac{(\hat{P}_1 - \hat{P}_2)}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \sim N(0, 1) \qquad (15.139)$$

The rejection regions are the same as in the general case.

Example 15.14: REGIONAL PREFERENCE FOR "PEPSI"
To confirm persistent rumors that the preference for "PEPSI" on
engineering college campuses is higher in the Northeast of the United
States than on comparable campuses in the Southeast, a survey was
carried out on 125 engineering students chosen at random on the MIT
campus in Cambridge, MA, and the same number of engineering
students selected at random at Georgia Tech in Atlanta, GA. Each
student was asked to indicate a preference for "PEPSI" versus other
soft drinks, with the following results: 44 of the 125 at MIT indicated
a preference for "PEPSI," versus 26 at GA Tech. At the 5% level,
determine whether the "Northeast" proportion, p̂1 = 0.352, is
essentially the same as the "Southeast" proportion, p̂2 = 0.208, against
the alternative that they are different.
Solution:
The hypotheses to be tested are:

$$H_0: p_1 - p_2 = 0$$
$$H_a: p_1 - p_2 \neq 0 \qquad (15.140)$$

and from the given data, the test statistic computed from Eq (15.139)
is z = 2.54. Since this number is greater than 1.96, and therefore lies in
the rejection region of the two-sided test, we reject the null hypothesis
in favor of the alternative. Using MINITAB to carry out this test,
selecting the "use pooled estimate of p for test" option, produces the
following result:

Test and CI for Two Proportions

Sample   X    N  Sample p
1       44  125  0.352000
2       26  125  0.208000

Difference = p (1) - p (2)
Estimate for difference: 0.144
95% CI for difference: (0.0341256, 0.253874)
Test for difference = 0 (vs not = 0): Z = 2.54  P-Value = 0.011

Note that the 95% confidence interval around the estimated difference
of 0.144 does not include zero; the p-value associated with the test
is 0.011, which is less than 0.05; hence, we reject the null hypothesis at
the 5% significance level.

As an exercise, the reader should extend this example by testing δ0 = 0.02
against the alternative that the difference is greater than 0.02.
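
For reference, the pooled two-proportion z-test of Eq (15.139) is simple enough
to compute directly; the following Python sketch reproduces the z-value of
Example 15.14.

    # Sketch: pooled two-proportion z-test, Eqs (15.137)-(15.139)
    from math import sqrt
    from scipy.stats import norm

    x1, n1, x2, n2 = 44, 125, 26, 125
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                      # Eq (15.137)
    se = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))    # Eq (15.138)
    z = (p1_hat - p2_hat) / se                          # Eq (15.139): ~2.54
    p_value = 2 * (1 - norm.cdf(abs(z)))                # two-sided: ~0.011
    print(f"z = {z:.2f}, p = {p_value:.3f}")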

15.8 Concerning Non-Gaussian Populations

The discussion in the previous section has opened up the issue of testing
hypotheses about non-Gaussian populations, and has provided a strategy for
handling such problems in general. The central issue is finding an appropriate
test statistic and its sampling distribution, as was done for the binomial
distribution. This cause is advanced greatly by the relationship between
interval estimates and hypothesis tests (discussed earlier in Section 15.3.3)
and by the discussion at the end of Chapter 14 on interval estimates for
non-Gaussian distributions.

15.8.1 Large Sample Test for Means

First, if the statistical hypothesis is about the mean of a non-Gaussian
population, so long as the sample size, n, used to compute the sample
average, X̄, is reasonably large (e.g., n > 30 or so), then, regardless of the
underlying distribution, we know that the statistic Z = (X̄ − μ)/σ_X̄
possesses an approximate standard normal distribution, an approximation
that improves as n → ∞. Thus, hypotheses about the means of non-Gaussian
populations that are based on large sample sizes are essentially the same as
z-tests.
Example 15.15: HYPOTHESIS TEST ON MEAN OF
INCLUSIONS DATA
If the data in Table 1.2 are considered a random sample of 60
observations of the number of inclusions found on glass sheets produced
in the manufacturing process discussed in Chapter 1, test, at the 5%
significance level, the hypothesis that this data came from a Poisson
population with mean λ = 1, against the alternative that λ is not 1.
Solution:
The hypotheses to be tested are:

$$H_0: \lambda = 1$$
$$H_a: \lambda \neq 1 \qquad (15.141)$$

While the data is from a Poisson population, the sample size is large;
hence, the test statistic:

$$Z = \frac{\bar{X} - \lambda_0}{\sigma/\sqrt{60}} \qquad (15.142)$$

where σ is the standard deviation of the raw data (so that σ/√60 is the
standard deviation of the sample average), essentially has a standard
normal distribution.
From the supplied data, we obtain the sample average λ̂ = x̄ = 1.02,
with the sample standard deviation, s = 1.1, which, because of the large
sample, will be considered to be a reasonable approximation of σ. The
test statistic is therefore obtained as z = 0.141. Since this value is not in
the two-sided rejection region |z| > 1.96 for α = 0.05, we do not reject
the null hypothesis. We therefore conclude that there is no evidence to
contradict the statement that X ~ P(1), i.e., the inclusions data is from
a Poisson population with mean number of inclusions λ = 1.
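
The computation in this example is a one-line z-test; the following Python
sketch verifies it, using the sample summaries quoted in the solution.

    # Sketch: large-sample z-test on the mean of the inclusions data
    from math import sqrt
    from scipy.stats import norm

    xbar, s, n, lam0 = 1.02, 1.1, 60, 1.0
    z = (xbar - lam0) / (s / sqrt(n))        # ~0.141
    p_value = 2 * (1 - norm.cdf(abs(z)))     # ~0.89: cannot reject H0
    print(f"z = {z:.3f}, p = {p_value:.2f}")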

It is now important to recall the results in Example 14.13 where the 95% confidence interval estimate for the mean of the inclusions data was obtained as:
$$\lambda = 1.02 \pm 1.96(1.1/\sqrt{60}) = 1.02 \pm 0.28 \qquad (15.143)$$
i.e., $0.74 < \lambda < 1.30$. Note that this interval contains the hypothesized value $\lambda = 1.0$, indicating that we cannot reject the null hypothesis.

We can now use this result to answer the following question raised in Chapter 1 as a result of the potentially disturbing data obtained from the quality control lab apparently indicating too many glass sheets with too many inclusions: if the process was designed to produce glass sheets with a mean number of inclusions $\lambda = 1$ per m², is there evidence in this sample data that the process has changed, that the number of observed inclusions is significantly different from what one can reasonably expect from the process when operating as designed?

From the results of this example, the answer is, No: at the 5% significance level, there is no evidence that the process has deviated from its design target.
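A quick computational check of Example 15.15 requires only the summary statistics quoted above; the Python sketch below reproduces the test (it does not use the raw Table 1.2 data, which are not reproduced here).

```python
import numpy as np
from scipy import stats

# Summary statistics from Example 15.15
n, xbar, s, lambda0 = 60, 1.02, 1.1, 1.0

z = (xbar - lambda0) / (s / np.sqrt(n))  # Eq (15.142), with s approximating sigma
p_value = 2 * stats.norm.sf(abs(z))      # two-sided tail area

print(f"z = {z:.3f}, p-value = {p_value:.3f}")  # z = 0.141, well inside |z| < 1.96
```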

15.8.2 Small Sample Tests

When the sample size on which the sample average is based is small, or when we are dealing with aspects of the population other than the mean (say, the variance), we are left with only one option: go back to first principles, derive the sampling distribution for the appropriate statistic, and use it to carry out the required test. One can use the sampling distribution to determine $100\alpha\%$ rejection regions, or the complementary region, the $(1-\alpha)100\%$ confidence interval estimates for the appropriate parameter.

For tests involving single parameters, it makes no difference which of these two approaches we choose; for tests involving two parameters, however, it is more straightforward to compute confidence intervals for the parameters in question and then use these for the hypothesis test. The reason is that for tests involving two parameters, confidence intervals can be computed directly from the individual sampling distributions; on the other hand, computing rejection regions for the difference between these two parameters technically requires an additional step of deriving yet another sampling distribution for the difference. And the sampling distribution of the difference between two random variables may not always be easy to derive. Having discussed earlier in this chapter the equivalence between confidence intervals and hypothesis tests, we now note that for non-Gaussian problems, one might as well just base the hypothesis tests on $(1-\alpha)100\%$ confidence intervals and avoid the additional hassle of having to derive distributions for differences. Let us illustrate this concept with a problem involving the exponential random variable discussed in Chapter 14.
In Example 14.3, we presented a problem involving an exponential random variable, the waiting time (in days) until the occurrence of a recordable safety incident in a certain company's manufacturing site. The safety performance data for the first and second years were presented, from which point estimates of the unknown population parameter, $\beta$, were determined from the sample averages, $\bar{x}_1 = 30.1$ days for Year 1 and $\bar{x}_2 = 32.9$ days for Year 2; the sample size in each case is $n = 10$, which is considered small.

To test the two-sided hypothesis that these two safety performance parameters (Year 1 versus Year 2) are the same, versus the alternative that they are significantly different (at the 5% significance level), we proceed as follows: we first obtain the sampling distribution for $\bar{X}_1$ and $\bar{X}_2$ given that $X \sim \mathcal{E}(\beta)$; we then use these to obtain 95% confidence interval estimates for the population means $\beta_i$ for Year $i$; if these intervals overlap, then at the 5% significance level, we cannot reject the null hypothesis that these means are the same; if the intervals do not overlap, we reject the null hypothesis.
Much of this, of course, was already accomplished in Example 14.14: we showed that $\bar{X}_i/\beta_i$ has a gamma distribution, more specifically, $\bar{X}_i/\beta_i \sim \gamma(n, 1/n)$, from where we obtain 95% confidence interval estimates for $\beta_i$ from sample data. In particular, for $n = 10$, we obtained from the Gamma(10, 0.1) distribution that:
$$P\left(0.48 < \frac{\bar{X}}{\beta} < 1.71\right) = 0.95 \qquad (15.144)$$
which, upon introducing $\bar{x}_1 = 30.1$ and $\bar{x}_2 = 32.9$, produces, upon careful rearrangement, the 95% confidence interval estimates for the Year 1 and Year 2 parameters respectively as:
$$17.6 < \beta_1 < 62.71 \qquad (15.145)$$
$$19.24 < \beta_2 < 68.54 \qquad (15.146)$$

These intervals may now be used to answer a wide array of questions regarding hypotheses concerning two parameters, even questions concerning a single parameter. For instance,

1. For the two-parameter null hypothesis, $H_0: \beta_1 = \beta_2$, versus $H_a: \beta_1 \neq \beta_2$, because the 95% confidence intervals overlap considerably, we find no evidence to reject $H_0$ at the 5% significance level.

2. In addition, the single parameter null hypothesis, $H_0: \beta_1 = 40$, versus $H_a: \beta_1 \neq 40$, cannot be rejected at the 5% significance level because the postulated value is contained in the 95% confidence interval for $\beta_1$; on the contrary, the null hypothesis $H_0: \beta_1 = 15$, versus $H_a: \beta_1 \neq 15$, will be rejected at the 5% significance level because the hypothesized value falls outside of the 95% confidence interval (i.e., falls in the rejection region).


3. Similarly, the null hypothesis $H_0: \beta_2 = 40$, versus $H_a: \beta_2 \neq 40$, cannot be rejected at the 5% significance level because the postulated value is contained in the 95% confidence interval for $\beta_2$; on the other hand, the null hypothesis $H_0: \beta_2 = 17$, versus $H_a: \beta_2 \neq 17$, will be rejected at the 5% significance level because the hypothesized value falls outside of the 95% confidence interval (i.e., falls in the rejection region).
The principles illustrated here can be applied to any non-Gaussian population provided the sampling distribution of the statistic in question can be
determined.
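As a computational note, the interval in Eq (15.144) and the estimates in Eqs (15.145) and (15.146) can be reproduced with a few lines of Python; the sketch below simply inverts the quantiles of the $\gamma(n, 1/n)$ sampling distribution.

```python
from scipy import stats

def exponential_mean_ci(xbar, n, alpha=0.05):
    """CI for the exponential mean beta, using Xbar/beta ~ Gamma(shape=n, scale=1/n)."""
    q_lo = stats.gamma.ppf(alpha / 2, a=n, scale=1.0 / n)      # ~0.48 for n = 10
    q_hi = stats.gamma.ppf(1 - alpha / 2, a=n, scale=1.0 / n)  # ~1.71 for n = 10
    return xbar / q_hi, xbar / q_lo  # q_lo < xbar/beta < q_hi  =>  these bounds on beta

print(exponential_mean_ci(30.1, 10))   # Year 1: approximately (17.6, 62.8)
print(exponential_mean_ci(32.9, 10))   # Year 2: approximately (19.3, 68.6)
```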
Another technique for dealing with populations characterized by any general pdf (Gaussian or not), and based on the maximum likelihood principle
discussed in Chapter 14 for estimating unknown population parameters, is
discussed next in its own separate section.

15.9 Likelihood Ratio Tests

In its broadest sense, a likelihood ratio (LR) test is a technique for assessing how well a simpler, restricted version of a probability model compares to its more complex, unrestricted version in explaining observed data. Within the context of this current chapter, however, the discussion here will be limited to testing hypotheses about the parameters, $\theta$, of a population characterized by the pdf $f(x, \theta)$. Even though based on fundamentally different premises, some of the most popular tests considered above (the z- and t-tests, for example) are equivalent to LR tests under recognizable conditions.

15.9.1 General Principles

Let $X$ be a random variable with the pdf $f(x, \theta)$, where the population parameter vector $\theta \in \Theta$; i.e., $\Theta$ represents the set of possible values that the parameter vector $\theta$ can take. Given a random sample, $X_1, X_2, \ldots, X_n$, estimation theory, as discussed in Chapter 14, is concerned with using such sample information to determine reasonable estimates for $\theta$. In particular, we recall that the maximum likelihood (ML) principle requires choosing the estimate, $\hat{\theta}_{ML}$, as the value of $\theta$ that maximizes the likelihood function:
$$L(\theta) = f_1(x_1, \theta)f_2(x_2, \theta)\cdots f_n(x_n, \theta) \qquad (15.147)$$
the joint pdf of the random sample, treated as a function of the unknown population parameter.

The same random sample and the same ML principle can be used to test the null hypothesis
$$H_0: \theta \in \Theta_0 \qquad (15.148)$$


stated in a more general fashion in which $\theta$ is restricted to a certain range of values, $\Theta_0$ (a subset of $\Theta$), over which $H_0$ is hypothesized to be valid. For example, to test a hypothesis about the mean of $X$ by postulating that $X \sim N(75, 1.5^2)$, in this current context, $\Theta$, the full set of possible parameter values, is defined as follows:
$$\Theta = \{(\theta_1, \theta_2) : -\infty < \theta_1 = \mu < \infty;\ \theta_2 = \sigma^2 = 1.5^2\} \qquad (15.149)$$
since the variance is given and the only unknown parameter is the mean; $\Theta_0$, the restricted parameter set range over which $H_0$ is conjectured to be valid, is defined as:
$$\Theta_0 = \{(\theta_1, \theta_2) : \theta_1 = \mu_0 = 75;\ \theta_2 = \sigma^2 = 1.5^2\} \qquad (15.150)$$

The null hypothesis in Eq (15.148) is to be tested against the alternative:
$$H_a: \theta \in \Theta_a \qquad (15.151)$$
again stated in a general fashion in which the parameter set, $\Theta_a$, is (a) disjoint from $\Theta_0$, and (b) also complementary to it, in the sense that
$$\Theta = \Theta_0 \cup \Theta_a \qquad (15.152)$$
For example, the two-sided alternative to the hypothesis above regarding $X \sim N(75, 1.5^2)$ translates to:
$$\Theta_a = \{(\theta_1, \theta_2) : \theta_1 = \mu \neq \mu_0 = 75;\ \theta_2 = \sigma^2 = 1.5^2\} \qquad (15.153)$$
Note that the union of this set with $\Theta_0$ in Eq (15.150) is the full parameter set range, $\Theta$, in Eq (15.149).
Now, define the largest likelihood under $H_0$ as
$$L^*(\Theta_0) = \max_{\theta \in \Theta_0} L(\theta) \qquad (15.154)$$
and the unrestricted maximum likelihood value as:
$$L^*(\Theta) = \max_{\theta \in \Theta_0 \cup \Theta_a} L(\theta) \qquad (15.155)$$
Then the ratio:
$$\Lambda = \frac{L^*(\Theta_0)}{L^*(\Theta)} \qquad (15.156)$$
is known as the likelihood ratio; it possesses some characteristics that make it attractive for carrying out general hypothesis tests. But first, we note that by definition, $L^*(\Theta)$ is the maximum value achieved by the likelihood function when $\theta = \hat{\theta}_{ML}$. Also, $\Lambda$ is a random variable (it depends on the random sample, $X_1, X_2, \ldots, X_n$); this is why it is sometimes called the likelihood ratio test statistic. When specific data values, $x_1, x_2, \ldots, x_n$, are introduced into Eq (15.156), the result is a specific value, $\lambda$, for the likelihood ratio such that $0 \le \lambda \le 1$, for the following reasons:


1. $\lambda \ge 0$: This is because each likelihood function contributing to the ratio is a pdf (joint pdfs, but pdfs nonetheless), and each legitimate pdf is such that $f(x, \theta) > 0$;

2. $\lambda \le 1$: This is because $\Theta_0 \subset \Theta$; consequently, since $L^*(\Theta)$ is the largest achievable value of the likelihood function in the entire unrestricted set $\Theta$, the largest likelihood value achieved in the subset $\Theta_0$, $L^*(\Theta_0)$, will be less than, or at best equal to, $L^*(\Theta)$.

Thus, $\Lambda$ is a random variable defined on the unit interval (0,1) whose pdf, $f(\lambda|\Theta_0)$ (determined by $f(x, \theta)$), can be used, in principle, to test $H_0$ in Eq (15.148) versus $H_a$ in Eq (15.151). It should not come as a surprise that, in general, the form of $f(\lambda|\Theta_0)$ can be quite complicated. However, there are certain general principles regarding the use of $\lambda$ for hypothesis testing:
1. If a specific sample $x_1, x_2, \ldots, x_n$ generates a value of $\lambda$ close to zero, the implication is that the observation is highly unlikely to have occurred had $H_0$ been true relative to the alternative;

2. Conversely, if $\lambda$ is close to 1, then the likelihood of the observed data, $x_1, x_2, \ldots, x_n$, occurring if $H_0$ is true is just about as high as the unrestricted likelihood that $\theta$ can take any value in the entire unrestricted parameter space $\Theta$;

3. Thus, small values of $\lambda$ provide evidence against the validity of $H_0$; larger values provide evidence in support.

How small $\lambda$ has to be to trigger rejection of $H_0$ is formally determined in the usual fashion: using the distribution for $\Lambda$, the pdf $f(\lambda|\Theta_0)$, obtain a critical value, $\lambda_c$, such that $P(\Lambda < \lambda_c) = \alpha$, i.e.,
$$P(\Lambda < \lambda_c) = \int_0^{\lambda_c} f(\lambda|\Theta_0)\,d\lambda = \alpha \qquad (15.157)$$
Any value of $\lambda$ less than this critical value will trigger rejection of $H_0$.
Likelihood ratio tests are very general; they can be used even for cases involving structurally different $H_0$ and $H_a$ probability distributions, or for random variables that are correlated. While the form of the pdf for $\Lambda$ that is appropriate for each case may be quite complicated, in general, it is always possible to perform the required computations numerically using computer programs. Nevertheless, there are many special cases for which closed-form analytical expressions can be derived directly either for $f(\lambda|\Theta_0)$, the pdf of $\Lambda$ itself, or else for the pdf of a monotonic function of $\Lambda$. See Pottmann et al., (2005)², for an application of the likelihood ratio test to an industrial sensor data analysis problem.

²Pottmann, M., B. A. Ogunnaike, and J. S. Schwaber, (2005). "Development and Implementation of a High-Performance Sensor System for an Industrial Polymer Reactor," Ind. Eng. Chem. Res., 44, 2606-2620.
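To make the recipe in Eqs (15.154)-(15.156) concrete, the sketch below evaluates $\lambda$ numerically for an exponential sample and the point null $H_0: \beta = \beta_0$. The data are simulated, and the routine is illustrative rather than a general-purpose implementation; the unrestricted maximization is done numerically even though, for this model, the MLE is simply $\bar{x}$.

```python
import numpy as np
from scipy import optimize

def exponential_lr(x, beta0):
    """Likelihood ratio lambda = L*(Theta0)/L*(Theta) for X ~ E(beta),
    testing H0: beta = beta0 (Theta0 is the single point {beta0})."""
    x = np.asarray(x)
    n = len(x)

    def neg_loglik(beta):               # -ln L(beta) for the exponential pdf
        return n * np.log(beta) + x.sum() / beta

    loglik_restricted = -neg_loglik(beta0)
    res = optimize.minimize_scalar(neg_loglik, bounds=(1e-8, 1e8), method="bounded")
    loglik_unrestricted = -res.fun      # attained at the MLE, beta_hat = xbar
    return np.exp(loglik_restricted - loglik_unrestricted)

rng = np.random.default_rng(0)
x = rng.exponential(scale=30.0, size=10)   # simulated waiting-time sample
lam = exponential_lr(x, beta0=30.0)
print(lam, -2 * np.log(lam))               # small lambda => evidence against H0
```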

15.9.2 Special Cases

Normal Population; Known Variance
Consider first the case where a random variable $X \sim N(\mu, \sigma^2)$ has known variance, but an unknown mean; and let $X_1, X_2, \ldots, X_n$ be a random sample from this population. From a specific sample data set, $x_1, x_2, \ldots, x_n$, we wish to test $H_0: \mu = \mu_0$ against the alternative, $H_a: \mu \neq \mu_0$.

Observe that in this case, with $\theta = (\theta_1, \theta_2) = (\mu, \sigma^2)$, the parameter spaces of interest are:
$$\Theta_0 = \{(\theta_1, \theta_2) : \theta_1 = \mu_0;\ \theta_2 = \sigma^2\} \qquad (15.158)$$
and,
$$\Theta = \Theta_0 \cup \Theta_a = \{(\theta_1, \theta_2) : -\infty < \theta_1 = \mu < \infty;\ \theta_2 = \sigma^2\} \qquad (15.159)$$

Since $f(x, \theta)$ is Gaussian, the likelihood function, given the data, is
$$L(\mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[\frac{-(x_i - \mu)^2}{2\sigma^2}\right] = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[\frac{-\sum_{i=1}^{n}(x_i - \mu)^2}{2\sigma^2}\right] \qquad (15.160)$$

This function is maximized (when $\sigma^2$ is known) by the maximum likelihood estimator for $\mu$, the sample average, $\bar{X}$; thus, the unrestricted maximum value, $L^*(\Theta)$, is obtained by introducing $\bar{X}$ for $\mu$ in Eq (15.160); i.e.,
$$L^*(\Theta) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[\frac{-\sum_{i=1}^{n}(x_i - \bar{X})^2}{2\sigma^2}\right] \qquad (15.161)$$

On the other hand, the likelihood function, restricted to $\Theta_0$ (i.e., $\mu = \mu_0$), is obtained by introducing $\mu_0$ for $\mu$ in Eq (15.160). Because, in terms of $\mu$, this function is now a constant, its maximum (in terms of $\mu$) is given by:
$$L^*(\Theta_0) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[\frac{-\sum_{i=1}^{n}(x_i - \mu_0)^2}{2\sigma^2}\right] \qquad (15.162)$$

From here, the likelihood ratio statistic is obtained as:
$$\Lambda = \frac{\exp\left[\frac{-\sum_{i=1}^{n}(x_i - \mu_0)^2}{2\sigma^2}\right]}{\exp\left[\frac{-\sum_{i=1}^{n}(x_i - \bar{X})^2}{2\sigma^2}\right]} \qquad (15.163)$$
Upon rewriting $(x_i - \mu_0)^2$ as $[(x_i - \bar{X}) + (\bar{X} - \mu_0)]^2$, so that:
$$\sum_{i=1}^{n}(x_i - \mu_0)^2 = \sum_{i=1}^{n}(x_i - \bar{X})^2 + n(\bar{X} - \mu_0)^2 \qquad (15.164)$$


and upon further simplification, the result is:
$$\Lambda = \exp\left[\frac{-n(\bar{X} - \mu_0)^2}{2\sigma^2}\right] \qquad (15.165)$$

To proceed from here, we need the pdf for the random variable, $\Lambda$; but rather than confront this challenge directly, we observe that:
$$-2\ln\Lambda = \left(\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\right)^2 = Z^2 \qquad (15.166)$$
where $Z$, of course, is the familiar z-test statistic
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \qquad (15.167)$$
with a standard normal distribution, $N(0, 1)$. Thus the random variable, $\Lambda^* = -2\ln\Lambda$, therefore has a $\chi^2(1)$ distribution. From here it is now a straightforward exercise to obtain the rejection region in terms of not $\Lambda$, but $\Lambda^* = -2\ln\Lambda$ (or $Z^2$). For a significance level of $\alpha = 0.05$, we obtain from tail area probabilities of the $\chi^2(1)$ distribution that
$$P(Z^2 \ge 3.84) = 0.05 \qquad (15.168)$$

so that the null hypothesis is rejected when:
$$\frac{n(\bar{X} - \mu_0)^2}{\sigma^2} > 3.84 \qquad (15.169)$$
Upon taking square roots, being careful to retain both positive as well as negative values, we obtain the familiar rejection conditions for the z-test:
$$\frac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma} < -1.96 \quad \text{or} \quad \frac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma} > 1.96 \qquad (15.170)$$
The LR test under these conditions is therefore exactly the same as the z-test.
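The algebraic equivalence in Eqs (15.165)-(15.167) is easily verified numerically, as in the short sketch below with an arbitrary simulated sample.

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu0, sigma = 25, 75.0, 1.5
x = rng.normal(loc=75.5, scale=sigma, size=n)  # simulated sample (arbitrary choice)
xbar = x.mean()

lam = np.exp(-n * (xbar - mu0) ** 2 / (2 * sigma ** 2))  # Eq (15.165)
z = (xbar - mu0) / (sigma / np.sqrt(n))                  # Eq (15.167)
print(-2 * np.log(lam), z ** 2)                          # identical, per Eq (15.166)
```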

Normal Population; Unknown Variance

When the population variance is unknown for the test discussed above, some things change slightly. First, the parameter spaces become:
$$\Theta_0 = \{(\theta_1, \theta_2) : \theta_1 = \mu_0;\ \theta_2 = \sigma^2 > 0\} \qquad (15.171)$$
along with,
$$\Theta = \Theta_0 \cup \Theta_a = \{(\theta_1, \theta_2) : -\infty < \theta_1 = \mu < \infty;\ \theta_2 = \sigma^2 > 0\} \qquad (15.172)$$


The likelihood function remains the same:
$$L(\mu, \sigma) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[\frac{-\sum_{i=1}^{n}(x_i - \mu)^2}{2\sigma^2}\right]$$
but this time both parameters are unknown, even though the hypothesis test is on $\mu$ alone. As a result, the function is maximized by the maximum likelihood estimators for both $\mu$ and $\sigma^2$: respectively, the sample average, $\bar{X}$, and,
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{X})^2}{n}$$
as obtained in Chapter 14.

The unrestricted maximum value, $L^*(\Theta)$, in this case is obtained by introducing these ML estimators for the respective unknown parameters in Eq (15.160) and rearranging to obtain:
$$L^*(\Theta) = \left[\frac{n}{2\pi\sum_{i=1}^{n}(x_i - \bar{X})^2}\right]^{n/2} e^{-n/2} \qquad (15.173)$$
When the parameters are restricted to $\Theta_0$, this time, the likelihood function is maximized, after substituting $\mu = \mu_0$, by the MLE for $\sigma^2$, so that the largest likelihood value is obtained as:
$$L^*(\Theta_0) = \left[\frac{n}{2\pi\sum_{i=1}^{n}(x_i - \mu_0)^2}\right]^{n/2} e^{-n/2} \qquad (15.174)$$
Thus, the likelihood ratio statistic becomes:
$$\Lambda = \left[\frac{\sum_{i=1}^{n}(x_i - \bar{X})^2}{\sum_{i=1}^{n}(x_i - \mu_0)^2}\right]^{n/2} \qquad (15.175)$$

And upon employing the sum-of-squares identity in Eq (15.164), and simplifying, we obtain:
$$\Lambda = \left[\frac{1}{1 + \frac{n(\bar{X} - \mu_0)^2}{\sum_{i=1}^{n}(x_i - \bar{X})^2}}\right]^{n/2} \qquad (15.176)$$
If we now introduce the sample variance $S^2 = \sum_{i=1}^{n}(x_i - \bar{X})^2/(n-1)$, this expression is easily rearranged to obtain:
$$\Lambda = \left[\frac{1}{1 + \frac{1}{n-1}\frac{n(\bar{X} - \mu_0)^2}{S^2}}\right]^{n/2} \qquad (15.177)$$
As before, to proceed from here, we need to obtain the pdf for the random variable, $\Lambda$. However, once again, we recognize a familiar statistic embedded in Eq (15.177), i.e.,
$$T^2 = \left(\frac{\bar{X} - \mu_0}{S/\sqrt{n}}\right)^2 \qquad (15.178)$$


where $T$ has the Student's t-distribution with $\nu = n - 1$ degrees of freedom. The implication therefore is that:
$$\Lambda^{2/n} = \frac{1}{1 + T^2/\nu} \qquad (15.179)$$
From here we observe that because $\Lambda^{2/n}$ (and hence $\Lambda$) is a strictly monotonically decreasing function of $T^2$ in Eq (15.179), then the rejection region $\lambda < \lambda_c$, for which say $P(\Lambda < \lambda_c) = \alpha$, is exactly equivalent to a rejection region $T^2 > t_c^2$, for which,
$$P(T^2 > t_c^2) = \alpha \qquad (15.180)$$
Once more, upon taking square roots, retaining both positive as well as negative values, we obtain the familiar rejection conditions for the t-test:
$$\frac{(\bar{X} - \mu_0)}{S/\sqrt{n}} < -t_{\alpha/2}(\nu) \quad \text{or} \quad \frac{(\bar{X} - \mu_0)}{S/\sqrt{n}} > t_{\alpha/2}(\nu) \qquad (15.181)$$
which, of course, is the one-sample, two-sided t-test for a normal population with unknown variance.

Similar results can be obtained for tests concerning the variance of a single normal population (yielding the $\chi^2$-test) or concerning two variances from independent normal populations, yielding the F-test.

The point, however, is that having shown that the LR tests in these well-known special cases reduce to tests with which we are already familiar, we have the confidence that in the more complicated cases, where the population pdfs are non-Gaussian and closed-form expressions for $\Lambda$ cannot be obtained as easily, the results (mostly determined numerically) can be trusted.
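Here too the reduction can be checked numerically; the sketch below verifies Eq (15.179) on a simulated sample.

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu0 = 15, 75.0
x = rng.normal(loc=75.5, scale=1.5, size=n)   # simulated sample (arbitrary choice)
xbar, s2 = x.mean(), x.var(ddof=1)

lam = (np.sum((x - xbar) ** 2) / np.sum((x - mu0) ** 2)) ** (n / 2)  # Eq (15.175)
t = (xbar - mu0) / np.sqrt(s2 / n)                                   # the t statistic
print(lam ** (2 / n), 1 / (1 + t ** 2 / (n - 1)))                    # equal, per Eq (15.179)
```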

15.9.3 Asymptotic Distribution for $\Lambda$

As noted repeatedly above, it is often impossible to obtain closed-form pdfs for the likelihood ratio test statistic, $\Lambda$, or for appropriate functions thereof. Nevertheless, for large samples, there exists an asymptotic distribution:

Asymptotic Distribution Result for LR Test Statistic: The distribution of the random variable $\Lambda^* = -2\ln\Lambda$ tends asymptotically to a $\chi^2(\nu)$ distribution with $\nu$ degrees of freedom, with $\nu = N_p(\Theta) - N_p(\Theta_0)$, where $N_p(\cdot)$ is the number of independent parameters in the parameter space in question; i.e., the number of parameters in $\Theta$ exceeds those in $\Theta_0$ by $\nu$.

Observe, for example, that the distribution of $\Lambda^* = -2\ln\Lambda$ in the first special case (Gaussian distribution with known variance) is exactly $\chi^2(1)$: $\Theta$ contains one unknown parameter, $\mu$, while $\Theta_0$ contains no unknown parameter since $\mu = \mu_0$.


This asymptotic result is exactly equivalent to the large sample approximation to the sampling distribution of means of arbitrary populations. Note that in the second special case (Gaussian distribution with unknown variance), $\Theta$ contains two unknown parameters, $\mu$ and $\sigma^2$, while $\Theta_0$ contains only one unknown parameter, $\sigma^2$. The asymptotic distribution of $\Lambda^* = -2\ln\Lambda$ will then also be $\chi^2(1)$, in precisely the same sense in which $t(\nu) \to N(0, 1)$.
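A small Monte Carlo experiment illustrates the asymptotic result. For the exponential point null $H_0: \beta = \beta_0$, the likelihood ratio has the closed form $\lambda = (\bar{x}/\beta_0)^n e^{n - n\bar{x}/\beta_0}$; the sketch below simulates samples under $H_0$ and compares the empirical 95th percentile of $-2\ln\lambda$ with $\chi^2_{0.95}(1) = 3.84$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, beta0, reps = 200, 4.0, 5000

# Sample means of exponential(beta0) samples, generated under H0
xbar = rng.exponential(scale=beta0, size=(reps, n)).mean(axis=1)

# -2 ln(lambda) for lambda = (xbar/beta0)^n * exp(n - n*xbar/beta0)
lam_star = -2.0 * n * (np.log(xbar / beta0) + 1.0 - xbar / beta0)

print(np.quantile(lam_star, 0.95))  # approximately 3.84 for large n
print(stats.chi2.ppf(0.95, df=1))   # 3.8415
```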

15.10 Discussion

This chapter should not end without bringing to the reader's attention some of the criticisms of certain aspects of hypothesis testing. The primary issues have to do not so much with the mathematical foundations of the methodology as with the implementation and interpretation of the results in practice. Of several controversial issues, the following are three we wish to highlight:

1. Point null hypothesis and statistical-versus-practical significance: When the null hypothesis about a population parameter is that $\theta = \theta_0$, where $\theta_0$ is a point on the real line, such a literal mathematical statement can almost always be proven false with computations carried to a sufficient number of decimal places. For example, if $\mu_0 = 75.5$, a large enough sample that generates $\bar{x} = 75.52$ (a routine possibility even when the population parameter is indeed 75.5) will lead to the rejection of $H_0$, to two decimal places. However, in actual practice (engineering or science), is the distinction between two real numbers 75.5 and 75.52 truly of importance? i.e., is the statement $75.5 \neq 75.52$, which is true in the strictest, literal mathematical sense, meaningful in practice? Sometimes yes, sometimes no; but the point is that such null hypotheses can almost always be falsified, raising the question: what then does rejecting $H_0$ really mean?

2. Borderline p-values and variability: Even when the p-value is used to determine whether or not to reject $H_0$, it is still customary to relate the computed p-value to some value of $\alpha$, typically 0.05. But what happens for $p = 0.06$, or $p = 0.04$? Furthermore, an important fact that often goes unnoticed is that were we to repeat the experiment in question, the new data set will almost always lead to results that are different from those obtained earlier; and consequently the new p-value will also be different from that obtained earlier. One cannot therefore rule out the possibility of a borderline p-value "switching sides" purely as a result of intrinsic variability in the data.
3. Probabilistic interpretations: From a more technical perspective, if $\delta_0$ represents the observed discrepancy between the postulated population parameter and the value determined from data (a realization of the random variable, $\Delta$), the p-value (or else the actual significance level of the test) is defined as $P(\Delta \ge \delta_0 | H_0)$; i.e., the probability of observing the computed difference or something more extreme if the null hypothesis is true. In fact, the probability we should be interested in is the reverse: $P(H_0 | \Delta \ge \delta_0)$, i.e., the probability that the null hypothesis is true given the evidence in the data, which truly measures how much the observed data supports the proposed statement of $H_0$. These two conditional probabilities are generally not the same.

In light of these issues (and others we have not discussed here), how should one approach hypothesis testing in practice? First, statistical significance should not be the only factor in drawing conclusions from experimental results; the nature of the problem at hand should be taken into consideration as well. The yield from process A may in fact not be precisely 75.5% (after all, the probability that a random variable will take on a precise value on the real line is exactly zero), but 75.52% is sufficiently close that the difference is of no practical consequence. Secondly, one should be careful in basing the entire decision about experimental results on a single hypothesis test, especially with p-values at the border of the traditional $\alpha = 0.05$. A single statistical hypothesis test of data obtained in a single study is just that: it can hardly be considered as having definitively "confirmed" something. Thirdly, decisions based on confidence intervals around the estimated population parameters tend to be less confusing and are more likely to provide the desired solution more directly.

Finally, the reader should be aware of the existence of other recently proposed alternatives to conventional hypothesis testing, e.g. Jones and Tukey (2000)³, or Killeen (2005)⁴. These techniques are designed to ameliorate some of the problems discussed above, but any discussions on them, even of the most cursory type, lie outside of the intended scope of this chapter. Although not yet as popular as the classical techniques discussed here, they are worth exploring by the curious reader.

³L. V. Jones, and J. W. Tukey, (2000): "A Sensible Formulation of the Significance Test," Psych. Methods, 5(4), 411-414.
⁴P. R. Killeen, (2005): "An Alternative to Null-Hypothesis Significance Tests," Psychol. Sci., 16(5), 345-353.

15.11 Summary and Conclusions

If the heart of statistics is inference (drawing conclusions about populations from information in a sample), then this chapter and Chapter 14 jointly constitute the heart of Part IV of this book. Following the procedures discussed in Chapter 14 for determining population parameters from sample data, we have focussed primarily in this chapter on procedures by which one
makes and tests the validity of assertive statements about these population parameters. Thus, with some perspective, we may now observe the following: in order to characterize a population fully using the information contained in a finite sample drawn from it, (a) the results of Chapter 13 enable us to characterize the variability in the sample, so that (b) the unknown parameters may be estimated with a prescribed degree of confidence using the techniques in Chapter 14; and (c) what these estimated parameters tell us about the true population characteristics is then framed in the form of hypotheses that are subsequently tested using the techniques presented in this chapter. Specifically, the null hypothesis, $H_0$, is stated as the status quo characteristic; this is then tested against an appropriate alternative that we are willing to entertain should there be sufficient evidence in the sample data against the validity of the null hypothesis, each null hypothesis and the specific competing alternative having been jointly designed to answer the specific question of interest. This has been a long chapter, and perhaps justifiably so, considering the sheer number of topics covered; but the key results can be summarized briefly as we have done in Table 15.12, since hypothesis tests can be classified into a relatively small number of categories. There are tests for population means (for single populations or two populations; with population variance known, or not known; with large samples or small); there are also tests for (normal) population variances (single variances or two); and then there are tests for proportions (one or two). In each case, once the appropriate test statistic is determined, with slight variations depending on specific circumstances, the principles are all the same. With fixed significance levels, $\alpha$, the $H_0$ rejection regions are determined and are used straightforwardly to reach conclusions about each test. Alternatively, the p-value (also known as the observed significance level) is easily computed and used to reach conclusions. It bears restating that in carrying out the required computations, not only in this chapter but in the book as a whole, we have consistently advocated the use of computer programs such as MINITAB. These programs are so widely available now that there is practically no need to make reference any longer to old-fashioned statistical tables. As a result, we have left out all but the most cursory references to any statistical tables, and instead included specific illustrations of how to use MINITAB (as an example software package).

The discussion of power and sample size considerations is important, both as a pre-experimentation design tool and as a post-analysis tool for ascertaining just how much stock one can realistically put in the result of a just-concluded test. Sadly, such considerations are usually given short-shrift by most students; this should not be the case. It is also easy to develop the mistaken notion that statistical inference is only concerned with Gaussian populations. Once more, as in Chapter 14, it is true that the general results we have presented have been limited to normal populations. This is due to the stubborn individuality of non-Gaussian distributions and the remarkable


versatility of the Gaussian distribution, both in representing truly Gaussian populations (of course) and as a reasonable approximation to the sampling distribution of the means of most non-Gaussian populations. Nevertheless, the discussion in Section 15.8 and the overview of likelihood ratio tests in Section 15.9 should serve to remind the reader that there is statistical inference life beyond samples from normal populations. A few of the exercises and application problems at the end of the chapter also buttress this point.

There is a sense in which the completion of this chapter can justifiably be considered as a pivotal point in the journey that began with the illustrative examples of Chapter 1. These problems, posed long ago in that introductory chapter, have now been fully solved in this chapter; and, in a very real sense, many practical problems can now be solved using only the techniques discussed up until this point. But this chapter is actually a convenient launching point for the rest of the discussion in this book, not a stopping point. For example, we have only discussed how to compare at most two population means; when the problem calls for the simultaneous comparison of more than two population means, the appropriate technique, ANOVA, is yet to be discussed. Although based on the F-test, to which we were introduced in this chapter, there is much more to the approach, as we shall see later, particularly in Chapter 19. Furthermore, ANOVA is only a part, albeit a foundational part, of Chapter 19, a chapter devoted to the design of experiments, the third pillar of statistics, which is concerned with ensuring that the samples used for statistical inference are as information rich as possible.

Immediately following this chapter, Chapter 16 (Regression Analysis) deals with estimation of a different kind, when the population parameters of interest are not constant as they have been thus far, but functions of another variable; naturally, much of the results of Chapter 14 and this current chapter are employed in dealing with such problems. Chapter 17 (Probability Model Validation) builds directly on the hypothesis testing results of this chapter in presenting techniques for explicitly validating postulated probability models; Chapter 18 (Nonparametric Methods) presents "distribution free" versions of many of the hypothesis tests discussed in this current chapter, a useful set of tools to have when one is unsure about the validity of the probability distributional assumptions (mostly the normality assumption) upon which classical tests are based. Even the remaining chapters beyond Chapter 19 (on case studies and special topics) all draw heavily from this chapter. A good grasp of the material in this chapter will therefore facilitate comprehension of the upcoming discussions in the remainder of the book.


REVIEW QUESTIONS
1. What is a statistical hypothesis?
2. What differentiates a simple hypothesis from a composite one?
3. What is H0 , the null hypothesis, and what is Ha , the alternative hypothesis?
4. What is the difference between a two-sided and a one-sided hypothesis?
5. What is a test of a statistical hypothesis?
6. How is the US legal system illustrative of hypothesis testing?
7. What is a test statistic?
8. What is a critical/rejection region?
9. What is the definition of the significance level of a hypothesis test?
10. What are the types of errors to which hypothesis tests are susceptible, and what
are their legal counterparts?
11. What is the α-risk, and what is the β-risk?

12. What is the power of a hypothesis test, and how is it related to the β-risk?

13. What is the sensitivity of a test as opposed to the specificity of a test?

14. How are the performance measures, sensitivity and specificity, related to the α-risk and the β-risk?

15. What is the p-value, and why is it referred to as the observed significance level?
16. What is the general procedure for carrying out hypothesis testing?
17. What test statistic is used for hypotheses concerning the single mean of a normal
population when the variance is known?
18. What is a z-test?
19. What is an upper-tailed test as opposed to a lower-tailed test?
20. What is the one-sample z-test?
21. What test statistic is used for hypotheses concerning the single mean of a normal population when the variance is unknown?

22. What is the one-sample t-test, and what differentiates it from the one-sample z-test?
23. How are confidence intervals related to hypothesis tests?
24. What test statistic is used for hypotheses concerning two normal population
means when the variances are known?
25. What test statistic is used for hypotheses concerning two normal population
means when the variances are unknown but equal?
26. What test statistic is used for hypotheses concerning two normal population
means when the variances are unknown but unequal?
27. Is the distribution of the t-statistic used for the two-sample t-test with unknown
and unequal variances an exact t-distribution?
28. What is a paired t-test, and what are the important characteristics that set the
problem apart from the general two-sample t-test?
29. In determining power and sample size, what is the z-shift?
30. In determining power and sample size, what are the three hypothesis test characteristic parameters making up the z-shift? What is the equation relating them to the α- and β-risks?

31. How can the α-risk be reduced without simultaneously increasing the β-risk?
32. What are some practical considerations discussed in this chapter regarding the
determination of the power of a hypothesis test and sample size?
33. For general power and sample size determination problems, it is typical to specify
which two problem characteristics, leaving which three parameters to be determined?
34. What is the test concerning the single variance of a normal population called?
35. What test statistic is used for hypotheses concerning the single variance of a
normal population?
36. What test statistic is used for hypotheses concerning two variances from mutually independent normal populations?
37. What is the F-test?

38. The F-test is quite sensitive to which assumption?


39. What test statistic is used in the large sample approximation test concerning a
single population proportion?
40. What is the objective criterion for ascertaining the validity of the large sample
assumption in tests concerning a single population proportion?
41. What is involved in exact tests concerning a single population proportion?
42. What test statistic is used for hypotheses concerning two population proportions?
43. What is the central issue in testing hypotheses about non-Gaussian populations?
44. How does sample size influence how hypotheses about non-Gaussian populations are tested?
45. What options are available when testing hypotheses about non-Gaussian populations with small samples?
46. What are likelihood ratio tests?
47. What is the likelihood ratio test statistic?
48. Why is the likelihood ratio parameter λ such that 0 < λ < 1? What does a λ value close to zero indicate? And what does a λ value close to 1 indicate?
49. Under what condition does the likelihood ratio test become identical to the familiar z-test?
50. Under what condition does the likelihood ratio test become identical to the familiar t-test?
51. What is the asymptotic distribution result for the likelihood ratio statistic?
52. What are some criticisms of hypothesis testing highlighted in this chapter?
53. In light of some of the criticisms discussed in this chapter, what recommendations
have been proposed for approaching hypothesis testing in practice?

EXERCISES
Section 15.2
15.1 The target Mooney viscosity of the elastomer produced in a commercial process is 44.0; if the average Mooney viscosity of product samples acquired from the process hourly and analyzed in the quality control laboratory exceeds or falls below this target, the process is deemed out of control and in need of corrective control action. Formulate the decision-making about the process performance as a hypothesis test, stating the null and the alternative hypotheses.
15.2 A manufacturer of energy-saving light bulbs wants to establish that the lifetime of its new brand exceeds the specification of 1,000 hours. State the appropriate null and alternative hypotheses.
15.3 A pharmaceutical company wishes to show that its newly developed acne medication reduces teenage acne by an average of 55% in the first week of usage. What are the null and alternative hypotheses?
15.4 The owner of a fleet of taxi cabs wants to determine if there is a difference in the lifetime of two different brands of car batteries used in the fleet of cabs. State the appropriate null and alternative hypotheses.
15.5 The safety coordinator of a manufacturing facility wishes to demonstrate that
the mean time (in days) between safety incidents has deteriorated from the traditional 30 days. What are the appropriate null and alternative hypotheses?
15.6 Consider $X_1, X_2, \ldots, X_n$, a random sample from a normal population with a postulated mean $\mu_0$ but known variance;
(i) If a test is based on the following criterion
$$P\left(\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} > 1.65\right) = \alpha$$
what is (a) the type of hypothesis being tested; (b) the test statistic; and (c) the significance level?
(ii) If instead, the variance is unknown, and the criterion is changed to:
$$P\left(\frac{\bar{X} - \mu_0}{s/\sqrt{n}} > 2.0\right) = 0.051$$
what is n?
15.7 Given a sample of size n = 15 from a normal distribution with unknown mean and variance, a t-test statistic of 2.0 was determined for a one-sided, upper-tailed test. Determine the associated p-value. From a different sample of unknown size drawn from the same normal population, the rejection region for a test at the significance level of $\alpha = 0.05$ was determined as $t > 1.70$. What is n?
15.8 Consider a random sample, $X_1, X_2, \ldots, X_n$, from an exponential population, $\mathcal{E}(\beta)$. It is known that the sample mean, $\bar{X}$, possesses a gamma $\gamma(n, \beta/n)$ distribution. A hypothesis test regarding the population parameter $\beta$ is to be determined using the following criterion:
$$P\left(C_L < \frac{\bar{X}}{\beta} < C_R\right) = 0.95$$
For this problem, what is (i) the type of hypothesis being tested; (ii) the test statistic; (iii) the significance level; and (iv) if n = 20, the rejection region?


Section 15.3
15.9 A random sample of size n = 16 drawn from a normal population with hypothesized mean $\mu_0 = 50$ and known variance, $\sigma^2 = 25$, produced a sample average $\bar{x} = 47.5$.
(i) Compute the appropriate test statistic.
(ii) If the alternative hypothesis is $H_a: \mu < \mu_0$, at the $\alpha = 0.1$ significance level, determine the rejection region. Should the null hypothesis $H_0: \mu = \mu_0$ be rejected?
15.10 Refer to Exercise 15.9. Determine the 95% confidence interval estimate for the population mean and compare it with the hypothesized mean. What does this imply about whether or not the null hypothesis $H_0: \mu = \mu_0$ should be rejected at the $\alpha = 0.05$ level? Determine the p-value associated with this test.
15.11 Refer to Exercise 15.9. If instead the population variance is unknown and the sample variance is obtained as $s^2 = 39.06$, should the null hypothesis $H_0: \mu = \mu_0$ be rejected at the $\alpha = 0.1$ level, and the $\alpha = 0.05$ level?
15.12 The following random sample was obtained from a normal population with variance given as 1.00.
$$S_N = \{9.37, 8.86, 11.49, 9.57, 9.15, 9.10, 10.26, 9.87, 7.82, 10.47\}$$
To test the hypothesis that the population mean is different from a postulated value of 10.00, (i) state the appropriate null and alternative hypotheses; (ii) determine the appropriate test statistic; (iii) determine the rejection region for a hypothesis test at the $\alpha = 0.05$ significance level; (iv) determine the associated p-value. (v) What conclusion should be drawn from this test?
15.13 Refer to Exercise 15.12. Repeat for the case where the population variance is
unknown. Does this fact change the conclusion of the test?
15.14 A random sample of size n = 50 from a normal population with $\sigma = 3.00$ produced a sample mean of 80.05. At a significance level $\alpha = 0.05$,
(i) Test the null hypothesis that the population mean $\mu_0 = 75.00$ against the alternative that $\mu > 75.00$; interpret the result of the test.
(ii) Test the null hypothesis against the alternative $\mu \neq 75.00$. Interpret the result of this test and compare it to the test in (i). Why are the results different?
15.15 In carrying out a hypothesis test of $H_0: \mu = 100$ versus the alternative, $H_a: \mu > 100$, given that the population variance is 1600, it has been recommended to reject $H_0$ in favor of $H_a$ if the mean of a random sample of size n = 100 exceeds 106. What is $\alpha$, the significance level behind the statement?
Section 15.4
15.16 Random samples of 50 observations are each drawn from two independent normal populations, $N(10.0, 2.5^2)$ and $N(12.0, 3.0^2)$; if $\bar{X}$ represents the sample mean from the first population, and $\bar{Y}$ is the sample mean from the second population,
(i) Determine the sampling distribution for $\bar{X}$ and $\bar{Y}$;
(ii) Determine the mean and variance of the sampling distribution for $\bar{Y} - \bar{X}$;
(iii) Determine the z-statistic associated with actual sample averages obtained as $\bar{x} = 10.9$ and $\bar{y} = 11.8$. Use this to test $H_0: \mu_Y = \mu_X$ against $H_a: \mu_Y > \mu_X$.
(iv) Since we know from the supplied population information that $\mu_Y > \mu_X$, interpret the results of the test in (iii).
15.17 Two samples of sizes $n_1 = 20$ and $n_2 = 15$ taken from two independent normal populations with known standard deviations, $\sigma_1 = 3.5$ and $\sigma_2 = 4.2$, produced sample averages, $\bar{x}_1 = 15.5$ and $\bar{x}_2 = 13.8$. At the $\alpha = 0.05$ significance level, test the null hypothesis that the means are equal against the alternative that they are not. Interpret the result. Repeat this test for $H_a: \mu_1 > \mu_2$; interpret the result.
15.18 The data in the table below is a random sample of 15 observations each from two normal populations with unknown means and variances. Test the null hypothesis that the two population means are equal against the alternative that $\mu_Y > \mu_X$. First assume that the two population variances are equal. Interpret your results. Repeat the test without assuming equal variances. Is there a difference in the conclusions?

Sample   X       Y
1        12.03   13.74
2        13.01   13.59
3        9.75    10.75
4        11.03   12.95
5        5.81    7.12
6        9.28    11.38
7        7.63    8.69
8        5.70    6.39
9        11.75   12.01
10       6.28    7.15
11       12.53   13.47
12       10.22   11.57
13       7.17    8.81
14       11.36   13.10
15       9.16    11.32

15.19 Refer to Exercise 15.18. (i) On the same graph, plot the data for X and for Y against sample number. Comment on any feature that might indicate whether the two samples can be treated as independent or not.
(ii) Treat the samples as 15 paired observations and test the null hypothesis that the two population means are equal against the alternative that $\mu_Y > \mu_X$. Interpret your result and compare it with the results of Exercise 15.18.
15.20 The data below are random samples from two independent lognormal distributions; specifically, $X_{L1} \sim \mathcal{L}(0, 0.25)$ and $X_{L2} \sim \mathcal{L}(0.25, 0.25)$.

X_L1      X_L2
0.81693   1.61889
0.96201   1.15897
1.03327   1.17163
0.84046   1.09065
1.06731   1.27686
1.34118   0.91838
0.77619   1.45123
1.14027   1.47800
1.27021   2.16068
1.69466   1.46116

(i) For the time being, ignore the fact that the sample size is too small to make the normal approximation valid for the sampling distribution of the sample means. At the $\alpha = 0.05$ significance level, carry out a two-sample t-test concerning the equality of the means of these two populations, against the alternative that they are not equal. Interpret your results in light of the fact that we know that the two populations are not equal.
(ii) Fortunately, a logarithmic transformation of lognormally distributed data yields normally distributed data; as a result, let $Y_1 = \ln X_{L1}$ and $Y_2 = \ln X_{L2}$ and repeat (i) for the log-transformed $Y_1$ and $Y_2$ data. Interpret your results.
(iii) Comment on the implication of these results on the inappropriate use of the normal approximation as well as the use of $\alpha = 0.05$ in a dogmatic fashion.
15.21 The sample averages $\bar{X} = 38.8$ and $\bar{Y} = 42.4$ were obtained from random samples taken from two independent populations of respective sizes $n_x = 120$ and $n_y = 90$. The corresponding sample variances were obtained as $s_x^2 = 20$ and $s_y^2 = 35$. At the $\alpha = 0.05$ significance level, test the hypothesis that the population mean $\mu_Y$ is greater than $\mu_X$. Interpret your result. How will the result change if instead, the hypothesis to be tested is that the two population means are different?
Section 15.5
15.22 A random sample of size n = 100 from a normal population with unknown mean and variance is to be used to test the null hypothesis, $H_0: \mu = 12$, versus the alternative, $H_a: \mu \neq 12$. The observed sample standard deviation is s = 0.5.
(i) Determine the rejection region for $\alpha = 0.1$, 0.05 and 0.01.
(ii) If the true population mean has shifted to $\mu = 11.9$, determine the value of $\beta$ corresponding to each of the rejection regions obtained in (i) and hence the power of each test. Comment on the effect that lowering $\alpha$ has on the corresponding values of $\beta$ and power.
15.23 In the following, given $\alpha = 0.05$ and $\sigma = 1.0$, determine the missing hypothesis test characteristic parameter for a two-sided, 1-sample z-test:
(i) Power = 0.9, sample size = 40; and Power = 0.9, sample size = 20.
(ii) Power = 0.75, sample size = 40; and Power = 0.75, sample size = 20.
(iii) Power = 0.9, hypothesized mean shift to be detected = 0.5, and Power = 0.9, hypothesized mean shift to be detected = 0.75.
(iv) Power = 0.75, hypothesized mean shift to be detected = 0.5, and Power = 0.75, hypothesized mean shift to be detected = 0.75.
(v) Hypothesized mean shift to be detected = 0.5, sample size = 40; and hypothesized mean shift to be detected = 0.5, sample size = 20.
(vi) Hypothesized mean shift to be detected = 0.75, sample size = 40; and hypothesized mean shift to be detected = 0.75, sample size = 20.
15.24 Refer to Exercise 15.23. Repeat for a 1-sample t-test. Comment on any differences in the computed characteristic parameter and whether or not this difference will mean anything in practice.
15.25 Refer to the data in Exercise 15.20 where a logarithmic transformation of the data yielded $Y_1$ and $Y_2$ that are random samples from normal populations. The respective postulated means are 0 and 0.25, and the postulated standard deviations are both equal to 0.25. What is the power of the two-sided test of $H_0: \mu_{Y_1} = \mu_{Y_2} = 0$ versus $H_a: \mu_{Y_1} \neq \mu_{Y_2}$ (because $\mu_{Y_2} = 0.25$), carried out with the indicated sample of 10 observations at the $\alpha = 0.05$ significance level? What sample size would have been required in order to carry out this test with power 0.9 or better?
15.26 Samples are to be taken from two independent normal populations, one with variance $\sigma_1^2 = 10$, the other with a variance twice the magnitude of the first one. If the difference between the two population means is to be estimated to within $\pm 2$ with a two-sample test at the $\alpha = 0.05$ significance level, determine the sample size required for the test to have power of 0.9 or better, assuming that $n_1 = n_2 = n$. State any other assumptions needed to answer this question.
15.27 If two variables y and x are related according to
$$y = f(x)$$
the relative sensitivity of y to changes in x is defined as:
$$S_r = \frac{\partial \ln y}{\partial \ln x}$$
It provides a measure of how relative changes $\Delta x/x$ in x translate to corresponding relative changes $\Delta y/y$ in y.
Now recall the expression given in Eq (15.98) which relates the sample size required to detect, at a minimum, a signal of magnitude $\delta$, in the midst of intrinsic noise characterized by standard deviation $\sigma$, using an upper-tailed test at the significance level $\alpha = 0.05$ and with power $(1-\beta) = 0.9$, i.e.,
$$n = \left(\frac{2.925}{S_N}\right)^2$$
Show that the relative sensitivity of the sample size n to the signal-to-noise ratio $S_N = \delta/\sigma$ is $-2$, thus establishing that a 1% increase in the signal-to-noise ratio $S_N$ translates to an (instantaneous) incremental reduction of 2% in sample size requirements. Comment on ways by which one might increase signal-to-noise ratio in practical problems.
Section 15.6
15.28 The variance of a sample of size n = 20 drawn from a normal population with mean 100 was obtained as $s^2 = 9.5$. At the $\alpha = 0.05$ significance level, test the hypothesis that the true population variance is 10.
15.29 Refer to the data in Exercise 15.20. A logarithmic transformation of the data is postulated to yield $Y_1$ and $Y_2$, random samples from normal populations with respective postulated means 0 and 0.25, and postulated equal standard deviation 0.25. Is there evidence in the log-transformed data to support the hypothesis that $\sigma_{Y_1} = 0.25$, and the hypothesis that $\sigma_{Y_2} = 0.25$?
15.30 A sample of 20 observations from a normal population was used to carry out a test concerning the unknown population variance at the $\alpha = 0.05$ significance level. The hypothesis that the population variance is equal to a postulated value, $\sigma_0^2$, was eventually rejected in favor of the alternative that the population variance is higher. What is the relationship between the observed sample variance $s^2$ and the postulated variance?
15.31 Refer to Exercise 15.12 and the supplied data purported to be a random sample obtained from a normal population with variance given as 1.00.
$$S_N = \{9.37, 8.86, 11.49, 9.57, 9.15, 9.10, 10.26, 9.87, 7.82, 10.47\}$$
At the $\alpha = 0.05$ significance level confirm or refute this postulate about the population variance.
15.32 Refer to Exercise 15.29 and confirm directly the postulate that the two variances, $\sigma_{Y_1}^2$ and $\sigma_{Y_2}^2$, are equal. State the p-value associated with the test.
15.33 (i) Determine the rejection region to be used in testing $H_0: \sigma_1^2 = \sigma_2^2$ against $H_a: \sigma_1^2 \neq \sigma_2^2$, with $\nu_1 = 30$, $\nu_2 = 25$, for each of the following cases: (a) $\alpha = 0.05$; (b) $\alpha = 0.10$.
(ii) Repeat (i) when the alternative is $H_a: \sigma_1^2 > \sigma_2^2$.
15.34 A random sample of size $n_1 = 12$ from a normal population with unknown mean $\mu_1$ and unknown variance $\sigma_1^2$, and another independent random sample of size $n_2 = 13$ from a different normal population with unknown mean $\mu_2$ and unknown variance $\sigma_2^2$, are to be used to test the null hypothesis $H_0: \sigma_2^2 = k\sigma_1^2$ against the alternative $H_a: \sigma_2^2 > k\sigma_1^2$. Using the F-statistic, $S_2^2/S_1^2$, the critical region was obtained as $f > 5.58$ at the $\alpha = 0.05$ significance level. Determine the value of the constant k.
15.35 The risk associated with two different stocks is quantified by the volatility, i.e., the variability in daily prices. The volatility was determined as $s_1^2 = 0.58$ for the first stock and $s_2^2 = 0.21$ for the second, the variances having been determined from a random sample of 25 daily price changes in each case. Test the hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ against $H_a: \sigma_1^2 > \sigma_2^2$. What is the p-value of the test? Interpret your result.
Section 15.7
15.36 A random sample of 100 observations from a binomial population resulted in an estimate $\hat{p} = 0.72$ of the true population proportion, p. Is the sample size large enough to use the large sample approximation to test the hypothesis $H_0: p = 0.75$ against the alternative $H_a: p \neq 0.75$? Test the indicated hypothesis at the $\alpha = 0.05$ level and interpret your result. What is the associated p-value?
15.37 A random sample of 50 observations from a binomial population was used to estimate the population parameter p as $\hat{p} = 0.63$.
(i) Construct a 95% confidence interval for p.
(ii) If it is desired to obtain an estimate of p to within $\pm 0.10$ with 95% confidence, what sample size would be required?
(iii) Extend the result in (ii) to the general case: determine an expression for the sample size required to estimate the population parameter p with $\hat{p}$, with the desired limits of the $(1-\alpha) \times 100\%$ confidence interval specified as $\pm\epsilon$.
15.38 A supposedly fair coin was tossed 10 times and 4 heads were obtained. Test the hypothesis that the coin is fair versus the alternative that it is not fair. Use both the large sample approximation (check that this is valid first) and the exact test. Compare the two p-values associated with each test and interpret your result. Had the coin been tossed only 5 times with 2 heads resulting (so that the same $\hat{p}$ would have been obtained), what will change in how the hypothesis test is carried out?
15.39 A random sample of $n_1 = 100$ observations from one binomial population and a separate random sample of $n_2 = 75$ observations from a second binomial population produced $x_1 = 33$ and $x_2 = 27$ successes respectively.
(i) At the $\alpha = 0.05$ significance level, test the hypothesis that the two population proportions are equal against the alternative that they are different.
(ii) Repeat (i) for the alternative hypothesis that $p_2 > p_1$.
15.40 Two binomial population proportions are suspected to be approximately 0.3, but they are not known precisely. Using these as initial estimates, it is desired to conduct a study to determine the difference between these two population proportions. Obtain a general expression for the sample size $n_1 = n_2 = n$ required in order to determine the difference $p_1 - p_2$ to within $\pm\epsilon$ with $(1-\alpha) \times 100\%$ confidence. In the specific case where $\epsilon = 0.02$ and $\alpha = 0.05$, what is n?
Sections 15.8 and 15.9
15.41 The following data is sampled from an exponential population with unknown parameter $\beta$.

6.99    2.84    0.41    3.75    2.16
0.52    0.67    2.72    5.22    16.65
10.36   1.66    3.26    1.78    1.31
5.75    0.12    6.51    4.05    1.52

(i) On the basis of the exact sampling distribution of the sample mean, $\bar{X}$, determine a 95% confidence interval estimate of the population parameter, $\beta$.
(ii) Test the hypothesis $H_0: \beta = 4$ versus the alternative $H_a: \beta \neq 4$. What is the p-value associated with this test?
(iii) Repeat the test in (ii) using the normal approximation (which is not necessarily valid in this case). Compare this test result with the one in (ii).
15.42 Refer to Exercise 15.41. Using the normal approximation, test the hypothesis that the sample variance is 16 versus the alternative that it is not, at the $\alpha = 0.05$ significance level. Comment on whether or not this result should be considered as reliable.
15.43 Given a random sample, $X_1, X_2, \ldots, X_n$, from a gamma $\gamma(\alpha, \beta)$ distribution, use the reproductive properties result discussed in Chapter 8 to obtain the sampling distribution of the sample mean, $\bar{X} = (\sum_{i=1}^{n} X_i)/n$. In the specific case where the population parameters are postulated to be $\alpha = 2$ and $\beta = 20$, so that the population mean is $\mu = 40$, a random sample of size n = 20 yielded an average $\bar{x} = 45.6$. Obtain an exact 95% confidence interval estimate of the true population mean, and from this test, at the $\alpha = 0.05$ significance level, the null hypothesis $H_0: \mu = 40$ versus the alternative, $H_a: \mu \neq 40$.
15.44 Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal $N(\mu, \sigma^2)$ population with unknown mean and variance. With the parameters represented as $\theta = (\theta_1, \theta_2) = (\mu, \sigma^2)$, use the likelihood ratio method to construct a hypothesis test for $H_0: \sigma^2 = \sigma_0^2$. First obtain the likelihood ratio $\Lambda$ and then show that instead of $\Lambda$, the test statistic
$$\Lambda^* = \frac{(n-1)S^2}{\sigma_0^2}$$
should be used, where $S^2$ is the sample variance. Hence, establish that the likelihood ratio test for the variance of a single normal population is identical to the result obtained in Section 15.6.
15.45 Let $X_1, X_2, \ldots, X_n$ be a random sample from an exponential population $\mathcal{E}(\beta)$. Establish that the likelihood ratio test of the hypothesis that $\beta = \beta_0$ versus the alternative that $\beta \neq \beta_0$ will result in a rejection region obtained from the solution to the following inequality:
$$\frac{\bar{x}}{\beta_0}\, e^{-\bar{x}/\beta_0} \le k$$
where $\bar{x}$ is the observed sample average, and k is a constant.

APPLICATION PROBLEMS
15.46 In a study to determine the performance of processing machines used to add raisins to trial-size Raisin Bran cereal boxes (see Example 12.3 in Chapter 12), 6 sample boxes are taken at random from each machine's production line and the number of raisins in each box counted. The results for machines 3 and 4 are shown below. Assume that these can be considered as random samples from a normal population.

Machine 3   Machine 4
13          7
7           4
11          7
9           7
12          12
18          18

(i) If the target average number of raisins dispensed per box is 10, by carrying out appropriate hypothesis tests determine which of the two machines is operating according to target and which is not. State the p-value associated with each test.
(ii) Is there any significant difference between the mean number of raisins dispensed by these two machines? Support your answer adequately.
15.47 The data table below shows the wall thickness (in ins) of cast aluminum cylinder heads used in aircraft engine cooling jackets, taken from Mee (1990)⁵.

0.223  0.228  0.214  0.193  0.223  0.213  0.218  0.215  0.233
0.201  0.223  0.224  0.231  0.237  0.217  0.204  0.226  0.219

If the manufacturing process is designed to produce cylinder heads whose wall thicknesses follow a normal distribution with mean wall thickness of 0.22 ins and standard deviation 0.01 ins, confirm or refute the claim that the process was operating as designed when the samples shown in the table were obtained. State any assumptions you may need to make in answering this question and keep in mind that there are two separate parameters in the postulated process characteristics.
15.48 The data set below, $S_{10}$, from Holmes and Mergen (1992)⁶, is a sample of viscosity measurements taken from ten consecutive, but independent, batches of a product made in a batch chemical process.
$$S_{10} = \{13.3, 14.5, 15.3, 15.3, 14.3, 14.8, 15.2, 14.9, 14.6, 14.1\}$$
The desired target value for the product viscosity is 14.9. Assuming that the viscosity data constitutes a random sample from a normal population with unknown mean and unknown variance, at the $\alpha = 0.05$ significance level, test the hypothesis that the mean product viscosity is on target versus the alternative that it is not. What is the p-value associated with this test? Interpret your result. If the test were to be conducted at the $\alpha = 0.10$ significance level, will this change your conclusion?
15.49 Kerkhof and Geboers (2005)⁷ presented a new approach to modeling multicomponent transport that is purported to yield more accurate predictions. To demonstrate the performance of their modeling approach, the authors determined, experimentally, the viscosity (10⁻⁵ Pa·s) of 12 different gas mixtures and compared them to the corresponding values predicted by the classical Hirschfelder-Curtiss-Bird (HCB) model⁸ and their new (KG) model. The results are shown in the table below.

⁵Mee, R. W., (1990). "An improved procedure for screening based on a correlated, normally distributed variable," Technometrics, 32, 331–337.
⁶Holmes, D.S., and A.E. Mergen, (1992). "Parabolic control limits for the exponentially weighted moving average control charts," Qual. Eng., 4, 487–495.
⁷Kerkhof, P.J.A.M., and M.A.M. Geboers, (2005). "Toward a unified theory of isotropic molecular transport phenomena," AIChE Journal, 51(1), 79–121.
Viscosity (10⁻⁵ Pa·s)
Experimental    HCB            KG
Data            Predictions    Predictions
2.740           2.718          2.736
2.569           2.562          2.575
2.411           2.429          2.432
2.504           2.500          2.512
3.237           3.205          3.233
3.044           3.025          3.050
2.886           2.895          2.910
2.957           2.938          2.965
3.790           3.752          3.792
3.574           3.551          3.582
3.415           3.425          3.439
3.470           3.449          3.476
(i) Treated as paired data, perform an appropriate hypothesis test to compare the new KG model predictions with the corresponding experimental results. Is there evidence to support the claim that this model provides "excellent" agreement with experimental data?
(ii) Treated as paired data, test whether there is any significant difference between the HCB model predictions and the new KG model predictions.
(iii) As in (i), perform a test to assess the performance of the classic HCB model predictions against experimental data. Can the HCB model be considered as also providing "excellent" agreement with experimental data?
15.50 The table below, from Lucas (1985)⁹, shows the number of accidents occurring per quarter (three months), over a 10-year period, at a DuPont company facility. The data set is divided into two periods: Period I, for the first five-year period of the study; Period II, for the second five-year period.

Period I             Period II
5   5  10   8        3   4   2   0
4   5   7   3        1   3   2   2
2   8   6   9        7   7   1   4
5   6   5  10        1   2   2   1
6   3   3  10        4   4   4   4

(i) Perform appropriate tests to confirm or refute the hypothesis that the true population mean number of accidents in the first period is 6, while the same population parameter was halved in the second period.
(ii) Separately test the hypothesis that there is no significant difference between the mean number of accidents in each period. State any assumptions needed to answer these questions.
⁸Hirschfelder, J.O., C.F. Curtiss, and R.B. Bird, (1964). Molecular Theory of Gases and Liquids, 2nd printing. J. Wiley, New York, NY.
⁹Lucas, J. M., (1985). "Counted Data CUSUMs," Technometrics, 27, 129–144.


15.51 A survey of alumni that graduated between 2000 and 2005 from the chemical engineering department of a University in the Mid-Atlantic region of the US involved 150 randomly selected individuals: 100 BS graduates and 50 MS graduates. (The PhD graduates participated in a different survey.) The survey showed, among other things, that 9.5% of BS graduates and 4.5% of MS graduates were unemployed for at least one year during this period.
(i) If the corresponding national unemployment averages for all BS and MS degree holders in all engineering disciplines over the same period are, respectively, 15.2% and 7.5%, perform appropriate hypothesis tests to determine whether or not the chemical engineering alumni of this University fare better in general than graduates with corresponding degrees in other engineering disciplines.
(ii) Does having an advanced degree make any difference in the unemployment status of the alumni of this University? Support your answer adequately.
(iii) In connection with (ii) above, if it is desired to determine any true difference between the unemployment status of this University's alumni to within ±0.5% with 95% confidence, how many alumni would have to be sampled? State any assumptions clearly.
15.52 The data set in Problems 1.13 and 14.42, shown in the table below for ease of
reference, is the time (in months) from receipt to publication of 85 papers published
in the January 2004 issue of a leading chemical engineering research journal.
19.2  15.1   9.6   4.2   5.4
 9.0   5.3  12.9   4.2  15.2
17.2  12.0  17.3   7.8   8.0
 8.2   3.0   6.0   9.5  11.7
 4.5  18.5  24.3   3.9  17.2
13.5   5.8  21.3   8.7   4.0
20.7   6.8  19.3   5.9   3.8
 7.9  14.5   2.5   5.3   7.4
19.5   3.3   9.1   1.8   5.3
 8.8  11.1   8.1  10.1  10.6
18.7  16.4   9.8  10.0  15.2
 7.4   7.3  15.4  18.7  11.5
 9.7   7.4  15.7   5.6   5.9
13.7   7.3   8.2   3.3  20.1
 8.1   5.2   8.8   7.3  12.2
 8.4  10.2   7.2  11.3  12.0
10.8   3.1  12.8   2.9   8.8

(i) Using an appropriate probability model for the population from which the data is a random sample, obtain a precise 95% confidence interval for the mean of this population; use this interval estimate to test the hypothesis by the Editor-in-Chief that the mean time-to-publication is 9 months, against the alternative that it is higher.
(ii) Considering n = 85 as a large enough sample size for a normal approximation for the distribution of the sample mean, repeat (i), carrying out an appropriate one-sample test. Compare your result with that in (i). How good is the normal approximation?
(iii) Use the normal approximation to test the hypothesis that the mean time to publication is actually 10 months, versus the alternative that it is not. Interpret your result vis-à-vis the result in (ii).
15.53 The table below (see Problem 1.15) shows a four-year record of the number of recordable safety incidents occurring per month at a plant site.
1  0  0  0  2  2  0  0  0  1  0  1
0  1  0  1  0  0  0  0  0  0  0  1
2  2  0  1  2  0  1  2  1  1  0  0
0  1  0  0  0  0  0  0  1  0  0  1

(i) If the company proposes to use as a safety performance interval (SPI) the statistic

    SI = X̄ ± 3√X̄

compute this interval from the supplied sample data.
(ii) The utility of the SPI is that any observation falling outside the upper bound is
deemed to be indicative of a potential real increase in the number of safety incidents.
Consider that over the most recent four-month period, the plant recorded 1, 3, 2, 3
safety incidents respectively. According to the SPI criterion, is there evidence that
there has been a real increase in the number of incidents during any of the most
recent four months?
15.54 Refer to Problem 15.53 and consider the supplied data as a random sample from a Poisson population with unknown mean λ.
(i) Assuming that a sample size of 48 is large enough for the normal approximation to be valid for the distribution of the sample mean, use the sample data to test the hypothesis that the population mean λ = 0.5 versus the alternative that it is not.
(ii) Use the theoretical value postulated for the population mean and compute P(X ≥ x₀ | λ = 0.5) for x₀ = 1, 2, 3, and hence determine the p-values associated with the individual hypotheses that each recent observation, 1, 2, and 3, belongs to the same Poisson population, P(0.5), against the alternative that the observations belong to a different population with λ > 0.5.
15.55 In clinics where Assisted Reproductive Technologies such as in-vitro fertilization are used to help infertile couples conceive and bear children (see Chapter 11 for a case study), it is especially important to be able to determine the probability of a single embryo resulting in a live birth at the end of the treatment cycle. As shown in Chapter 11, determining this parameter, which is equivalent to a binomial probability of success, remains a challenge, however. A typical clinical study, the results of which may be used to determine this parameter for carefully selected cohort groups, is described below.
A cohort of 100 patients under the age of 35 years (the "Younger" group), and another cohort of the same size, but consisting of patients that are 35 years and older (the "Older" group), participated in a clinical study where each patient received five embryos in an in-vitro fertilization (IVF) treatment cycle. The results are shown in the table below: x is the number of live births per delivered pregnancy; yO and yY represent, respectively, how many in the older and younger groups had the pregnancy outcome of x.
(i) At the α = 0.05 significance level, determine whether or not the single embryo probability of success parameter, p, is different for each cohort group.


(ii) At the α = 0.05 significance level, test the hypothesis that p = 0.3 for the older cohort group versus the alternative that it is less than 0.3. Interpret your result.
(iii) At the α = 0.05 significance level, test the hypothesis that p = 0.3 for the younger cohort group versus the alternative that it is greater than 0.3. Interpret your result.
(iv) If it is desired to be able to determine the probability of success parameter for the older cohort group to within ±0.05 with 95% confidence, determine the size of the cohort group to use in the clinical study.
 x    yO    yY
 0    32     8
 1    41    25
 2    21    35
 3     5    23
 4     1     8
 5     0     1

x = no. of live births in a delivered pregnancy; yO = total no. of older patients (out of 100) with pregnancy outcome x; yY = total no. of younger patients (out of 100) with pregnancy outcome x.

15.56 To characterize precisely how many sick days its employees take, a random sample of 50 employee files was selected and the following statistics were determined: x̄ = 9.50 days and s² = 42.25.
(i) Determine a 95% confidence interval on μ, the true population mean number of sick days taken per employee.
(ii) Does the evidence in the data support the hypothesis that the mean number of sick days taken by employees is less than 14.00 days?
(iii) What is the power of the test you conducted in (ii)? State any assumptions clearly.
(iv) The personnel director who ordered the study was a bit surprised at how large the computed sample variance turned out to be. However, the human resources statistician insisted that this is not necessarily larger than the typical industry value of σ² = 35. Assuming that the sample is from a normal population, carry out an appropriate test to confirm or refute this claim. What is the p-value associated with the test?
15.57 It is desired to characterize the precision of two instruments used to measure the density of a liquid stream in a refinery's distillation column. Ten measurements of a calibration sample with known density 0.85 gm/cc are shown in the table below.
Instrument 1 Measurements:  0.864  0.858  0.855  0.764  0.791  0.827  0.849  0.818  0.747  0.846
Instrument 2 Measurements:  0.850  0.916  0.861  0.866  0.874  0.901  0.836  0.953  0.733  0.836
0.836


Consider these data as random samples from two independent normal populations; carry out an appropriate test to confirm or refute the hypothesis that instrument 2 is less precise than instrument 1.
15.58 In producing the enzyme cyclodextrin glucosyltransferase in bacterial cultures via two different methods ("shaken" and "surface"), Ismail et al., (1996)¹⁰, obtained the data shown in the table below on the protein content (in mg/ml) obtained by each method.
Protein content (mg/ml)
Shaken   Surface
1.91     1.71
1.66     1.57
2.64     2.51
2.62     2.30
2.57     2.25
1.85     1.15
Is the variability in the protein content the same for both methods? State any
assumptions you may need to make in answering this question.
15.59 The table below (see also Problem 14.41 in Chapter 14) shows the time in
months between occurrences of safety violations for three operators, A, B, and
C, working in a toll manufacturing facility.
A:  1.31  0.15  3.02  3.17  4.84  0.71  0.70  1.41  2.68  0.68
B:  1.94  3.21  2.91  1.66  1.51  0.30  0.05  1.62  6.75  1.29
C:  0.79  1.22  0.65  3.90  0.18  0.57  7.26  0.43  0.96  3.76

Since the random variable in question is exponentially distributed and the sample size of 10 is considerably smaller than is required for a normal approximation to be valid for the sampling distribution of the sample mean, testing hypotheses about the difference between the means of these populations requires a different approach. The precise 95% confidence interval estimates of the unknown population parameters (obtained from the sampling distribution of the mean of an exponential random variable (Problem 14.41)) can be used to investigate if the population means overlap. An alternative approach involves the distribution of the difference between two exponential random variables.
It can be shown (Exercise 9.3) that given two independent random variables, X₁ and X₂, with identical exponential E(β) distributions, the pdf of their difference,

    Y = X₁ − X₂                                     (15.182)

is the double exponential (or Laplace) distribution defined as:

    f(y) = (1/2β) e^(−|y|/β);  −∞ < y < ∞            (15.183)

with mean 0 and variance 2β². It can be shown that, in part because of the symmetric

¹⁰Ismail, A.S., U.I. Sobieh, and A.F. Abdel-Fattah, (1996). "Biosynthesis of cyclodextrin glucosyltransferase and β-cyclodextrin by Bacillus macerans 314 and properties of the crude enzyme," The Chem. Eng. J., 61, 247–253.

nature of this distribution, the distribution of Ȳ, the mean of a random sample of size n from this distribution, is approximately Gaussian with mean 0 and variance 2β²/n. More importantly, again because of the symmetry of the distribution, the approximation is quite reasonable even for modest sample sizes as small as n = 10.
Form the differences YAB = XA − XB and YAC = XA − XC from the given data and use the normal approximation to the sampling distribution of the mean of a Laplace random variable to test the hypotheses that operator A is more safety conscious on the average than operator B, and also more than operator C. If you have to make any further assumptions, state them clearly.
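As a purely computational illustration of the mechanics involved (a sketch, not a complete solution), the comparison of operators A and B might be set up in Python as follows; the sketch assumes, and this assumption must be stated, that β is estimated from the pooled data under H₀:

import numpy as np
from scipy import stats

# Times (months) between safety violations, from the table above
A = np.array([1.31, 0.15, 3.02, 3.17, 4.84, 0.71, 0.70, 1.41, 2.68, 0.68])
B = np.array([1.94, 3.21, 2.91, 1.66, 1.51, 0.30, 0.05, 1.62, 6.75, 1.29])

# Y_AB = X_A - X_B; under H0, Ybar is approximately N(0, 2*beta^2/n)
Y = A - B
n = len(Y)
beta_hat = np.concatenate([A, B]).mean()  # pooled estimate of beta (an assumption)
z = Y.mean() / np.sqrt(2 * beta_hat**2 / n)
p_value = 1 - stats.norm.cdf(z)           # upper-tailed: "A more safety conscious" means mean diff > 0
print(z, p_value)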
15.60 It is desired to determine if there is a significant difference in the average number of high-end vacuum cleaners produced per 8-hour workday by two different assembly plants located in two different countries. A random sample of 10 such daily outputs was selected for each assembly plant from last year's production tally, and the results are summarized in the table below:
Statistics          Plant A    Plant B
Sample Size, n      10         10
Average, x̄          28         33
Variance, s²        10.5       13.2

First ascertain whether or not the two population variances are the same. Then carry out an appropriate test of the equality of the mean production output that is commensurate with your findings regarding the variances. Interpret your results.
15.61 In a certain metal oxide ore refining process, several samples (between 6 and 12) are taken monthly from various sections of the huge fluidized bed reactor and analyzed to determine average monthly residual silica content. The table below shows the results of such analyses for a 6-month period.

Month   Number of Samples   Average Silica Content,   Sample Standard
        Analyzed            x̄ (coded units)           Deviation, s
Jan     12                  65.8                      32.1
Feb      9                  36.9                      21.0
Mar     10                  49.4                      24.6
Apr     11                  74.4                      17.8
May      7                  59.3                      15.2
Jun     11                  76.6                      17.2

The standard by which the ore refining operation is declared "normal" for any month requires a mean silica content μ = 63.7, and inherent variability, σ = 21.0; otherwise the operation is considered "abnormal." At a 5% significance level, identify those months during which the refinery operation would be considered "abnormal." Support your conclusions adequately.

TABLE 15.12: Summary of Selected Hypothesis Tests and their Characteristics

Population Parameter,         Point Estimator               Test Statistic                            Test               H0 Rejection
(Null Hypothesis, H0)                                                                                                    Condition
μ; (H0: μ = μ₀)              X̄ = (Σᵢ₌₁ⁿ Xᵢ)/n              Z = (X̄ − μ₀)/(σ/√n)                       z-test             Table 15.2
μ; small sample, n < 30       X̄ (S for unknown σ)           T = (X̄ − μ₀)/(S/√n)                       t-test             Table 15.3
δ = μ₁ − μ₂; (H0: δ = δ₀)    D̄ = X̄₁ − X̄₂                   Z = (D̄ − δ₀)/√(σ₁²/n₁ + σ₂²/n₂)           2-sample z-test    Table 15.4
δ = μ₁ − μ₂; (H0: δ = δ₀),   D̄; Sₚ² = [(n₁ − 1)S₁²          T = (D̄ − δ₀)/[sₚ√(1/n₁ + 1/n₂)]           2-sample t-test    Table 15.5
  small sample, n < 30          + (n₂ − 1)S₂²]/(n₁ + n₂ − 2)
δ = μ₁ − μ₂; (Paired)        D̄ = (Σᵢ₌₁ⁿ Dᵢ)/n,             T = (D̄ − δ₀)/(S_D/√n)                     Paired t-test      Table 15.7
                                (Dᵢ = X₁ᵢ − X₂ᵢ);
                                S_D² = Σᵢ₌₁ⁿ (Dᵢ − D̄)²/(n − 1)
σ²; (H0: σ² = σ₀²)           S²                            C² = (n − 1)S²/σ₀²                        Chi-squared-test   Table 15.9
σ₁²/σ₂²; (H0: σ₁² = σ₂²)     S₁²/S₂²                       F = S₁²/S₂²                               F-test             Table 15.10

Chapter 16
Regression Analysis

16.1 Introductory Concepts .............................................. 648
     16.1.1 Dependent and Independent Variables ......................... 648
     16.1.2 The Principle of Least Squares .............................. 651
16.2 Simple Linear Regression ........................................... 652
     16.2.1 One-Parameter Model ......................................... 652
     16.2.2 Two-Parameter Model ......................................... 653
            Primary Model Assumption .................................... 653
            Ordinary Least Squares (OLS) Estimates ...................... 654
            Maximum Likelihood Estimates ................................ 657
            Actual Regression Line and Residuals ........................ 657
     16.2.3 Properties of OLS Estimators ................................ 659
     16.2.4 Confidence Intervals ........................................ 661
            Slope and Intercept Parameters .............................. 661
            Regression Line ............................................. 663
     16.2.5 Hypothesis Testing .......................................... 664
     16.2.6 Prediction and Prediction Intervals ......................... 668
     16.2.7 Coefficient of Determination and the F-Test ................. 670
            Orthogonal Decomposition of Variability ..................... 671
            R², The Coefficient of Determination ........................ 672
            F-test for Significance of Regression ....................... 674
     16.2.8 Relation to the Correlation Coefficient ..................... 676
     16.2.9 Mean-Centered Model ......................................... 677
     16.2.10 Residual Analysis .......................................... 678
16.3 Intrinsically "Linear" Regression .................................. 682
     16.3.1 Linearity in Regression Models .............................. 682
     16.3.2 Variable Transformations .................................... 685
16.4 Multiple Linear Regression ......................................... 686
     16.4.1 General Least Squares ....................................... 687
     16.4.2 Matrix Methods .............................................. 688
            Properties of the Estimates ................................. 689
            Residuals Analysis .......................................... 691
     16.4.3 Some Important Special Cases ................................ 694
            Weighted Least Squares ...................................... 694
            Constrained Least Squares ................................... 696
            Ridge Regression ............................................ 697
     16.4.4 Recursive Least Squares ..................................... 698
            Problem Formulation ......................................... 698
            Recursive Least Squares Estimation .......................... 699
16.5 Polynomial Regression .............................................. 700
     16.5.1 General Considerations ...................................... 700
     16.5.2 Orthogonal Polynomial Regression ............................ 704
            An Example: Gram Polynomials ................................ 704
            Application in Regression ................................... 708
16.6 Summary and Conclusions ............................................ 710
REVIEW QUESTIONS ........................................................ 711
EXERCISES ............................................................... 713
APPLICATION PROBLEMS .................................................... 719

The mathematical facts worthy of being studied
are those which, by their analogy with other facts,
are capable of leading us to the knowledge of a mathematical law,
just as experimental facts lead us
to the knowledge of a physical law . . .
Henri Poincaré (1854–1912)

It is often the case in many practical problems that the variability observed in
a random variable, Y , consists of more than just the purely randomly varying
phenomena that have occupied our attention up till now. For this new class
of problems, an underlying functional relationship exists between Y and an
independent variable, x, (deliberately written in the lower case for reasons
that will soon become clear), with a purely random component superimposed
on this otherwise deterministic component. This chapter is devoted to dealing
with problems of this kind. The values observed for the random variable Y
depend on the values of the (deterministic) variable, x, and, were it not for
the presence of the purely random component, Y would have been perfectly
predictable given x. Regression analysis is concerned with obtaining, from
data, the best estimate of the relationship between Y and x.
Although apparently different from what we have dealt with up until now, we will see that regression analysis in fact builds directly upon many of the results obtained thus far, especially estimation and hypothesis testing.

16.1 Introductory Concepts

Consider the data in Table 16.1 showing the boiling point (in °C) of 8
hydrocarbons in a homologous series, along with n, the number of carbon
atoms in each molecule. A scatter plot of boiling point versus n is shown in
Fig 16.1, where we notice right away that as the number of carbon atoms in
this homologous series increases, so does the boiling point of the hydrocarbon compound. In fact, the implied relationship between these two variables
appears to be so strong that one is immediately inclined to conclude that it
must be possible to predict the boiling point of compounds in this series on
the basis of the number of carbon atoms. There is therefore no doubt that
there is some sort of a functional relationship between n and boiling point. If
determined correctly, such a relationship will provide, among other things,
a simple way to capture the extensive data on such physical properties of
compounds in this particular homologous series.


TABLE 16.1: Boiling points of a series of hydrocarbons

Hydrocarbon    n, Number of     Boiling Point
Compound       Carbon Atoms     (°C)
Methane        1                −162
Ethane         2                −88
Propane        3                −42
n-Butane       4                1
n-Pentane      5                36
n-Hexane       6                69
n-Heptane      7                98
n-Octane       8                126

[Figure] FIGURE 16.1: Boiling point of hydrocarbons in Table 16.1 as a function of the number of carbon atoms in the compound.

16.1.1 Dependent and Independent Variables

Many cases such as the one illustrated above arise in science and engineering where the value taken by one variable appears to depend on the value taken by another. Not surprisingly, it is customary to refer to the variable whose value depends on the value of another as the dependent variable, while the other variable is known as the independent variable. It is often desired to capture the relationship between these two variables in some mathematical form. However, because of measurement errors and other sources of variability, this exercise requires the use of probabilistic and statistical techniques. Under these circumstances, the independent variable is considered as a fixed, deterministic quantity that is not subject to random variability. This is perfectly exemplified in n, the number of carbon atoms in the hydrocarbon compounds of Table 16.1; it is a known quantity not subject to random variability. The dependent variable, on the other hand, is the random variable, subject to a wide variety of potential sources of random variability, including, but not limited to, measurement uncertainties. The dependent variable is therefore represented as the random variable, Y, while the independent variable is represented as the deterministic variable, x, written in the lower case to underscore its deterministic nature.
The variability observed in the random variable, Y, is typically considered to consist of two distinct components, i.e., for each observation, Yᵢ, i = 1, 2, . . . , n:

    Yᵢ = g(xᵢ; θ) + εᵢ                               (16.1)

where g(xᵢ; θ) is the deterministic component, a functional relationship, with θ as a set of unknown parameters, and εᵢ is the random component. The deterministic mathematical relationship between these two variables is a model of how the independent x (also known as the "predictor") affects the predictable part of the dependent Y, sometimes known as the "response."
In some cases, the functional form of g(xᵢ) is known from fundamental scientific principles. For example, if Y is the distance (in cm) traveled in time tᵢ secs by a particle launched with an initial velocity, u (cm/sec), and traveling at a constant acceleration, a (cm/sec²), then we know that

    g(tᵢ; u, a) = utᵢ + (1/2)atᵢ²                    (16.2)

with θ = (u, a) as the parameters.


In most cases, however, there is no such fundamental scientific principle to suggest an appropriate form for g(xᵢ; θ); simple forms (typically polynomials) are postulated and validated with data, as we show subsequently. The result in this case is known as an empirical model because it is strictly dependent on data and not on some known fundamental scientific principle.
Regression analysis is primarily concerned with the following tasks:

• Obtaining the best estimates, θ̂, for the model parameters, θ;
• Characterizing the random sequence εᵢ; and,
• Making inferences about the parameter estimates, θ̂.


The classical treatment is based on least squares estimation, which we will discuss briefly now, before using it in the context of regression.

16.1.2 The Principle of Least Squares

Consider the case where the random sample, Y₁, Y₂, . . . , Yₙ, is drawn from a population characterized by a single, constant parameter, μ, the population mean. The random variable Y may then be written as:

    Yᵢ = μ + εᵢ                                      (16.3)

where the observed random variability is due to the random component, εᵢ. Furthermore, let the variance of Y be σ². Then from Eq (16.3), we obtain:

    E[Yᵢ] = μ + E[εᵢ]                                (16.4)

and since, by definition, E[Yᵢ] = μ, this implies that E[εᵢ] = 0. Furthermore,

    Var(Yᵢ) = Var(εᵢ) = σ²                           (16.5)

since μ is a constant. Thus, the fact that Y has a distribution (unspecified) with mean μ and variance σ² implies that in Eq (16.3), the random error term, εᵢ, has zero mean and variance σ².
To estimate μ from the given random sample, it seems reasonable to choose a value that is as close as possible to all the observed data. This concept may be represented mathematically as:

    min S(μ) = Σᵢ₌₁ⁿ (Yᵢ − μ)²                       (16.6)

The usual calculus approach to this optimization problem leads to:

    ∂S/∂μ |_{μ=μ̂} = −2 Σᵢ₌₁ⁿ (Yᵢ − μ̂) = 0            (16.7)

which, when solved, produces the result:

    μ̂ = (Σᵢ₌₁ⁿ Yᵢ)/n                                 (16.8)

A second derivative with respect to μ yields

    ∂²S/∂μ² = 2n > 0                                 (16.9)

so that indeed S(μ) achieves a minimum for μ = μ̂ in Eq (16.8).
The quantity μ̂ in Eq (16.8) is referred to as a least-squares estimator for μ in Eq (16.3), for the obvious reason that the value produced by this estimator achieves the minimum for the sum-of-squared deviation implied in Eq (16.6). It should not be lost on the reader that this estimator is also precisely the same as the familiar sample average.
The problems we have dealt with up until now may be represented in the form shown in Eq (16.3). In that context, the probability models we developed earlier may now be interpreted as models for εᵢ, the random variation around the constant random variable mean. This allows us to put the upcoming discussion of the regression problem in the context of the earlier discussions.
Finally, we note that the principle of least squares also affords us the flexibility to treat each observation, Yᵢ, differently in how it contributes to the estimation of μ. This is done by applying appropriate weights, Wᵢ, to Eq (16.3) to obtain:

    WᵢYᵢ = Wᵢμ + Wᵢεᵢ                                (16.10)

Consequently, for example, more reliable observations can be assigned larger weights than less reliable ones. Upon using the same calculus techniques, the least-squares estimate in this case can be shown (see Exercise 16.2) to be:

    μ̂ = (Σᵢ₌₁ⁿ Wᵢ²Yᵢ)/(Σᵢ₌₁ⁿ Wᵢ²) = Σᵢ₌₁ⁿ ωᵢYᵢ        (16.11)

where

    ωᵢ = Wᵢ²/(Σᵢ₌₁ⁿ Wᵢ²)                             (16.12)

Note that 0 < ωᵢ < 1. The result in Eq (16.11) is therefore an appropriately weighted average, a generalization of Eq (16.8) where ωᵢ = 1/n. This variation on the least-squares approach is known, appropriately, as weighted least-squares; we shall encounter it later in this chapter.
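To make these results concrete, the following minimal Python sketch (with synthetic data; the variable names are illustrative, not part of the text) verifies numerically that the value of μ minimizing S(μ) in Eq (16.6) is the sample average of Eq (16.8), and illustrates the weighted average of Eqs (16.11) and (16.12):

import numpy as np

rng = np.random.default_rng(1)
Y = 5.0 + rng.normal(0, 2, size=50)          # Y_i = mu + eps_i, with mu = 5

# Minimize S(mu) = sum (Y_i - mu)^2 by direct search over a fine grid
mu_grid = np.linspace(0.0, 10.0, 10001)
S = ((Y[:, None] - mu_grid[None, :])**2).sum(axis=0)
print(mu_grid[S.argmin()], Y.mean())         # both equal the sample average, Eq (16.8)

# Weighted least squares, Eqs (16.11)-(16.12)
W = rng.uniform(0.5, 2.0, size=50)           # arbitrary "reliability" weights
omega = W**2 / (W**2).sum()                  # note: 0 < omega_i < 1, and the omegas sum to 1
print((omega * Y).sum())                     # the weighted-average estimate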

16.2 Simple Linear Regression

16.2.1 One-Parameter Model

As a direct extension of Eq (16.3), let the relationship between the random variable Y and the independent (deterministic) variable, x, be:

    Y = βx + ε                                       (16.13)

where the random error, ε, has zero mean and constant variance, σ². Then E(Y|x), the conditional expectation of Y given a specific value for x, is:

    μ_{Y|x} = E(Y|x) = βx                            (16.14)

recognizable as the equation of a straight line with slope β and zero intercept. It is also known as the one-parameter regression model, a classic example of which is the famous Ohm's law in physics: the relationship between the voltage, V, across a resistor with unknown resistance, R, and the current, I, flowing through the resistive element, i.e.,

    V = IR                                           (16.15)

From data yᵢ; i = 1, 2, . . . , n, actual values of the random variable, Yᵢ, observed for corresponding values of xᵢ, the problem at hand is to obtain an estimate of the characterizing parameter β. Using the method of least squares outlined above requires minimizing the sum-of-squares function:

    S(β) = Σᵢ₌₁ⁿ (yᵢ − βxᵢ)²                         (16.16)

from where ∂S/∂β = 0 yields:

    −2 Σᵢ₌₁ⁿ xᵢ(yᵢ − β̂xᵢ) = 0                        (16.17)

which is solved for β̂ to obtain:

    β̂ = (Σᵢ₌₁ⁿ xᵢyᵢ)/(Σᵢ₌₁ⁿ xᵢ²)                     (16.18)

This is the expression for the slope of the best (i.e., least-squares) straight
line (with zero intercept) through the points (xi , yi ).
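A minimal Python sketch of Eq (16.18), using the Ohm's law example with synthetic "measurements" (the numbers are illustrative only), is:

import numpy as np

rng = np.random.default_rng(2)
I = np.linspace(0.1, 1.0, 10)                    # current (A): the deterministic x
R_true = 50.0                                    # the "unknown" resistance (ohms)
V = R_true * I + rng.normal(0, 0.5, I.size)      # measured voltage: Y = beta*x + eps

R_hat = (I * V).sum() / (I**2).sum()             # Eq (16.18): least-squares slope through the origin
print(R_hat)                                     # close to 50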

16.2.2 Two-Parameter Model

More general is the two-parameter model,

    Y = β₀ + β₁x + ε                                 (16.19)

indicating a functional relationship, g(x; θ), that is a straight line with slope β₁ and potentially non-zero intercept β₀ as the parameters, i.e.,

    θ = (β₀, β₁)ᵀ                                    (16.20)

along with E(ε) = 0; Var(ε) = σ². In this case, the conditional expectation of Y given a specific value for x is given by:

    μ_{Y|x} = E(Y|x) = β₀ + β₁x                      (16.21)

In this particular case, regression analysis is primarily concerned with obtaining the best estimates for θ = (β₀, β₁); characterizing the random sequence εᵢ; and, making inferences about the parameter estimates, θ̂ = (β̂₀, β̂₁).

[Figure] FIGURE 16.2: The true regression line, μ_{Y|x} = β₀ + β₁x, and the zero-mean random error, εᵢ.
Primary Model Assumption
In this case, the true but unknown regression line is represented by Eq (16.21), with data scattered around it. The fact that E(ε) = 0 indicates that the data scatter evenly around the true line; more precisely, the data vary randomly around a mean value that is the function of x defined by the true but unknown regression line in Eq (16.21). This is illustrated in Figure 16.2.
It is typical to assume that each εᵢ, the random component of the model, is mutually independent of the others and follows a Gaussian distribution with zero mean and variance σ², i.e., εᵢ ~ N(0, σ²). The implication in this particular case is therefore that each data point, (xᵢ, yᵢ), comes from a Gaussian distribution whose mean is dependent on the value of x, and falls on the true regression line, as illustrated in Fig 16.3. Equivalently, the true regression line passes through the means of the series of Gaussian distributions having the same variance. The two main assumptions underlying regression analysis may now be summarized as follows:

1. εᵢ forms an independent random sequence, with zero mean and variance σ² that is constant for all x;
2. εᵢ ~ N(0, σ²), so that Yᵢ ~ N(β₀ + β₁x, σ²).
Ordinary Least Squares (OLS) Estimates
Obtaining the least-squares estimates of the intercept, β₀, and slope, β₁, from data (xᵢ, yᵢ) involves minimizing the sum-of-squares function,

    S(β₀, β₁) = Σᵢ₌₁ⁿ [yᵢ − (β₁xᵢ + β₀)]²            (16.22)

[Figure] FIGURE 16.3: The Gaussian assumption regarding variability around the true regression line giving rise to εᵢ ~ N(0, σ²): the 6 points represent the data at x₁, x₂, . . . , x₆; the solid straight line is the true regression line, which passes through the means of the sequence of the indicated Gaussian distributions.
where the usual first derivatives of the calculus approach yield:

    ∂S/∂β₀ = −2 Σᵢ₌₁ⁿ [yᵢ − (β̂₁xᵢ + β̂₀)] = 0         (16.23)

    ∂S/∂β₁ = −2 Σᵢ₌₁ⁿ xᵢ[yᵢ − (β̂₁xᵢ + β̂₀)] = 0       (16.24)

These expressions rearrange to give:

    β̂₁ Σᵢ₌₁ⁿ xᵢ + β̂₀ n = Σᵢ₌₁ⁿ yᵢ                    (16.25)

    β̂₁ Σᵢ₌₁ⁿ xᵢ² + β̂₀ Σᵢ₌₁ⁿ xᵢ = Σᵢ₌₁ⁿ xᵢyᵢ          (16.26)

collectively known as the normal equations, to be solved simultaneously to produce the least squares estimates, β̂₀ and β̂₁.
Before solving these equations explicitly, we wish to direct the reader's attention to a pattern underlying the emergence of the normal equations. Beginning with the original two-parameter model equation:

    yᵢ = β₁xᵢ + β₀ + εᵢ

a summation across each term yields:

    Σᵢ₌₁ⁿ yᵢ = β₁ Σᵢ₌₁ⁿ xᵢ + β₀ n                    (16.27)

where the last term involving εᵢ has vanished upon the assumption that n is sufficiently large so that, because E(εᵢ) = 0, the sum will be close to zero (a point worth keeping in mind to remind the reader that solving the normal equations provides estimates, not precise values).
Also, multiplying the model equation by xᵢ and summing yields:

    Σᵢ₌₁ⁿ yᵢxᵢ = β₁ Σᵢ₌₁ⁿ xᵢ² + β₀ Σᵢ₌₁ⁿ xᵢ          (16.28)

where, once again, the last term involving εᵢ has vanished because of independence with xᵢ and the assumption, once again, that n is sufficiently large that the sum will be close to zero. Note that these two equations are identical to the normal equations; more importantly, as derived by summation from the original model, they are the sample equivalents of the following expectations:

    E(Y) = β₁E(x) + β₀                               (16.29)

    E(Yx) = β₁E(x²) + β₀E(x)                         (16.30)

which should help put the emergence of the normal equations into perspective.
Returning to the task of computing least squares estimates of the two model parameters, let us define the following terms:

    Sxx = Σᵢ₌₁ⁿ (xᵢ − x̄)²                            (16.31)

    Syy = Σᵢ₌₁ⁿ (yᵢ − ȳ)²                            (16.32)

    Sxy = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ)                     (16.33)

where ȳ = (Σᵢ₌₁ⁿ yᵢ)/n and x̄ = (Σᵢ₌₁ⁿ xᵢ)/n represent the usual averages. When expanded out and consolidated, these equations yield:

    nSxx = n Σᵢ₌₁ⁿ xᵢ² − (Σᵢ₌₁ⁿ xᵢ)²                 (16.34)

    nSyy = n Σᵢ₌₁ⁿ yᵢ² − (Σᵢ₌₁ⁿ yᵢ)²                 (16.35)

    nSxy = n Σᵢ₌₁ⁿ xᵢyᵢ − (Σᵢ₌₁ⁿ xᵢ)(Σᵢ₌₁ⁿ yᵢ)       (16.36)

These terms, clearly related to sample variances and covariances, allow us to solve Eqs (16.25) and (16.26) simultaneously to obtain the results:

    β̂₁ = Sxy/Sxx                                     (16.37)

    β̂₀ = ȳ − β̂₁x̄                                     (16.38)

Nowadays, the computations implied in this derivation are no longer carried out by hand, of course, but by computer programs; the foregoing discussion is therefore intended to acquaint the reader with the principles and mechanics underlying the numbers produced by statistical software packages.
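Still, the mechanics are easily reproduced; the following Python sketch (with synthetic data and illustrative names) evaluates Eqs (16.31), (16.33), (16.37) and (16.38) directly, and checks the result against a library fit:

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 20)
y = 2.0 + 0.8 * x + rng.normal(0, 0.3, x.size)    # true beta0 = 2.0, beta1 = 0.8

Sxx = ((x - x.mean())**2).sum()                   # Eq (16.31)
Sxy = ((x - x.mean()) * (y - y.mean())).sum()     # Eq (16.33)
b1 = Sxy / Sxx                                    # Eq (16.37)
b0 = y.mean() - b1 * x.mean()                     # Eq (16.38)
print(b0, b1)
print(np.polyfit(x, y, 1))                        # same values, returned as [b1, b0]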
Maximum Likelihood Estimates
Under the Gaussian assumption, the regression equation, written in the more general form,

    Y = g(x; θ) + ε,                                 (16.39)

implies that the observations Y₁, Y₂, . . . , Yₙ come from a Gaussian distribution with mean g(x; θ) and variance σ²; i.e., Y ~ N(g(x; θ), σ²). If the data can be considered as a random sample from this distribution, then the method of maximum likelihood presented in Chapter 14 may be used to estimate g(x; θ) and σ² in precisely the same manner in which estimates of the N(μ, σ²) population parameters were determined in Section 14.3.2. The only difference this time is that the population mean, g(x; θ), is no longer constant, but a function of x. It can be shown (see Exercise 16.5) that when the variance σ² is constant, the maximum likelihood estimate for β in the one-parameter model,

    g(x; θ) = βx                                     (16.40)

and the maximum likelihood estimates for (β₀, β₁) in the two-parameter model,

    g(x; θ) = β₀ + β₁x                               (16.41)

are each identical to the corresponding least squares estimates obtained in Eq (16.18) and in Eqs (16.38) and (16.37), respectively. It can also be shown (see Exercise 16.6) that when the variance, σᵢ², associated with each observation, Yᵢ, i = 1, 2, . . . , n, differs from observation to observation, the maximum likelihood estimates for the parameter β in the first case, and for (β₀, β₁) in the second case, are the same as the corresponding weighted least squares estimates, with weights related to the reciprocal of σᵢ.
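This equivalence is easy to verify numerically; the sketch below (Python with synthetic data, assuming SciPy is available) minimizes the negative log-likelihood directly and recovers the closed-form least-squares estimates:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 25)
y = 1.5 + 0.6 * x + rng.normal(0, 0.4, x.size)

def neg_log_likelihood(theta):
    b0, b1, log_sigma = theta
    sigma = np.exp(log_sigma)                     # parameterized to keep sigma > 0
    resid = y - (b0 + b1 * x)
    return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + (resid**2).sum() / (2 * sigma**2)

mle = minimize(neg_log_likelihood, x0=[1.0, 1.0, 0.0], method="Nelder-Mead").x
print(mle[:2])                                    # maximum likelihood (b0, b1)
print(np.polyfit(x, y, 1)[::-1])                  # OLS (b0, b1): the same, to within tolerance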
Actual Regression Line and Residuals
In the same manner in which the true (constant) mean, μ, of a Gaussian distribution producing the random sample X₁, X₂, . . . , Xₙ, is not known, only estimated by the sample average X̄, the true regression line is also never known, only estimated. When the least-squares estimates β̂₀ and β̂₁ are introduced into the original model, the result is the estimated observation, ŷ, defined by:

    ŷ = β̂₁x + β̂₀                                     (16.42)

This is not the same as the true theoretical μ_{Y|x} in Eq (16.21) because, in general, β̂₀ ≠ β₀ and β̂₁ ≠ β₁; ŷᵢ is the two-parameter model's best estimate

TABLE 16.2: Density (in gm/cc) and weight percent of ethanol in ethanol-water mixture

Density    Wt %
(g/cc)     Ethanol
0.99823    0
0.98938    5
0.98187    10
0.97514    15
0.96864    20
0.96168    25
0.95382    30
0.94494    35

(or prediction) of the true but unknown value of the observation yᵢ (unknown because of the additional random effect, εᵢ). If we now define as eᵢ the error between the actual observation and the estimated value, i.e.,

    eᵢ = yᵢ − ŷᵢ                                     (16.43)

this term is known as the residual error, or simply the residual; it is our best estimate of the unknown εᵢ, just as ŷ = β̂₁x + β̂₀ is our best estimate of the true regression line μ_{Y|x} = E(Y|x) = β₁x + β₀.
As discussed shortly (Section 16.2.10), the nature of the sequence of residuals provides a great deal of information about how well the model represents the observations.
Example 16.1: DENSITY OF ETHANOL-WATER MIXTURE
An experimental investigation into how the density of an ethanol-water mixture varies with the weight percent of ethanol in the mixture yielded the results shown in Table 16.2. Postulate a linear two-parameter model as in Eq (16.19), and use the supplied data to obtain least-squares estimates of the slope and intercept, and also the residuals. Plot the data versus the model and comment on the fit.
Solution:
Given this data set, just about any software package, from Excel to MATLAB and MINITAB, will produce the following estimates:

    β̂₁ = −0.001471; β̂₀ = 0.9975                      (16.44)

so that, if y is the density and x is the wt % of ethanol, the regression model fit to this data is given as:

    ŷ = 0.9975 − 0.001471x                           (16.45)

The model fit to the data is shown in Fig 16.4; and for the given values

TABLE 16.3: Density and weight percent of ethanol in ethanol-water mixture: model fit and residual errors

Density      Wt %         Estimated      Residual
(g/cc), y    Ethanol, x   Density, ŷ     Errors, e
0.99823      0            0.997500       0.000730
0.98938      5            0.990145       -0.000765
0.98187      10           0.982790       -0.000920
0.97514      15           0.975435       -0.000295
0.96864      20           0.968080       0.000560
0.96168      25           0.960725       0.000955
0.95382      30           0.953370       0.000450
0.94494      35           0.946015       -0.001075

[Figure] FIGURE 16.4: The fitted straight line to the Density versus Ethanol Weight % data: ŷ = 0.9975 − 0.001471x. (The additional terms included in the graph, S, R-Sq and R-Sq(adj), are discussed later.)
of x, the estimated ŷ and the residuals, e, are shown in Table 16.3. Visually, the model seems to fit quite well.
This model allows us to predict the solution density for any given weight percent of ethanol within the experimental data range but not actually part of the data. For example, for x = 7.5, Eq (16.45) estimates ŷ = 0.98647. How the residuals are analyzed is discussed in Section 16.2.10.

Expressions such as the one obtained in this example, Eq (16.45), are sometimes known as calibration curves. Such curves are used to calibrate measurement devices such as thermocouples, where the raw instrument output (say, millivolts) is converted to the actual desired measurement (say, temperature in °C) based on expressions such as the one obtained here. Such expressions are typically generated from standardized experiments where data on instrument output are gathered for various objects with known temperature.
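The computations in this example are easily reproduced; a short Python sketch using the data of Table 16.2 recovers, to within rounding, the estimates in Eq (16.44), the residuals of Table 16.3, and the prediction at x = 7.5:

import numpy as np

x = np.array([0, 5, 10, 15, 20, 25, 30, 35], dtype=float)   # wt % ethanol
y = np.array([0.99823, 0.98938, 0.98187, 0.97514,
              0.96864, 0.96168, 0.95382, 0.94494])           # density (g/cc)

b1, b0 = np.polyfit(x, y, 1)
print(b0, b1)               # approximately 0.9975 and -0.001471, as in Eq (16.44)

y_hat = b0 + b1 * x
print(y - y_hat)            # the residuals e_i of Table 16.3 (to within rounding)
print(b0 + b1 * 7.5)        # prediction at x = 7.5: approximately 0.98647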

16.2.3 Properties of OLS Estimators

When experiments are repeated for the same fixed values xᵢ, as a typical consequence of random variation, the corresponding values observed for Yᵢ will differ each time. The resulting estimates provided in Eqs (16.37) and (16.38) therefore will also change slightly each time. In typical fashion, therefore, the specific parameter estimates β̂₀ and β̂₁ are properly considered as realizations of the respective estimators, random variables that depend on the random sample Y₁, Y₂, . . . , Yₙ. It will be desirable to investigate the theoretical properties of these estimators, defined by:

    β̂₁ = Sxy/Sxx                                     (16.46)

    β̂₀ = Ȳ − β̂₁x̄                                     (16.47)

Let us begin with the expected values of these estimators. From here, we observe that

    E(β̂₁) = E(Sxy/Sxx)                               (16.48)

which, from the definitions given above, becomes:

    E(β̂₁) = (1/Sxx) E[Σᵢ₌₁ⁿ Yᵢ(xᵢ − x̄)]              (16.49)

(because Σᵢ₌₁ⁿ Ȳ(xᵢ − x̄) = 0, since Ȳ is a constant); and upon introducing Eq (16.19) for Yᵢ, we obtain:

    E(β̂₁) = (1/Sxx) E[Σᵢ₌₁ⁿ (β₁xᵢ + β₀ + εᵢ)(xᵢ − x̄)]    (16.50)

A term-by-term expansion and subsequent simplification results in

    E(β̂₁) = (1/Sxx) E[β₁ Σᵢ₌₁ⁿ xᵢ(xᵢ − x̄)]            (16.51)

because Σᵢ₌₁ⁿ (xᵢ − x̄) = 0 and E[Σᵢ₌₁ⁿ εᵢ(xᵢ − x̄)] = 0 since E(εᵢ) = 0. Hence, since Σᵢ₌₁ⁿ xᵢ(xᵢ − x̄) = Sxx, Eq (16.51) simplifies to

    E(β̂₁) = (1/Sxx) β₁Sxx = β₁                        (16.52)

indicating that β̂₁ is an unbiased estimator of β₁, the true slope.
Similarly, from Eq (16.47), we obtain:

    E(β̂₀) = E(Ȳ − β̂₁x̄) = E(Ȳ) − E(β̂₁)x̄               (16.53)

which, by virtue of Eq (16.52) and the fact that E(Ȳ) = β₁x̄ + β₀, simplifies to:

    E(β̂₀) = β₁x̄ + β₀ − β₁x̄ = β₀                       (16.54)

so that β̂₀ is also an unbiased estimator for β₀, the true intercept.
In similar fashion, by definition of the variance of a random variable, it is straightforward to show that:

    σ²_{β̂₁} = Var(β̂₁) = σ²/Sxx                        (16.55)

    σ²_{β̂₀} = Var(β̂₀) = σ²(1/n + x̄²/Sxx)              (16.56)

where σ² is the variance of the random component, ε. Consequently, the standard error of each estimate, the positive square root of the variance, is given by:

    SE(β̂₁) = σ/√Sxx                                   (16.57)

    SE(β̂₀) = σ√(1/n + x̄²/Sxx)                         (16.58)
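These properties can be confirmed by simulation; the following Python sketch (parameter values are illustrative) repeatedly regenerates data over the same fixed xᵢ and compares the empirical mean and variance of β̂₁ with β₁ and with Eq (16.55):

import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 15)                  # fixed design points
b0_true, b1_true, sigma = 1.0, 0.5, 0.6
Sxx = ((x - x.mean())**2).sum()

b1_samples = []
for _ in range(20000):                      # repeated "experiments" at the same x_i
    y = b0_true + b1_true * x + rng.normal(0, sigma, x.size)
    b1_samples.append(((x - x.mean()) * (y - y.mean())).sum() / Sxx)

b1_samples = np.array(b1_samples)
print(b1_samples.mean(), b1_true)           # unbiasedness, Eq (16.52)
print(b1_samples.var(), sigma**2 / Sxx)     # variance, Eq (16.55)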

16.2.4 Confidence Intervals

As with all estimation problems, the point estimates obtained above for the regression parameters, β̂₀ and β̂₁, are by themselves insufficient for making decisions about the true but unknown values; we must add a measure of how precise these estimates are. Obtaining interval estimates is one option; and such interval estimates are determined for regression parameters by essentially the same procedure as that presented in Chapter 14 for population parameters. This, of course, requires sampling distributions.
Slope and Intercept Parameters
Under the Gaussian distributional assumption for ε, with the implication that the sample Y₁, Y₂, . . . , Yₙ possesses the distribution N(β₀ + β₁x, σ²), and from the results obtained above about the characteristics of the estimates, it can be shown that the random variables β̂₁ and β̂₀, respectively the slope and the intercept, are distributed as β̂₁ ~ N(β₁, σ²_{β̂₁}) and β̂₀ ~ N(β₀, σ²_{β̂₀}), with the variances as shown in Eqs (16.55) and (16.56), provided the data variance, σ², is known. However, this variance is not known and must be estimated from data. This is done as follows for this particular problem.
Consider the residual errors, eᵢ, our best estimates of εᵢ; define the residual
error sum of squares as

    SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²                            (16.59)
        = Σᵢ₌₁ⁿ [yᵢ − (β̂₁xᵢ + β̂₀)]²
        = Σᵢ₌₁ⁿ [(yᵢ − ȳ) − β̂₁(xᵢ − x̄)]²              (16.60)

which, upon expansion and simplification, reduces to:

    SSE = Syy − β̂₁Sxy                                 (16.61)

It can be shown that

    E(SSE) = (n − 2)σ²                                (16.62)

and, as a result, the mean squared error, sₑ², defined as:

    sₑ² = SSE/(n − 2) = [Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²]/(n − 2)     (16.63)

is an unbiased estimate of σ².
Now, as with previous statistical inference problems concerning normal populations with unknown σ, by substituting sₑ², the mean residual sum-of-squares, for σ², we have the following results: the statistics T₁ and T₀, defined as:

    T₁ = (β̂₁ − β₁)/(sₑ/√Sxx)                          (16.64)

and

    T₀ = (β̂₀ − β₀)/[sₑ√(1/n + x̄²/Sxx)]                (16.65)

each possess a t-distribution with ν = n − 2 degrees of freedom. The immediate implications are therefore that

    β₁ = β̂₁ ± t_{α/2}(n − 2) sₑ/√Sxx                  (16.66)

    β₀ = β̂₀ ± t_{α/2}(n − 2) sₑ√(1/n + x̄²/Sxx)        (16.67)

constitute (1 − α) × 100% confidence intervals around the slope and intercept estimates, respectively.
estimates, respectively.


Example 16.2: CONFIDENCE INTERVAL ESTIMATES FOR THE SLOPE AND INTERCEPT OF ETHANOL-WATER MIXTURE DENSITY REGRESSION MODEL
Obtain 95% confidence interval estimates for the slope and intercept of the regression model obtained in Example 16.1 for the ethanol-water mixture density data.
Solution:
In carrying out the regression in Example 16.1 with MINITAB, part of the computer program output is the set of standard errors. In this case, SE(β̂₁) = 0.00002708 for the slope, and SE(β̂₀) = 0.000566 for the intercept. (These could also be computed by hand, although that is not recommended.) Since the data set consists of 8 data points, we obtain the required t₀.₀₂₅(6) = 2.447 from the cumulative probability feature. The required 95% confidence intervals are therefore obtained as follows:

    β₁ = −0.001471 ± 0.00006607                       (16.68)

    β₀ = 0.9975 ± 0.001385                            (16.69)

Note that neither of these two intervals includes 0.
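The interval arithmetic in this example is easily verified; a brief Python sketch (assuming SciPy is available) computes the t-multiplier and the two half-widths from the standard errors quoted above:

from scipy import stats

t_crit = stats.t.ppf(1 - 0.05/2, df=6)   # t_{0.025}(6) = 2.447 for n = 8
print(t_crit)
print(t_crit * 0.00002708)               # slope half-width: approximately 0.00006607
print(t_crit * 0.000566)                 # intercept half-width: approximately 0.001385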

Regression Line
The actual regression line fit (see for example Fig 16.4), an estimate of the true but unknown regression line, is obtained by introducing into Eq (16.21) the estimates for the slope and intercept parameters, to give

    μ̂_{Y|x} = β̂₁x + β̂₀                                (16.70)

For any specific value x = x*, the value

    μ̂_{Y|x*} = β̂₁x* + β̂₀                              (16.71)

is the estimate of the actual response of Y at this point (akin to the sample average estimate of a true but unknown population mean).
In the same manner in which we obtained confidence intervals for sample averages, we can also obtain a confidence interval for μ̂_{Y|x*}. It can be shown from Eq (16.71) (and Eq (16.56)) that the associated variance is:

    Var(μ̂_{Y|x*}) = σ²[1/n + (x* − x̄)²/Sxx]           (16.72)

and, because of the normality of the random variables β̂₀ and β̂₁, if σ is known, μ̂_{Y|x*} has a normal distribution with mean (β₁x* + β₀) and the variance shown in Eq (16.72). With σ unknown, substituting sₑ for it, as in the previous section, leads to the result that the specific statistic,

    t_RL = (μ̂_{Y|x*} − μ_{Y|x*}) / [sₑ√(1/n + (x* − x̄)²/Sxx)]    (16.73)

[Figure] FIGURE 16.5: The fitted regression line to the Density versus Ethanol Weight % data (solid line) along with the 95% confidence interval (dashed line). The confidence interval is narrowest at x = x̄ and widens for values further away from x̄.
.

has a t-distribution with ν = (n − 2) degrees of freedom. As a result, the (1 − α) × 100% confidence interval on the regression line (mean response) at x = x* is:

    μ_{Y|x*} = (β̂₁x* + β̂₀) ± t_{α/2}(n − 2) sₑ√(1/n + (x* − x̄)²/Sxx)    (16.74)

When this confidence interval is computed for all values of x of interest, the result is a confidence interval around the entire regression line. Again, as most statistical analysis software packages have the capability to compute and plot this confidence interval along with the regression line, the primary objective of this discussion is to provide the reader with a fundamental understanding of the theoretical bases for these computer outputs. For example, the 95% confidence interval for the Density-Wt% Ethanol problem in Examples 16.1 and 16.2 is shown in Fig 16.5.
By virtue of the (x* − x̄)² term in Eq (16.74), a signature characteristic of these confidence intervals is that they are narrowest when x* = x̄ and widen for values further away from x̄.
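The interval of Eq (16.74) is easily evaluated over a grid of x* values; the following Python sketch does so for the ethanol-water data of Example 16.1, exhibiting the narrowing at x* = x̄:

import numpy as np
from scipy import stats

x = np.array([0, 5, 10, 15, 20, 25, 30, 35], dtype=float)
y = np.array([0.99823, 0.98938, 0.98187, 0.97514,
              0.96864, 0.96168, 0.95382, 0.94494])
n = len(x)
b1, b0 = np.polyfit(x, y, 1)
se = np.sqrt(((y - (b0 + b1 * x))**2).sum() / (n - 2))       # s_e, Eq (16.63)
Sxx = ((x - x.mean())**2).sum()
t_crit = stats.t.ppf(0.975, n - 2)

xg = np.linspace(0, 35, 8)                                   # grid of x* values
hw = t_crit * se * np.sqrt(1/n + (xg - x.mean())**2 / Sxx)   # half-width, Eq (16.74)
print(np.column_stack([xg, b0 + b1*xg - hw, b0 + b1*xg + hw]))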

16.2.5 Hypothesis Testing

For this class of problems, the hypothesis of concern is whether or not there is a real (and significant) linear functional relationship between x and Y; i.e., whether or not the slope parameter, β₁ = 0, in which case the variation in Y is purely random around a constant mean value β₀ (which may or may not be zero). This translates to the following hypotheses regarding the slope parameter:

    H₀: β₁ = 0
    Hₐ: β₁ ≠ 0                                        (16.75)

And from the preceding discussion regarding confidence intervals, the appropriate test statistic for this test, from Eq (16.64), is:

    t₁ = β̂₁/(sₑ/√Sxx)                                 (16.76)

since the postulated value for the unknown β₁ is 0; and the decision to reject or not reject H₀ follows the standard two-sided t-test criteria; i.e., at the significance level α, H₀ is rejected when

    t₁ < −t_{α/2}(n − 2), or t₁ > t_{α/2}(n − 2)       (16.77)

As with previous results, these conditions are identical to the (1 − α) × 100% confidence interval on β₁ not containing zero. When there is sufficient reason to reject H₀, the estimated regression coefficient is said to be significant, by which we mean that it is significantly different from zero at the significance level α.
There is nothing to prevent testing hypotheses also about the intercept parameter, β₀, i.e., whether or not its value is significantly different from zero. The principles are precisely as indicated above for the slope parameter; the hypotheses are, in this case,

    H₀: β₀ = 0
    Hₐ: β₀ ≠ 0                                        (16.78)
with the test statistic (from Eq (16.65)):

    t₀ = β̂₀/[sₑ√(1/n + x̄²/Sxx)]                       (16.79)

and the rejection criteria,

    t₀ < −t_{α/2}(n − 2), or t₀ > t_{α/2}(n − 2)       (16.80)

In addition to computing estimates of the regression coefficients and the associated standard errors, most computer programs will also compute the t-statistics and the associated p-values for each of the two coefficients. Let us illustrate with the following example.

TABLE 16.4: Cranial circumference and finger lengths for 16 individuals

Cranial Circum (cms)   58.5  54.2  57.2  52.7  55.1  60.7  57.2  58.8
Finger Length (cms)     7.6   7.9   8.4   7.7   8.6   8.6   7.9   8.2

Cranial Circum (cms)   56.2  60.7  53.5  60.7  56.3  58.1  56.6  57.7
Finger Length (cms)     7.7   8.1   8.1   7.9   8.1   8.2   7.8   7.9

Example 16.3: CRANIAL CIRCUMFERENCE AND FINGER LENGTH
A once-popular exercise in the late 19th and early 20th centuries involved attempts at finding mathematical expressions that will allow one to predict, for a population of humans, some physical human attribute on the basis of a different one. The data in Table 16.4 shows the result of a classic example of such an exercise where the cranial circumference (in cms) and the length of the longest finger (in cms) of 16 individuals were determined. Postulate a linear two-parameter model as in Eq (16.19), obtain least-squares estimates of the slope and intercept, and test hypotheses that these parameters are not significantly different from zero. Plot the data versus the model fit and comment on the results.
Solution:
If Y is the cranial circumference and x, the finger length, using MINITAB to analyze this data set produces the following results:
Regression Analysis: Cranial Circ(cm) versus Finger Length(cm)
The regression equation is
Cranial Circ(cm) = 43.0 + 1.76 Finger Length(cm)
Predictor
Coef SE Coef
T
P
Constant
43.00
17.11
2.51 0.025
Finger Length 1.757
2.126
0.83 0.422
S = 2.49655 R-Sq = 4.7% R-Sq(adj) = 0.0%
Thus, the regression equation is obtained as

    ŷ = 43.0 + 1.76x                                  (16.81)

and the model fit to the data is shown in Fig 16.6. (Again, we defer until the appropriate place any comment on the terms included in the last line of the MINITAB output.)
It is important to note how, rather than clustering tightly around the regression line, the data show instead a significant amount of scatter, which, at least visually, calls into question the postulated dependence of cranial circumference on finger length. This question is settled concretely by the computed T statistics for the model parameters and the

[Figure] FIGURE 16.6: The fitted straight line to the Cranial circumference versus Finger length data: ŷ = 43.00 + 1.757x. Note how the data points are widely scattered around the fitted regression line. (The additional terms included in the graph, S, R-Sq and R-Sq(adj), are discussed later.)
associated p-values. The p-value of 0.025 associated with the constant (intercept parameter, β₀) indicates that we must reject the null hypothesis that β₀ = 0 in favor of the alternative that the estimated value, 43.0, is significantly different from zero, at the 5% significance level. On the other hand, the corresponding p-value associated with β₁, the coefficient of x, the finger length (i.e., the regression line slope), is 0.422, indicating that there is no evidence to reject the null hypothesis. Thus, at the 5% significance level, β₁ is not significantly different from zero, and we therefore conclude that there is no discernible relationship between cranial circumference and finger length.
Thus, the implication of the significance of the constant term, and the non-significance of the coefficient of the finger length, is two-fold: (i) that cranial circumference does not depend on finger length (at least for the 16 individuals in this study), so that the observed variability is purely random, with no systematic component that can be explained by finger length; and consequently, (ii) that the cranial circumference is best characterized for this population of individuals by the mean value (43.0 cm), a value that is significantly different from zero (as one would certainly expect!)
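The t-statistics and p-values in the MINITAB output above are straightforward to reproduce from Eqs (16.76) and (16.79); a Python sketch using the data of Table 16.4 is shown below:

import numpy as np
from scipy import stats

x = np.array([7.6, 7.9, 8.4, 7.7, 8.6, 8.6, 7.9, 8.2,
              7.7, 8.1, 8.1, 7.9, 8.1, 8.2, 7.8, 7.9])           # finger length (cm)
y = np.array([58.5, 54.2, 57.2, 52.7, 55.1, 60.7, 57.2, 58.8,
              56.2, 60.7, 53.5, 60.7, 56.3, 58.1, 56.6, 57.7])   # cranial circ. (cm)
n = len(x)
b1, b0 = np.polyfit(x, y, 1)
se = np.sqrt(((y - (b0 + b1 * x))**2).sum() / (n - 2))
Sxx = ((x - x.mean())**2).sum()

t1 = b1 / (se / np.sqrt(Sxx))                        # Eq (16.76)
t0 = b0 / (se * np.sqrt(1/n + x.mean()**2 / Sxx))    # Eq (16.79)
p1 = 2 * stats.t.sf(abs(t1), n - 2)                  # approximately 0.422
p0 = 2 * stats.t.sf(abs(t0), n - 2)                  # approximately 0.025
print(t1, p1, t0, p0)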

This last example illustrates an important point about regression analysis: one can always fit any postulated model to any given set of data; the real question is: how useful is this model? In other words, to what extent is the implied relationship between x and Y representative of the real information contained in the data? These are very important questions that will be answered systematically in the upcoming sections. For now, we note that at the most basic level, the hypothesis tests discussed here provide an objective assessment of the implied relationship, whether it is real or merely an artifact of random variability. Anytime we are unable to reject the null hypothesis on the slope parameter, the estimated value, and hence the model itself, are not significant; i.e., the "real" parameter value cannot be distinguished from zero.

16.2.6 Prediction and Prediction Intervals

A model whose parameters are confirmed as significant is useful for at least two things:
1. Estimating Mean Responses:
For a given value x = x*, the fitted regression line provides a means of estimating the expected value of the response, μ_{Y|x*}; i.e.,

    Ê(Y|x*) = β̂₁x* + β̂₀                               (16.82)

This is to be understood as the least-squares surrogate of the average of a number of replicate responses obtained when the experiment is repeated for the same fixed value x = x*.
2. Predicting a New Response:
Here, the objective is slightly different: for a given x = x*, we wish to predict the response observed from a single experiment performed at the specified value. Not surprisingly, the fitted regression line provides the best prediction, ŷ(x*), as

    ŷ(x*) = β̂₁x* + β̂₀                                 (16.83)

which is precisely the same as above in Eq (16.82). The difference lies not in the values themselves but in the precision associated with each value.
When the regression line is used as an estimator of mean (or expected) response, the precision associated with the estimate was given in the form of the
variance shown in Eq (16.72), from which we developed the condence interval
around the regression line. When the regression line is used as a prediction of
a yet-to-be-observed value Y (x ), however, the prediction error is given by:
Ep = Y (x ) Y (x )

(16.84)


which, under the normality assumption, possesses the distribution N(0, σ²_Ep), with the variance obtained from Eq (16.84) as

\sigma^2_{E_p} = Var[Y(x^*)] + Var[\hat{Y}(x^*)]
             = \sigma^2 + \sigma^2 \left[ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right]
             = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right]   (16.85)

where, we recall, σ² is the variance of the random error component, ε.

This expression differs from the expression in Eq (16.72) by the presence of the additional term, 1, making σ²_Ep > Var(Ŷ|x*) always. This mathematical fact is a consequence of the phenomenological fact that the prediction error is a combination of the variability inherent in determining the observed value, Y(x*), and the regression model error associated with Ŷ(x*), this latter quantity being the only error associated with using the regression model as an estimator of the mean response.

We may now use Eq (16.85) to obtain the (1−α)×100% prediction interval for y(x*) by substituting the data estimate, se, for the unknown standard deviation, σ, and from the resulting t-distribution characteristics, i.e.,

y(x^*) = \hat{y}(x^*) \pm t_{\alpha/2}(n-2)\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}   (16.86)

As with the confidence intervals around the regression line, these prediction intervals are narrowest when x* = x̄ and widen for values further away from x̄, but they are consistently wider than the confidence intervals.
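Although the computations in this chapter are carried out with MINITAB, the prediction interval of Eq (16.86) is straightforward to compute directly. The following is a minimal Python sketch; the function name and the use of the numpy and scipy libraries are our own illustrative choices, not part of the text:

import numpy as np
from scipy import stats

def prediction_interval(x, y, x_star, alpha=0.05):
    """Two-sided (1-alpha)x100% prediction interval for a new response
    at x = x_star, from a simple linear regression fit (Eq 16.86)."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    Sxx = np.sum((x - xbar) ** 2)
    Sxy = np.sum((x - xbar) * (y - ybar))
    th1 = Sxy / Sxx                       # slope estimate
    th0 = ybar - th1 * xbar               # intercept estimate
    resid = y - (th0 + th1 * x)
    se = np.sqrt(np.sum(resid ** 2) / (n - 2))   # estimate of sigma
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
    y_hat = th0 + th1 * x_star
    half = t_crit * se * np.sqrt(1 + 1/n + (x_star - xbar)**2 / Sxx)
    return y_hat - half, y_hat + half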
Example 16.4: HIGHWAY GASOLINE MILEAGE AND ENGINE CAPACITY FOR TWO-SEATER AUTOMOBILES
From the data shown in Table 12.5 of Chapter 12 on gasoline mileage for a collection of two-seater cars, postulate a linear two-parameter model as in Eq (16.19), for highway mileage (y) as a function of the engine capacity, x; obtain least-squares estimates of the parameters for all the cars, leaving out the Chevrolet Corvette and the Dodge Viper data (these cars were identified in that chapter as different from the others in the class because of the material used for their bodies). Show a plot of the fitted regression line, the 95% confidence interval and the 95% prediction interval.
Solution:
Using MINITAB for this problem produces the following results:

Regression Analysis: MPGHighway versus EngCapacity
The regression equation is
MPGHighway = 33.2 - 2.74 EngCapacity

Predictor       Coef     SE Coef       T      P
Constant       33.155     1.110     29.88  0.000
EngCapacity   -2.7387     0.2665   -10.28  0.000

S = 1.95167  R-Sq = 86.8%  R-Sq(adj) = 86.0%

FIGURE 16.7: The fitted straight line to the Highway MPG versus Engine Capacity data of Table 12.5 (leaving out the two inconsistent data points), along with the 95% confidence interval (long dashed line) and the 95% prediction interval (short dashed line). (Again, the additional terms included in the graph, S, R-Sq and R-Sq(adj), are discussed later.)


Thus, with Highway MPG as y, and Engine Capacity as x, the fitted regression line equation is

\hat{y} = -2.74x + 33.2   (16.87)

and, since the p-values associated with each parameter are both zero to three decimal places, we conclude that these parameters are significant. The implication is that for every liter increase in engine capacity, the average two-seater car is expected to lose about 2 and 3/4 miles per gallon on the highway. (As before, we defer until later any comment on the terms in the last line of the MINITAB output.)
The model fit to the data is shown in Fig 16.7 along with the required 95% confidence interval (CI) and the 95% prediction interval (PI). Note how much wider the PI is than the CI at every value of x.

16.2.7 Coefficient of Determination and the F-Test

Beyond hypothesis tests to determine the significance of individual estimated parameters, other techniques exist for assessing the overall effectiveness of the regression model, based on measures of how much of the total variability in the data has been captured (or "explained") by the model.

Orthogonal Decomposition of Variability
The total variability present in the data, represented by \sum_{i=1}^{n}(y_i - \bar{y})^2, and defined as Syy in Eq (16.32), may be rearranged as follows, merely by adding and subtracting ŷi:

S_{yy} \equiv \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}\left[(y_i - \hat{y}_i) - (\bar{y} - \hat{y}_i)\right]^2   (16.88)

Upon expanding and simplifying (see Exercise 16.9), one obtains the very important expression:

\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2
or\quad S_{yy} = SS_R + SS_E   (16.89)

where we have recalled that the second term on the RHS of the equation is the residual error sum of squares defined in Eq (16.59), and have introduced the term SSR to represent the regression sum of squares,

SS_R = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2   (16.90)

a measure of the variability represented in the regression line's estimate of the mean response. The expression in Eq (16.89) represents a decomposition of the total variability in the data into two components: the variability captured by the regression model, SSR, and what is left in the residual error, SSE. In fact, we had actually encountered this expression earlier, in Eq (16.61), where what we now refer to as SSR had earlier been presented as θ̂1 Sxy. If the data vector is represented as y, the corresponding vector of regression model estimates as ŷ, and the vector of residual errors between these two as e = y − ŷ, then observe that

(\mathbf{y} - \bar{y}) = (\hat{\mathbf{y}} - \bar{y}) + \mathbf{e}   (16.91)

But from the definition of vector Euclidean norms,

\|\mathbf{y} - \bar{y}\|^2 = S_{yy}   (16.92)
\|\hat{\mathbf{y}} - \bar{y}\|^2 = SS_R   (16.93)
\|\mathbf{e}\|^2 = SS_E   (16.94)

with the very important implication that, as a result of the vector representation in Eq (16.91), the expression in Eq (16.89) is an orthogonal decomposition of the data variance vector reminiscent of Pythagoras' Theorem. (If the vector sum in Eq (16.91) holds simultaneously as the corresponding sums-of-squares expression in Eq (16.89), then the vector (ŷ − ȳ) must be orthogonal to the vector e.)
Eq (16.89) is in fact known as the analysis of variance (ANOVA) identity; and it plays a central role in statistical inference that transcends the restricted role observed here in regression analysis. We shall have cause to revisit this subject in our discussion of the design of experiments in upcoming chapters. For now, we use it to assess the effectiveness of the overall regression model (as a single entity purporting to represent the information contained in the data), first in the form of the coefficient of determination, and later as the basis for an F-test of significance. This latter exercise will constitute a preview of an upcoming, more general discussion of ANOVA.
R², The Coefficient of Determination
Let us now consider the ratio defined as:

R^2 = \frac{SS_R}{S_{yy}}   (16.95)

which represents the proportion of the total data variability (around the mean ȳ) that has been captured by the regression model; its complement,

1 - R^2 = SS_E / S_{yy}   (16.96)

is the portion left unexplained by the regression model. Observe that 0 ≤ R² ≤ 1, and that if a model adequately captures the relevant information contained in a data set, what will be left unexplained as random variation should be comparatively small, so that the R² value will be close to 1. Conversely, a value close to zero indicates a model that is inadequate in capturing the important variability present in the data. R² is therefore known as the coefficient of determination; it is a direct measure of the quality of fit provided by the regression model.
Although not directly relevant yet at this point (where we are still discussing the classical two-parameter model), it is possible to improve a model fit by introducing additional parameters. Under such circumstances, the improvement in R² may come at the expense of over-fitting (as discussed more fully later). A somewhat more judicious assessment of model adequacy requires adjusting the value of R² to reflect the number of parameters that have been used by the model to capture the variability.
By recasting the expression in Eq (16.95) in the equivalent form:

R^2 = 1 - \frac{SS_E}{S_{yy}}   (16.97)

rather than base the metric on the indicated absolute sums of squares, consider using the mean sums of squares instead. In other words, instead of the total residual error sum of squares, SSE, we employ the mean residual error sum of squares, SSE/(n−p), where p is the number of parameters in the model and n is the total number of experimental data points; also, instead of the total data sum of squares, Syy, we employ the data variance, Syy/(n−1). The resulting quantity, known as R²adj, and defined as:

R^2_{adj} = 1 - \frac{SS_E/(n-p)}{S_{yy}/(n-1)}   (16.98)

is similar to the coefficient of determination, R², but it is adjusted for the number of parameters contained in the model. It penalizes models that achieve decent values of R² via the use of an excessive number of parameters. Relatively high values of R² and R²adj that are also comparable in magnitude indicate a model that is quite adequate: the variability in the data has been captured adequately without using an excessive number of parameters.
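Both quantities are simple to compute once a model has been fitted. The following minimal Python sketch implements Eqs (16.97) and (16.98); the function name is our own illustrative choice:

import numpy as np

def r_squared(y, y_hat, p):
    """R-squared (Eq 16.97) and adjusted R-squared (Eq 16.98) for a
    fitted model with p parameters estimated from n observations."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)       # residual error sum of squares
    syy = np.sum((y - y.mean()) ** 2)    # total sum of squares
    r2 = 1 - sse / syy
    r2_adj = 1 - (sse / (n - p)) / (syy / (n - 1))
    return r2, r2_adj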
All software packages that carry out regression analysis routinely compute R² and R²adj, sometimes presented not as fractions (as indicated above) but multiplied by 100%. In fact, all the examples and fitted regression line plots encountered thus far in this chapter have shown these values (in percentage form), but we had to defer commenting on them until now. We are only now in a position for such a discussion.
In Figs 16.4, 16.5, 16.6 and 16.7, the value shown for S is the square root of the mean residual sum of squares, i.e., \sqrt{SS_E/(n-2)}, an estimate of the unknown data standard deviation, σ; this is accompanied by values for R² and R²adj. Thus, in the Density-Ethanol weight percent regression model (Fig 16.4), both R² and R²adj are reported as 99.8%, indicating a model that appears to have explained virtually all the variability in the data, with very little left by way of the residual error (as indicated by the very small value of S). The exact opposite is the case with the Cranial circumference versus Finger length regression model: R² is an incredibly low 4.7%, and the R²adj vanishes entirely (a perfect 0.00%), indicating that (a) the model has explained very little of the variability in the data, and (b) when penalized for the parameters employed in achieving even the less than 5% variability captured, the inadequacy of the model is seen to be total. The residual data standard deviation, S, is almost 2.5. With the Highway gas mileage versus Engine capacity regression model, the R² value is reasonably high at 86.8%, with an adjusted value of 86% that is essentially unchanged; the residual standard deviation of S = 1.95 is also reasonable. The indication is that while there is still some unexplained variability left, the regression model captures a significant amount of the variability in the data and provides a reasonable mathematical explanation of the information contained in the data.
F-test for Significance of Regression
Let us return to the ANOVA expression in Eq (16.89);

\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2
i.e.,\quad S_{yy} = SS_R + SS_E

and note the following:

1. The total sum of squares, S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2, has (n−1) degrees of freedom; the error sum of squares, SS_E = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, has (n−2) degrees of freedom; and the regression sum of squares, SS_R = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2, has 1 degree of freedom.

2. One informal way to confirm this fact is as follows: (i) of the n independent units of information in the raw data, yi; i = 1, 2, . . . , n, one degree of freedom is "tied up" in obtaining the average, ȳ, so that (yi − ȳ) will have (n−1) degrees of freedom left (i.e., there are now only (n−1) independent quantities in (yi − ȳ)); (ii) similarly, 2 degrees of freedom are "tied up" in obtaining the response estimate ŷ (via the two parameters, θ̂0 and θ̂1), so that (yi − ŷi) has (n−2) degrees of freedom left; and finally (iii) while ŷi ties up 2 degrees of freedom, ȳ ties up one, so that (ŷi − ȳ) has one degree of freedom left.

3. The implication is therefore that, in addition to representing a decomposition of variability, since it is also true that

(n - 1) = 1 + (n - 2)   (16.99)

Eq (16.89) also represents a concurrent decomposition of the degrees of freedom associated with each sum of squares.
Finally, from the following results (given without proof; e.g., Eq (16.62)),

E(SS_E) = (n-2)\sigma^2
E(SS_R) = \theta_1^2 S_{xx} + \sigma^2   (16.100)

we arrive at the following conclusions: Under the null hypothesis H0: θ1 = 0, these two equations suggest SSE/(n−2) and SSR/1 (respectively the error mean square, MSE, and the regression mean square, MSR) as two separate

TABLE 16.5: ANOVA Table for Testing Significance of Regression

Source of     Sum of    Degrees of    Mean       F
Variation     Squares   Freedom       Square
Regression    SS_R      1             MS_R       MS_R/MS_E
Error         SS_E      (n-2)         MS_E
Total         S_yy      (n-1)

and distinct estimators of σ². Furthermore, under the normality assumption for yi, the statistic

F = \frac{SS_R/1}{SS_E/(n-2)}   (16.101)

will possess an F(ν1, ν2) distribution, with ν1 = 1 and ν2 = (n−2), if H0 is true that θ1 = 0. However, if H0 is not true, then the numerator in Eq (16.101) will be inflated by the term θ1²Sxx, as indicated in Eq (16.100). Hence, at the significance level of α, we reject H0 (that the regression as a whole is not significant) when the actual computed statistic

f > f_\alpha(\nu_1, \nu_2)   (16.102)

where f_α(ν1, ν2) is the usual F(ν1, ν2)-distribution variate with upper tail area α. Equivalently, one computes the p-value associated with the computed f, as

p = P(F \geq f)   (16.103)

and rejects or fails to reject the null hypothesis on the basis of the actual p-value.
These results are typically presented in what is referred to as an ANOVA Table, as shown in Table 16.5. They are used to carry out F-tests for the significance of the entire regression model as a single entity; if the resulting p-value is low, we reject the null hypothesis and conclude that the regression is significant, i.e., the relationship implied by the regression model is meaningful. Alternatively, if the p-value exceeds a pre-specified threshold (say, 0.05), we fail to reject the null hypothesis and conclude that the regression model is not significant, i.e., that the implied relationship is purely random.
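The entries of Table 16.5 are easily computed from the data and the fitted values. The following is a minimal Python sketch of the F-test of Eqs (16.101) and (16.103); the function name and the numpy/scipy usage are our own illustrative choices:

import numpy as np
from scipy import stats

def anova_f_test(y, y_hat):
    """ANOVA F-test for significance of a simple linear regression
    (Table 16.5): returns the F statistic and its p-value."""
    n = len(y)
    ss_r = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
    ss_e = np.sum((y - y_hat) ** 2)          # residual error sum of squares
    f = (ss_r / 1) / (ss_e / (n - 2))        # Eq (16.101)
    p_value = stats.f.sf(f, 1, n - 2)        # P(F >= f), Eq (16.103)
    return f, p_value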
All computer programs that perform regression analysis produce such ANOVA tables. For example, the MINITAB output for Example 16.4 above (involving the regression model relating engine capacity to the highway mpg rating) includes the following ANOVA table.

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       1   402.17  402.17  105.58  0.000
Residual Error  16    60.94    3.81
Total           17   463.11

The indicated p-value of 0.000 implies that we must reject the null hypothesis and conclude that the regression model is significant.
On the other hand, the ANOVA table produced by MINITAB for the cranial circumference versus finger length regression problem of Example 16.3 is as shown below:

Analysis of Variance
Source          DF      SS     MS     F      P
Regression       1    4.259  4.259  0.68  0.422
Residual Error  14   87.259  6.233
Total           15   91.517

In this case, the p-value associated with the F-test is so high (0.422) that we fail to reject the null hypothesis and conclude that the regression is not significant.
Of course, these conclusions agree perfectly with our earlier conclusions concerning each of these problems.
In general, we tend to de-emphasize these ANOVA-based F-tests for significance of the regression. This is for the simple reason that they are coarse tests of the overall regression model, adding little or nothing to the individual t-tests presented earlier for each parameter. These individual parameter tests are preferred because they are finer-grained.
From this point on, we will no longer refer to these ANOVA tests of significance for regression. Nevertheless, these same concepts take center stage in Chapter 19, where they are central to the analysis of designed experiments.

16.2.8 Relation to the Correlation Coefficient

In Chapter 5, we defined the correlation coefficient between two jointly distributed random variables X and Y as:

\rho = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}   (16.104)

The sample version, obtained from data, known as the Pearson product-moment correlation coefficient, is:

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}   (16.105)

If we now recall the expressions in Eqs (16.31)-(16.33), we immediately obtain, in the context of regression analysis:

r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}   (16.106)

And now, in terms of the slope parameter estimate, we obtain from Eq (16.37), first that

r = \hat{\theta}_1 \sqrt{\frac{S_{xx}}{S_{yy}}}   (16.107)
an expression we shall return to shortly. For now, let us return to the expression for R² in Eq (16.97); if we introduce Eq (16.61) for SSE, we obtain

R^2 = 1 - \frac{S_{yy} - \hat{\theta}_1 S_{xy}}{S_{yy}} = \frac{\hat{\theta}_1 S_{xy}}{S_{yy}}   (16.108)

Upon introducing Eq (16.37) for θ̂1, we obtain the result that:

R^2 = \frac{(S_{xy})^2}{S_{xx} S_{yy}}   (16.109)

which, when compared with Eq (16.106), establishes the important result that R², the coefficient of determination, is the square of the sample correlation coefficient, r; i.e.,

R^2 = r^2   (16.110)
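The identity in Eq (16.110) is easy to confirm numerically. The following short Python sketch does so on synthetic data (the data values are illustrative, not from the text):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 20)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, x.size)

r = np.corrcoef(x, y)[0, 1]          # sample correlation, Eq (16.105)
th1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
th0 = y.mean() - th1 * x.mean()
y_hat = th0 + th1 * x
R2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)

print(R2, r**2)   # the two agree, illustrating Eq (16.110)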

16.2.9 Mean-Centered Model

As obtained previously, the estimated observation from the regression model is given by

\hat{y} = \hat{\theta}_0 + \hat{\theta}_1 x   (16.111)

and, from a rearrangement of Eq (16.38) used to estimate θ0, we obtain

\bar{y} = \hat{\theta}_0 + \hat{\theta}_1 \bar{x}   (16.112)

If we now subtract the latter equation from the former, we obtain:

(\hat{y} - \bar{y}) = \hat{\theta}_1 (x - \bar{x})   (16.113)

which is a mean-centered version of the regression model.
If we now rearrange Eq (16.107) to express θ̂1 in terms of r and introduce it into Eq (16.113), we obtain:

(\hat{y} - \bar{y}) = r \left(\frac{s_y}{s_x}\right)(x - \bar{x})   (16.114)

where s_y = \sqrt{S_{yy}/(n-1)} and s_x = \sqrt{S_{xx}/(n-1)} are, respectively, sample estimates of the data standard deviation for y and for x. Alternatively, Eq (16.114) could equivalently be written as

(\hat{y} - \bar{y}) = \sqrt{R^2}\sqrt{\frac{S_{yy}}{S_{xx}}}(x - \bar{x})   (16.115)

This equation provides the clearest indication of the impact of R² on how strongly the mean-centered value of the predictor, x, is connected by the model to (and hence can be used to estimate) the mean-centered response. Observe that in the density and weight percent ethanol example, with R² = 0.998, the connection between the predictor and response estimate is particularly strong; with the cranial circumference-finger length example, the connection is extremely weak, and the best estimate of the response (cranial circumference) for any value of the predictor (finger length) is essentially the mean value, ȳ.

16.2.10 Residual Analysis

While the statistical significance of the estimated parameters gives us some information about the usefulness of a model, and while the R² and R²adj values provide a measure of how much of the data's variability has been captured by the overall model, how well the model represents the data is most directly determined from the difference between the actual observation, yi, and the model estimate, ŷi, i.e., ei = yi − ŷi, i = 1, 2, . . . , n. These quantities, identified earlier as residual errors (or simply as residuals), provide n different samples of how closely the model matches the data. If the model's representation of the data is adequate, the residuals should be nothing but purely random variation. Any departure from pure random variation in the residuals is an indication of some form or the other of model inadequacy. Residual analysis therefore allows us to do the following:

1. Check model assumptions, specifically that ε ~ N(0, σ²), with σ² constant;
2. Identify any left-over systematic structural discrepancies between the model and data; and
3. Identify which data points might be inconsistent with the others in the set.

By formulation, the least squares estimation technique always produces a model for which \sum_{i=1}^{n} e_i = 0, making the desired zero-mean characteristic of the model error sequence a non-issue. Residual analysis is therefore concerned mostly with the following activities:

1. Formal and informal tests of normality of ei;
2. Graphical/visual evaluation of residual patterns;
3. Graphical/visual and numerical evaluation of individual residual magnitudes.

Other than formal normality tests (which involve techniques discussed in the next chapter), residual analysis is more of an art involving various graphical plots. When there is sufficient data, histograms of the residuals provide great visual clues regarding normality quite apart from what formal tests show. In

TABLE 16.6: Thermal conductivity measurements at various temperatures for a metal

k (W/m-°C)   Temperature (°C)
 93.228          100
 92.563          150
 99.409          200
101.590          250
111.535          300
115.874          350
119.390          400
126.615          450

many cases, however, available data is usually modest. Regardless of the data size, plots of the residuals themselves versus the fitted value, ŷ, or versus data order, or versus x, are not only capable of indicating model adequacy; they also provide clues about the nature of the implied model inadequacy.
It is often recommended that residual plots be based not on the residuals themselves, but on the standardized residuals,

e_i^* = \frac{e_i}{s_e}   (16.116)

where se, as we recall, is the estimate of σ, the data standard deviation. This is because if the residuals are truly normally distributed, then −2 < e_i^* < 2 for approximately 95% of the standardized residuals; and some 99% should lie between −3 and 3. As a general rule-of-thumb, therefore, |e_i^*| > 3 indicates a value that is considered a potential "outlier" because it is inconsistent with the others.
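This rule-of-thumb is simple to automate. The following minimal Python sketch computes the standardized residuals of Eq (16.116) and flags potential outliers; the function name is our own illustrative choice:

import numpy as np

def standardized_residuals(y, y_hat, p=2):
    """Standardized residuals e_i/s_e (Eq 16.116) for a model with
    p parameters; indices with |e*| > 3 are flagged as potential outliers."""
    e = y - y_hat
    s_e = np.sqrt(np.sum(e ** 2) / (len(y) - p))
    e_star = e / s_e
    return e_star, np.where(np.abs(e_star) > 3)[0]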
When a model appears inadequate, the recommendation is to look for clues
within the residuals for what to do next. The following examples illustrate
residual analysis for practical problems in engineering.
Example 16.5: TEMPERATURE DEPENDENCE OF THERMAL CONDUCTIVITY
To characterize how the thermal conductivity, k (W/m-°C), of a metal varies with temperature, eight independent experiments were performed at the temperatures, T °C, shown in Table 16.6, along with the measured thermal conductivities. A two-parameter model as in Eq (16.19) has been postulated for the relationship between k and T. Obtain a least-squares estimate of the parameters and evaluate the model fit to the data.
Solution:
We use MINITAB and obtain the following results:
Regression Analysis: k versus Temperature
The regression equation is
k = 79.6 + 0.102 Temperature

Predictor       Coef      SE Coef       T      P
Constant       79.555      2.192     36.29  0.000
Temperature   0.101710   0.007359    13.82  0.000

S = 2.38470  R-Sq = 97.0%  R-Sq(adj) = 96.4%


Therefore, as before, representing the thermal conductivity as y, and Temperature as x, the fitted regression line equation is

\hat{y} = 0.102x + 79.6   (16.117)

The p-values associated with each parameter are zero, implying that both parameters are significantly different from zero. The estimate of the data standard deviation is shown as S; and the R² and R²adj values indicate that the model captures a reasonable amount of the variability in the data.
The actual model fit to the data is shown in the top panel of Fig 16.8, while the standardized residuals versus fitted value, ŷi, are shown in the bottom panel. With only 8 data points, there are not enough residuals for a histogram plot. Nevertheless, upon visual examination of the residual plots, there appears to be no discernible pattern, nor is there any reason to believe that the residuals are anything but purely random. Note that no standardized residual value exceeds 2 in absolute value.
The model is therefore considered to provide a reasonable representation of how the thermal conductivity of this metal varies with temperature.

The next example illustrates a practical circumstance where the residuals not only expose the inadequacy of a linear regression model, but also provide clues concerning how to rectify the inadequacy.

Example 16.6: BOILING POINT OF HYDROCARBONS
It has been proposed to represent, with a linear two-parameter model, the relationship between the number of carbon atoms in the hydrocarbon compounds listed in Table 16.1 and the respective boiling points. Evaluate a least-squares fit of this model to the data.
Solution:
Using MINITAB produces the following results for this problem:

Regression Analysis: Boiling Point versus n
The regression equation is
Boiling Point = - 172.8 + 39.45 n

Predictor      Coef    SE Coef       T      P
Constant    -172.79     13.26    -13.03  0.000
n             39.452     2.625    15.03  0.000

S = 17.0142  R-Sq = 97.4%  R-Sq(adj) = 97.0%

FIGURE 16.8: Modeling the temperature dependence of thermal conductivity: Top: Fitted straight line, k = 79.56 + 0.1017 Temperature (S = 2.38470, R-Sq = 97.0%, R-Sq(adj) = 96.4%), to the Thermal conductivity (k) versus Temperature (T °C) data in Table 16.6; Bottom: standardized residuals versus fitted value, ŷi.

Therefore, as before, the fitted regression line equation is

\hat{y} = 39.45x - 172.8   (16.118)

with the hydrocarbon compound boiling point as y, and the number of carbon atoms it contains as x.
We notice, once again, that for this model, the parameter values are all significantly different from zero because the p-values are zero in each case; furthermore, the R² and R²adj values are quite good. The error standard deviation is obtained as S = 17.0142. By themselves, nothing in these results seems out of place; one might even be tempted by the excellent R² and R²adj values to declare that this is a very good model.
However, the model fit to the data, shown in the top panel of Fig 16.9, tells a different story; and the plot of standardized residuals versus fitted value, ŷi, shown in the bottom panel, is particularly revealing. The model fit shows a straight line model that very consistently overestimates BP at the extremes and underestimates it in the middle. The standardized residual versus model fit quantifies this under- and over-estimation and shows a clear left-over quadratic structure.
The implication clearly is that while approximately 97% of the relationship between n, the number of carbon atoms, and the hydrocarbon BP has been captured by a linear relationship, there remains an unexplained, possibly quadratic, component that is clearly discernible. The suggestion: consider a revised model of the type

Y = \theta_0 + \theta_1 x + \theta_2 x^2 + \epsilon   (16.119)

Such a model is a bit more complicated, but the residual structure seems to suggest that the additional term is warranted. How to obtain a model of this kind is discussed shortly.

We revisit the problem illustrated in this example after completing a discussion of more complicated regression models in the upcoming sections.

16.3 Intrinsically Linear Regression

16.3.1 Linearity in Regression Models

In estimating the vector of parameters θ contained in the general regression model,

Y = g(x; \theta) + \epsilon

it is important to clarify what the term "linear" in linear regression refers to. While it is true that the model:

Y = \theta_0 + \theta_1 x + \epsilon

FIGURE 16.9: Modeling the dependence of the boiling points (BP) of hydrocarbon compounds in Table 16.1 on the number of carbon atoms in the compound: Top: Fitted straight line, Boiling Point = - 172.8 + 39.45 n (S = 17.0142, R-Sq = 97.4%, R-Sq(adj) = 97.0%), of BP versus n, the number of carbon atoms; Bottom: standardized residuals versus fitted value, ŷi. Notice the distinctive quadratic structure left over in the residuals, exposing the linear model's over-estimation at the extremes and the under-estimation in the middle.


is a linear equation because it represents a straight line relationship between Y and x, what is actually of relevance in regression analysis is that this functional form is linear with respect to the unknown parameter vector θ = (θ0, θ1). It must be kept in mind in regression that x is known and given; the parameters are the unknowns to be determined by the regression exercise.
Thus, what is of importance in determining whether a regression problem is linear or not is the functional form of g(x; θ) with respect to the vector of parameters θ, not with respect to x. For example, if the regression model is given as

Y = \theta_1 x^n + \epsilon   (16.120)

clearly this is a nonlinear function of x; however, so long as n is known, for any given value of x, x^n is also known, say x^n = z; this equation is therefore exactly equivalent to

Y = \theta_1 z + \epsilon   (16.121)

which is clearly linear. Thus, even though nonlinear in x, Eq (16.120) nevertheless represents a linear regression problem because the model equation is linear in the unknown parameter θ1. On the other hand, the model representing how the concentration C(t) of a reactant undergoing a first order kinetic reaction in a batch reactor changes with time,

C(t) = \theta_0 e^{-\theta_1 t}   (16.122)

with time, t, as the independent variable, along with θ0 and θ1 respectively as the unknown initial concentration and kinetic reaction rate constant, represents a truly nonlinear regression model. This is because one of the unknown parameters, θ1, enters the model nonlinearly; the model is linear in the other parameter, θ0.
As far as regression is concerned, therefore, whether the problem at hand is linear or nonlinear depends on whether the parameter sensitivity function,

S_{\theta_i} = \frac{\partial g}{\partial \theta_i}   (16.123)

is a function of θi or not. For linear regression problems, S_{\theta_i} is independent of θi for all i; the defining characteristic of nonlinear regression is that S_{\theta_i} depends on θi for at least one i.
Example 16.7: LINEAR VERSUS NONLINEAR REGRESSION PROBLEMS
Which of the following three models presents a linear or nonlinear regression problem in estimating the indicated unknown parameters, θi?

(1)\quad Y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \epsilon   (16.124)
(2)\quad Y = \theta_1 \theta_2^x + \epsilon   (16.125)
(3)\quad Y = \theta_1 e^{x_1} + \theta_2 \ln x_2 + \frac{\theta_3}{x_3} + \epsilon   (16.126)

Solution:
Model (1) presents a linear regression problem because each of the sensitivities, S_{\theta_0} = 1; S_{\theta_1} = x; S_{\theta_2} = x^2; and S_{\theta_3} = x^3, is free of the unknown parameter on which it is based, i.e., S_{\theta_i} is not a function of θi for i = 0, 1, 2, 3. In fact, all the sensitivities are entirely free of all parameters.
Model (2), on the other hand, presents a nonlinear regression problem: S_{\theta_1} = \theta_2^x does not depend on θ1, but

S_{\theta_2} = \theta_1 x \theta_2^{x-1}   (16.127)

depends on θ2. Thus, while this model is linear in θ1 (because the sensitivity to θ1 does not depend on θ1), it is nonlinear in θ2; therefore, it presents a nonlinear regression problem.
Model (3) presents a linear regression problem: S_{\theta_1} = e^{x_1}; S_{\theta_2} = \ln x_2; and S_{\theta_3} = 1/x_3 are all entirely free of unknown parameters.

16.3.2 Variable Transformations

A restricted class of truly nonlinear regression models may be converted to linear models via appropriate variable transformations; linear regression analysis can then be carried out in terms of the transformed variables. For example, observe that even though the reactant concentration model in Eq (16.122) has been identified as nonlinear in the parameters, a logarithmic transformation results in:

\ln C(t) = \ln \theta_0 - \theta_1 t   (16.128)

In this case, observe that if we now let Y = ln C(t), and let θ0' = ln θ0, then Eq (16.128) represents a linear regression model.
Such cases abound in chemical engineering. For example, the equilibrium vapor mole fraction, y, as a function of the liquid mole fraction, x, of a compound with relative volatility α is given by the expression:

y = \frac{\alpha x}{1 + (\alpha - 1)x}   (16.129)

It is an easy exercise to show that by inverting this equation, we obtain:

\frac{1}{y} = \frac{1}{\alpha}\frac{1}{x} + \frac{(\alpha - 1)}{\alpha}   (16.130)

so that 1/y versus 1/x produces a linear regression problem.
Such models are said to be "intrinsically linear" because, while nonlinear in their original variables, they are linear in a different set of transformed variables; the task is to find the required transformation. Nevertheless, the careful reader would have noticed something missing from these model equations: we have carefully avoided introducing the error terms, ε. This is for the simple reason that, in virtually all cases, if the error term is additive, then even the obvious transformations are no longer possible. For example, if each actual concentration measurement, C(ti), observed at time ti has associated with it the additive error term εi, then Eq (16.122) must be rewritten as

C(t_i) = \theta_0 e^{-\theta_1 t_i} + \epsilon_i   (16.131)

and the logarithmic transformation is no longer possible.
Under such circumstances, most practitioners will suspend the addition of the error term until after the function has been appropriately transformed; i.e., instead of writing the model as in Eq (16.131), it would be written as:

\ln C(t_i) = \ln \theta_0 - \theta_1 t_i + \epsilon_i   (16.132)

But this implies that the error is multiplicative in the original variable. It is important, before taking such a step, to take time to consider whether such a multiplicative error structure is reasonable or not.
Thus, in employing transformations to deal with these so-called intrinsically linear models, the most important issue lies in determining the proper error structure. Such transformations should be used with care; alternatively, the parameter estimates obtained from such an exercise should be considered as approximations that may require further refinement by using more advanced nonlinear regression techniques. Notwithstanding, many engineering problems involving models of this kind have benefited from the sort of linearizing transformations discussed here.
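As an illustration of the transformation in Eq (16.132), the following minimal Python sketch fits the first-order decay model by linear regression on the log-transformed data. The data values here are synthetic and purely illustrative; note that the noise is deliberately generated multiplicatively, which is precisely the error structure under which the log transform is appropriate:

import numpy as np

# Synthetic first-order decay data (illustrative values, not from the text)
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
noise = np.random.default_rng(0).normal(0, 0.02, t.size)
C = 5.0 * np.exp(-0.8 * t) * np.exp(noise)   # multiplicative error

# Fit the transformed (linear) model ln C = ln(theta0) - theta1*t, Eq (16.132)
slope, intercept = np.polyfit(t, np.log(C), 1)
theta1_hat = -slope
theta0_hat = np.exp(intercept)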

16.4 Multiple Linear Regression

In many cases, the response variable Y depends on several independent variables, x1, x2, . . . , xm. Under these circumstances, the simplest possible regression model is:

Y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_m x_m + \epsilon   (16.133)

with m independent predictor variables, xi; i = 1, 2, . . . , m, and m+1 unknown parameters, θ = (θ0, θ1, . . . , θm), the "regression coefficients". Eq (16.133) represents a multiple linear regression model. An example is when Y, the conversion obtained from a catalytic process, depends on the temperature, x1, and pressure, x2, at which the reactor is operated, according to:

Y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \epsilon   (16.134)

Just as the expected value of the response in Eq (16.19) represents a straight line, the expected value in Eq (16.133) represents an m-dimensional hyperplane.
However, there is no reason to restrict the model to the form in Eq (16.133). With more than one independent variable, it is possible for the response variable to depend on higher order powers of, as well as interactions between, some of the variables: for example, a better representation of the relationship between yield and temperature and pressure might be:

Y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_{11} x_1^2 + \theta_{22} x_2^2 + \theta_{12} x_1 x_2 + \epsilon   (16.135)

Such models as these are often specifically referred to as "response surface models" and we shall have cause to revisit them later on in upcoming chapters. For now, we note that, in general, most multiple regression models are, for the most part, justified as approximations to the more general expression,

Y = g(x_1, x_2, \ldots, x_m; \theta) + \epsilon   (16.136)

where neither the true form of g(x1, x2, . . . , xm; θ), nor the parameters, θ, are known. If g(.) is analytic, then by taking a Taylor series expansion and truncating after a pre-specified number of terms, the result will be a multiple regression model. The multiple regression function is therefore often justified as a Taylor series approximation of an unknown, and perhaps more complex, function. For example, Eq (16.135) arises when the Taylor expansion only goes up to the second order.
Regardless of what form the regression model takes, keep in mind that, so long as the values of the independent variables are known, all such models can always be recast in the form shown in Eq (16.133). For example, even though there are two actual predictor variables, x1 and x2, in the model in Eq (16.135), if we define new variables x3 = x1²; x4 = x2²; x5 = x1x2, then Eq (16.135) immediately becomes like Eq (16.133), with m = 5. Thus, it is without loss of generality that we consider Eq (16.133) as the general multiple regression model.
Observe that the model in Eq (16.133) is a direct generalization of the two-parameter model in Eq (16.19); we should therefore expect the procedure for estimating the increased number of parameters to be similar to the procedure discussed earlier. While this is true in principle, the analysis is made more tractable by using matrix methods, as we now show.

16.4.1 General Least Squares

Obtaining estimates of the parameters, θi, i = 0, 1, 2, . . . , m, from data (yi; x1i, x2i, . . . , xmi) via the least-squares technique involves minimizing the sum-of-squares function,

S(\theta) = \sum_{i=1}^{n}\left[y_i - (\theta_0 + \theta_1 x_{1i} + \theta_2 x_{2i} + \cdots + \theta_m x_{mi})\right]^2   (16.137)

The technique calls for taking derivatives with respect to each parameter, setting the derivative to zero, and solving the resultant equations for the unknown parameters, i.e.,

\frac{\partial S}{\partial \theta_0} = -2\sum_{i=1}^{n}\left[y_i - (\theta_0 + \theta_1 x_{1i} + \theta_2 x_{2i} + \cdots + \theta_m x_{mi})\right] = 0   (16.138)

and, for 1 ≤ j ≤ m,

\frac{\partial S}{\partial \theta_j} = -2\sum_{i=1}^{n}x_{ji}\left[y_i - (\theta_0 + \theta_1 x_{1i} + \theta_2 x_{2i} + \cdots + \theta_m x_{mi})\right] = 0   (16.139)

These expressions rearrange to give the general linear regression "normal equations":

\sum_{i=1}^{n} y_i = \theta_0 n + \theta_1 \sum_{i=1}^{n} x_{1i} + \theta_2 \sum_{i=1}^{n} x_{2i} + \cdots + \theta_m \sum_{i=1}^{n} x_{mi}

\sum_{i=1}^{n} y_i x_{1i} = \theta_0 \sum_{i=1}^{n} x_{1i} + \theta_1 \sum_{i=1}^{n} x_{1i}^2 + \theta_2 \sum_{i=1}^{n} x_{2i}x_{1i} + \cdots + \theta_m \sum_{i=1}^{n} x_{mi}x_{1i}

\vdots

\sum_{i=1}^{n} y_i x_{ji} = \theta_0 \sum_{i=1}^{n} x_{ji} + \theta_1 \sum_{i=1}^{n} x_{1i}x_{ji} + \theta_2 \sum_{i=1}^{n} x_{2i}x_{ji} + \cdots + \theta_m \sum_{i=1}^{n} x_{mi}x_{ji}

m+1 linear equations to be solved simultaneously to produce the least-squares estimates for the m+1 unknown parameters. Even with a modest number of parameters, such problems are best solved using matrices.

16.4.2 Matrix Methods

For specific data sets, (yi; x1i, x2i, . . . , xmi); i = 1, 2, . . . , n, the multiple regression model equation in Eq (16.133) may be written as,

\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} =
\begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1m} \\
                1 & x_{21} & x_{22} & \cdots & x_{2m} \\
                \vdots & \vdots & \vdots & & \vdots \\
                1 & x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix}
\begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_m \end{bmatrix} +
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}   (16.140)

which may be consolidated into the explicit matrix form,

\mathbf{y} = \mathbf{X}\theta + \epsilon   (16.141)

where y is the n-dimensional vector of response observations, with X as the n × (m+1) matrix of values of the predictor variables used to generate the n observed responses; θ is the (m+1)-dimensional vector of unknown parameters, and ε is the n-dimensional vector of random errors associated with the response observations. Obtaining the least-squares estimate of the parameter vector θ involves the same principle of minimizing the sum of squares, which, this time, is given by

S(\theta) = (\mathbf{y} - \mathbf{X}\theta)^T(\mathbf{y} - \mathbf{X}\theta)   (16.142)

where the superscript T represents the vector or matrix transpose. Taking derivatives with respect to the parameter vector θ and setting the result to zero yields:

\frac{\partial S}{\partial \theta} = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\theta) = 0   (16.143)

resulting finally in the matrix form of the normal equations:

\mathbf{X}^T\mathbf{y} = \mathbf{X}^T\mathbf{X}\theta   (16.144)

(compare with the series of equations shown earlier). This matrix equation is easily solved for the unknown parameter vector to produce the least squares solution:

\hat{\theta} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}   (16.145)
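In practice, Eq (16.145) is rarely evaluated by forming the inverse explicitly; numerically stable solvers are preferred. The following minimal Python sketch (function name and numpy usage are our own illustrative choices) sets up the X matrix of Eq (16.140) and obtains θ̂:

import numpy as np

def fit_ols(X_raw, y):
    """Least-squares estimate theta_hat = (X^T X)^{-1} X^T y (Eq 16.145).
    X_raw holds the predictor columns; a column of ones is prepended
    for the intercept, as in Eq (16.140)."""
    n = len(y)
    X = np.column_stack([np.ones(n), X_raw])
    # np.linalg.lstsq solves the normal equations in a numerically
    # stable way (preferable to forming the matrix inverse explicitly)
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta_hat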

Properties of the Estimates

To characterize the estimates, we begin by introducing Eq (16.141) into Eq (16.145) for y to obtain:

\hat{\theta} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T(\mathbf{X}\theta + \epsilon)
            = \theta + \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\epsilon   (16.146)

We may now use this expression to obtain the mean and variance of these estimates as follows. First, by taking expectations, we obtain:

E(\hat{\theta}) = \theta + \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T E(\epsilon) = \theta   (16.147)

because X is known and E(ε) = 0. Thus, the least-squares estimate θ̂ as given in Eq (16.145) is unbiased for θ. As for the co-variance of the estimates: first, by definition, E(εεᵀ) = Σε is the random error covariance matrix; then, from the assumption that each εi is independent and identically distributed, with the same variance, σ², we have that

\Sigma_\epsilon = \sigma^2\mathbf{I}   (16.148)

where σ² is the variance associated with each random error, and I is the identity matrix. As a result,

Var(\hat{\theta}) = \Sigma_{\hat{\theta}} = E\left[(\hat{\theta} - \theta)(\hat{\theta} - \theta)^T\right]   (16.149)

which, from Eq (16.146), becomes

Var(\hat{\theta}) = E\left[\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\epsilon\epsilon^T\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\right]
 = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T E(\epsilon\epsilon^T)\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}
 = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\sigma^2\mathbf{I}\,\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}
 = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\sigma^2   (16.150)

Thus, the covariance matrix of the estimates is given by

\Sigma_{\hat{\theta}} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\sigma^2   (16.151)

As usual, σ² must be estimated from data. With ŷ, the model estimate of the response data vector y, now given by:

\hat{\mathbf{y}} = \mathbf{X}\hat{\theta}   (16.152)

the residual error vector, e, is therefore defined as:

\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}   (16.153)

so that the residual error sum-of-squares will be given by

SS_E = \mathbf{e}^T\mathbf{e}   (16.154)

It can be shown that, with p = m + 1 as the number of estimated parameters, the mean error sum-of-squares

s_e^2 = \frac{\mathbf{e}^T\mathbf{e}}{n - p}   (16.155)

is an unbiased estimate of σ².
Thus, following the typical normality assumption on the random error component of the regression model, we now conclude that the least-squares estimate vector, θ̂, has a multivariate normal distribution, MVN(θ, Σθ̂), with the covariance matrix as given in Eq (16.151). This fact is used to test hypotheses regarding the significance or otherwise of the parameters in precisely the same manner as before. The t-statistic arises directly from substituting the data estimate, se, for σ in Eq (16.151).
Thus, when cast in matrix form, the multiple regression problem is seen to be merely a higher dimensional form of the earlier simple linear regression problem; the model equations are structurally similar:

\mathbf{y} = \mathbf{X}\theta + \epsilon
y = \theta x + \epsilon

as are the least-squares solutions:

\hat{\theta} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}
\quad\text{or}\quad
\hat{\theta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} = \frac{S_{xy}}{S_{xx}}   (16.156)

The computations for multiple regression problems become rapidly more complex, but all the results obtained earlier for the simpler regression problem transfer virtually intact, including hypothesis tests of significance for the parameters, and the values for the coefficient of determination, R² (and its adjusted variant, R²adj), for assessing the model's adequacy in capturing the data information. Fortunately, these computations are routinely carried out very conveniently by computer software packages. Nevertheless, the reader is reminded that the availability of these computer programs has relieved us only of the computational burden; the task of understanding what these computations are based on remains very much an important responsibility of the practitioner.
Residuals Analysis
The residuals in multiple linear regression are given by Eq (16.153) above. Obtaining the standardized version of the residuals in this case requires the introduction of a new matrix, H, the so-called "hat" matrix. If we introduce the least-squares estimate into the expression for the vector of model estimates in Eq (16.152), we obtain:

\hat{\mathbf{y}} = \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{H}\mathbf{y}   (16.157)

where

\mathbf{H} = \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T   (16.158)

is called the "hat" matrix because it relates the actual observations vector, y, to the vector of model fits, ŷ. This matrix has some unique characteristics: for example, it is an idempotent matrix, meaning that

\mathbf{H}\mathbf{H} = \mathbf{H}   (16.159)

The residual vector, e = y − ŷ, may therefore be represented as:

\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y}   (16.160)

(The matrix (I − H) is also idempotent). If hii represents the diagonal elements of H, the standardized residuals are obtained for multiple regression problems

as:

e_i^* = \frac{e_i}{s_e \sqrt{1 - h_{ii}}}   (16.161)

These standardized residuals are the exact equivalents of the ones shown in Eq (16.116) for the simple linear regression case.

FIGURE 16.10: Catalytic process yield data of Table 16.7.
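Eq (16.161) is a direct computation once the hat matrix is available. The following minimal Python sketch (the function name is our own illustrative choice, and X is assumed to already contain the intercept column of ones) makes this concrete:

import numpy as np

def standardized_residuals_mlr(X, y, theta_hat):
    """Standardized residuals for multiple linear regression (Eq 16.161),
    using the diagonal of the hat matrix H = X (X^T X)^{-1} X^T."""
    n, p = X.shape                      # p = number of estimated parameters
    e = y - X @ theta_hat               # residual vector, Eq (16.153)
    s_e = np.sqrt(e @ e / (n - p))      # from Eq (16.155)
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)                      # leverages h_ii
    return e / (s_e * np.sqrt(1 - h))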
The next example illustrates an application of these results.

Example 16.8: QUANTIFYING TEMPERATURE AND PRESSURE EFFECTS ON YIELD
In an attempt to quantify the effect of temperature and pressure on the yield obtained from a laboratory scale catalytic process, the data shown in Table 16.7 were obtained from a series of designed experiments where temperature was varied over a relatively narrow range, from 85°C to 100°C, and pressure from 1.25 atmospheres to 2 atmospheres. If Yield is y, Temperature is x1 and Pressure is x2, obtain a regression model of the type in Eq (16.135) and evaluate the model fit.
Solution:
A 3-D scatter plot of the data is shown in Fig 16.10, where it appears as if the data truly fall on a plane.
The results from an analysis carried out using MINITAB are as follows:

Regression Analysis: Yield versus Temp, Pressure
The regression equation is
Yield = 75.9 + 0.0757 Temp + 3.21 Pressure

Predictor     Coef     SE Coef       T      P
Constant    75.866      2.924     25.95  0.000
Temp        0.07574    0.02977     2.54  0.017
Pressure    3.2120     0.5955      5.39  0.000

TABLE 16.7: Laboratory experimental data on Yield obtained from a catalytic process at various temperatures and pressures

Yield (%)   Temp (°C)   Pressure (Atm)
86.8284        85           1.25
87.4136        90           1.25
86.2096        95           1.25
87.8780       100           1.25
86.9892        85           1.50
86.8632        90           1.50
86.8389        95           1.50
88.0432       100           1.50
86.8420        85           1.75
89.3775        90           1.75
87.6432        95           1.75
90.0723       100           1.75
88.8353        85           2.00
88.4265        90           2.00
90.1930        95           2.00
89.0571       100           2.00
85.9974        85           1.25
86.1209        90           1.25
85.8819        95           1.25
88.4381       100           1.25
87.8307        85           1.50
89.2073        90           1.50
87.2984        95           1.50
88.5071       100           1.50
90.1824        85           1.75
86.8078        90           1.75
89.1249        95           1.75
88.7684       100           1.75
88.2137        85           2.00
88.2571        90           2.00
89.9551        95           2.00
90.8301       100           2.00

S = 0.941538  R-Sq = 55.1%  R-Sq(adj) = 52.0%

Thus, the fitted regression equation is, in this case,

\hat{y} = 75.9 + 0.0757x_1 + 3.21x_2   (16.162)

The p-values associated with all the parameters indicate significance; the estimate of the error standard deviation is as shown (0.94), with the R² and R²adj values indicating that the model's explanation of the variation in the data is only modest.
The fitted plane represented by Eq (16.162) and the standardized residual errors are shown in Fig 16.11. There is nothing unusual about the residuals, but the relatively modest values of R² and R²adj seem to suggest that the true model might be somewhat more complicated than the one we have postulated and fitted here; it could also mean that there is significant noise associated with the measurement, or both.

16.4.3 Some Important Special Cases

Weighted Least Squares

For a wide variety of reasons, ranging from xi variables with values that are orders of magnitude apart (e.g., if x1 is temperature in the 100s of degrees, while x2 is mole fraction, naturally scaled between 0 and 1), to measurement errors with non-constant variance-covariance structures, it is often necessary to modify the basic multiple linear regression problem by placing more or less weight on different observations. Under these circumstances, the regression model equation in Eq (16.141) is modified to:

\mathbf{W}\mathbf{y} = \mathbf{W}\mathbf{X}\theta + \mathbf{W}\epsilon   (16.163)

where W is an appropriately chosen (n × n) weighting matrix. Note that the pre-multiplication in Eq (16.163) has not changed the model itself; it merely allows a re-scaling of the X matrix and/or the error vector. However, the introduction of the weights does affect the solution to the least-squares optimization problem. It can be shown that, in this case, the sum-of-squares

S(\theta) = (\mathbf{W}\mathbf{y} - \mathbf{W}\mathbf{X}\theta)^T(\mathbf{W}\mathbf{y} - \mathbf{W}\mathbf{X}\theta)   (16.164)

is minimized by

\hat{\theta}_{WLS} = \left(\mathbf{X}^T\mathbf{W}^T\mathbf{W}\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{W}^T\mathbf{W}\mathbf{y}   (16.165)

This is known as the weighted least-squares (WLS) estimate, and it is easy to establish that, regardless of the choice of W,

E(\hat{\theta}_{WLS}) = \theta   (16.166)


FIGURE 16.11: Catalytic process yield data of Table 16.7. Top: Fitted plane of Yield as a function of Temperature and Pressure; Bottom: standardized residuals versus fitted value, ŷi. Nothing appears unusual about these residuals.


so that this is also an unbiased estimate of θ.
In cases where the motivation for introducing weights arises from the structure of the error covariance matrix, Σε, it is recommended that W be chosen such that

\mathbf{W}^T\mathbf{W} = \Sigma_\epsilon^{-1}   (16.167)

Under these circumstances, the covariance matrix of the WLS estimate can be shown to be given by:

\Sigma_{\hat{\theta}_{WLS}} = \left(\mathbf{X}^T\Sigma_\epsilon^{-1}\mathbf{X}\right)^{-1}   (16.168)

making it comparable to Eq (16.151). All regression packages provide an option for carrying out WLS instead of ordinary least squares.
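For readers working outside such packages, a minimal Python sketch of the WLS estimate in Eq (16.165), with W chosen as in Eq (16.167), is as follows (the function name is our own illustrative choice):

import numpy as np

def fit_wls(X, y, sigma_e):
    """Weighted least-squares estimate (Eq 16.165), with W chosen so
    that W^T W equals the inverse error covariance matrix (Eq 16.167)."""
    W2 = np.linalg.inv(sigma_e)      # W^T W = Sigma_e^{-1}
    A = X.T @ W2 @ X
    b = X.T @ W2 @ y
    return np.linalg.solve(A, b)     # theta_hat_WLS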
Constrained Least Squares
Occasionally, one encounters regression problems in engineering where the model parameters are subject to a set of linear equality constraints. For example, in a blending problem where the unknown parameters, θ1, θ2, and θ3, to be estimated from experimental data, are the mole fractions of the three component materials, clearly

\theta_1 + \theta_2 + \theta_3 = 1   (16.169)

In general, such linear constraints are of the form:

\mathbf{L}\theta = \mathbf{v}   (16.170)

When subject to such constraints, obtaining the least squares estimate of the parameters in the model Eq (16.141) now requires attaching these constraint equations to the original problem of minimizing the sum-of-squares function S(θ). It can be shown that, when the standard tools of Lagrange multipliers are used to solve this constrained optimization problem, the solution is:

\hat{\theta}_{CLS} = \hat{\theta} + \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{L}^T\left[\mathbf{L}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{L}^T\right]^{-1}\left(\mathbf{v} - \mathbf{L}\hat{\theta}\right)   (16.171)

the constrained least squares (CLS) estimate, where θ̂ is the normal, unconstrained least-squares estimate in Eq (16.145).
If we define a "gain" matrix:

\Gamma = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{L}^T\left[\mathbf{L}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{L}^T\right]^{-1}   (16.172)

then Eq (16.171) may be rearranged to:

\hat{\theta}_{CLS} = \hat{\theta} + \Gamma\left(\mathbf{v} - \mathbf{L}\hat{\theta}\right)   (16.173)
or
\hat{\theta}_{CLS} = \Gamma\mathbf{v} + (\mathbf{I} - \Gamma\mathbf{L})\hat{\theta}   (16.174)

where the former (as in Eq (16.171)) emphasizes how the constraints provide a correction to the unconstrained estimate, and the latter emphasizes that θ̂CLS provides a compromise between the unconstrained estimates and the constraints.
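A minimal Python sketch of Eqs (16.171)-(16.173) follows; the function name is our own illustrative choice, with L of shape (q, m+1) and v of length q encoding q linear constraints:

import numpy as np

def fit_cls(X, y, L, v):
    """Constrained least squares (Eq 16.171): the ordinary LS estimate,
    corrected so that the equality constraints L theta = v hold exactly."""
    XtX_inv = np.linalg.inv(X.T @ X)
    theta_ols = XtX_inv @ X.T @ y
    gain = XtX_inv @ L.T @ np.linalg.inv(L @ XtX_inv @ L.T)   # Eq (16.172)
    return theta_ols + gain @ (v - L @ theta_ols)             # Eq (16.173)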

Ridge Regression
The ordinary least squares estimate, θ̂, given in Eq (16.145), will be unacceptable for ill-conditioned problems for which XᵀX is nearly singular, typically because the determinant, |XᵀX| ≈ 0. This will occur, for example, when there is near-linear dependence in some of the predictor variables, xi, or when some of the xi variables are orders of magnitude different from others, and the problem has not been re-scaled accordingly. The problem created by ill-conditioning manifests in the form of overly inflated values for the elements of the matrix inverse (XᵀX)⁻¹, as a result of the vanishingly small determinant. Consequently, the norm of the estimate vector, θ̂, will be too large, and the uncertainty associated with the estimates (see Eq (16.151)) will be unacceptably large.
One solution is to augment the original model equation as follows:

\begin{bmatrix} \mathbf{y} \\ \mathbf{0} \end{bmatrix} = \begin{bmatrix} \mathbf{X} \\ k\mathbf{I} \end{bmatrix}\theta + \epsilon   (16.175)

or,

\mathbf{y}_a = \mathbf{X}_a\theta + \epsilon   (16.176)

where 0 is an m-dimensional vector of zeros, k is a constant, and I is the identity matrix. Instead of minimizing the original sum-of-squares function, minimizing the sum of squares based on the augmented equation (16.176) results in the so-called ridge regression estimate:

\hat{\theta}_{RR} = \left(\mathbf{X}^T\mathbf{X} + k^2\mathbf{I}\right)^{-1}\mathbf{X}^T\mathbf{y}   (16.177)

As is evident from Eq (16.175), the purpose of the augmentation is to use the constant k to force the estimated parameters to be close to 0, preventing over-inflation. As the value chosen for k increases, the conditioning of the matrix (XᵀX + k²I) improves, reducing the otherwise inflated estimates θ̂. However, this improvement is achieved at the expense of introducing bias into the resulting estimate vector. Still, even though θ̂RR is biased, its variance is much better than that of the original θ̂. (See Hoerl (1962)¹, Hoerl and Kennard (1970a)², (1970b)³). Selecting an appropriate value of k is an art. (See Marquardt (1970)⁴).

¹Hoerl, A.E. (1962). Application of ridge analysis to regression problems, Chem. Eng. Prog., 55, 54-59.
(See Marquardt, (1970)4 ).

16.4.4

Recursive Least Squares

Problem Formulation
A case often arises in engineering where the experimental data used to estimate parameters in the model in Eq (16.141) are available sequentially. After
accumulating a set of n observations, yi ; i = 1, 2, . . . , n, and, subsequently
using this n-dimensional data vector, yn , to obtain the parameter estimates
as:

1
n = XT X
XT y
(16.178)

suppose that a new, single observation, yn+1, then becomes available. These new data can be combined with the past information to obtain an updated estimate that reflects the additional information about the parameter contained in the data, information represented by:

y_{n+1} = \mathbf{x}_{n+1}^T\theta + \epsilon_{n+1}   (16.179)

where x_{n+1}^T is the vector of the independent predictor variables used to generate this new, (n+1)th observation, and ε_{n+1} is the associated random error component.
In principle, we can append this new information to the old to obtain:

\begin{bmatrix} \mathbf{y}_n \\ y_{n+1} \end{bmatrix} = \begin{bmatrix} \mathbf{X} \\ \mathbf{x}_{n+1}^T \end{bmatrix}\theta + \begin{bmatrix} \epsilon_n \\ \epsilon_{n+1} \end{bmatrix}   (16.180)

or, more compactly,

\mathbf{y}_{n+1} = \mathbf{X}_{n+1}\theta + \epsilon_{n+1}   (16.181)

so that the new X matrix, X_{n+1}, now has (n+1) rows, with the data vector y_{n+1} now of dimension (n+1). From here, we can use these new matrices and vectors to obtain the new least-squares estimate directly, as before, to give:

\hat{\theta}_{n+1} = \left(\mathbf{X}_{n+1}^T\mathbf{X}_{n+1}\right)^{-1}\mathbf{X}_{n+1}^T\mathbf{y}_{n+1}   (16.182)

Again, in principle, we can repeat this exercise each time a new observation becomes available. However, such a strategy requires that we recompute the
²Hoerl, A.E., and R.W. Kennard, (1970). Ridge regression. Biased estimation for nonorthogonal problems, Technometrics, 12, 55-67.
³Hoerl, A.E., and R.W. Kennard, (1970). Ridge regression. Applications to nonorthogonal problems, Technometrics, 12, 69-82.
⁴Marquardt, D.W., (1970). Generalized inverses, Ridge regression, Biased linear estimation, and Nonlinear estimation, Technometrics, 12, 591-612.


estimates from scratch every time, as if for the first time. While it is true that the indicated computational burden is routinely borne nowadays by computers, the fact that the information is available recursively raises a fundamental question: instead of having to recompute the estimate θ̂n+1 all over again, as in Eq (16.182), every time new information is available, is it possible to determine it by judiciously updating θ̂n directly with the new information? The answer is provided by the recursive least-squares technique, whereby θ̂n+1 is obtained recursively as a function of θ̂n.


Recursive Least Squares Estimation
n+1 , directly from the
We begin by obtaining the least-squares estimate,
partitioned matrices in Eq (16.180), giving the result
(1 '
(
'
n+1 = XT X + xn+1 xT
XT yn + xn+1 yn+1

n+1
Now, let us dene

(1
'
Pn+1 = XT X + xn+1 xTn+1

(16.183)

(16.184)

so that by analogy with Eq (16.178),



1
Pn = XT X
;
then,

(16.185)

T
T
P1
n+1 = X X + xn+1 xn+1

From here, on the one hand, by simple rearrangement,


T
XT X = P1
n+1 xn+1 xn+1

(16.186)

and on the other, using Eqn (16.185),


1
T
P1
n+1 = Pn + xn+1 xn+1

(16.187)

Now, as a result of Eq (16.184), the least-squares estimate in Eq (16.183)


may be written as:
'
(
n+1 = Pn+1 XT y + xn+1 yn+1

n
= Pn+1 XT yn + Pn+1 xn+1 yn+1

(16.188)

Returning briey to Eq (16.178) from which


n = XT y ,
(XT X)
n
upon introducing Eq (16.186), we obtain

 1
n = XT y
Pn+1 xn+1 xTn+1
n

(16.189)

700

Random Phenomena

Introducing this into Eq (16.188) yields

$$\hat{\theta}_{n+1} = \mathbf{P}_{n+1}\left(\mathbf{P}_{n+1}^{-1} - \mathbf{x}_{n+1}\mathbf{x}_{n+1}^T\right)\hat{\theta}_n + \mathbf{P}_{n+1}\mathbf{x}_{n+1}y_{n+1}$$

or,

$$\hat{\theta}_{n+1} = \hat{\theta}_n + \mathbf{P}_{n+1}\mathbf{x}_{n+1}\left(y_{n+1} - \mathbf{x}_{n+1}^T\hat{\theta}_n\right) \quad (16.190)$$

This last equation is the required recursive expression for determining $\hat{\theta}_{n+1}$ from $\hat{\theta}_n$ and new response data, $y_{n+1}$, generated with the new values $\mathbf{x}_{n+1}^T$ for the predictor variables. The gain matrix, $\mathbf{P}_{n+1}$, is itself generated recursively from Eq (16.187). And now, several points worth noting:

1. The matrix $\mathbf{P}_n$ is related to the covariance matrix of the estimate, $\hat{\theta}_n$ (see Eq (16.151)), so that Eq (16.187) represents the recursive evolution of the covariance of the estimates as $n$ increases;

2. The term in parentheses in Eq (16.190) resembles a correction term: the discrepancy between the actual observed response, $y_{n+1}$, and an a priori value predicted for it (before the new data is available) using the previous estimate, $\hat{\theta}_n$, and the new predictor variables, $\mathbf{x}_{n+1}^T$;

3. This recursive procedure allows us to begin with an initial estimate, $\hat{\theta}_0$, along with a corresponding (scaled) covariance matrix, $\mathbf{P}_0$, and proceed recursively to estimate the true value of the parameters one data point at a time, using Eq (16.187) first to obtain an updated covariance matrix, and Eq (16.190) to update the parameter estimate;

4. Readers familiar with Kalman filtering in dynamical systems theory will immediately recognize the structural similarity between the combination of Eqs (16.187) and (16.190) and the discrete Kalman filter.
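To make the recursion concrete, here is a minimal sketch in Python with NumPy (an illustrative addition, not part of the original text; the data are synthetic). It implements Eqs (16.187) and (16.190), updating $\mathbf{P}_{n+1}$ directly via the Sherman-Morrison rank-one formula, which is algebraically equivalent to inverting Eq (16.187):

```python
import numpy as np

def rls_update(theta, P, x_new, y_new):
    """One recursive least-squares step.

    theta : current estimate, shape (p,);  P : current gain matrix, shape (p, p)
    x_new : new predictor vector x_{n+1}, shape (p,);  y_new : new scalar response
    """
    # Update P via the Sherman-Morrison rank-one formula; this is the
    # direct-inverse form of Eq (16.187): P_{n+1}^{-1} = P_n^{-1} + x x^T.
    Px = P @ x_new
    P_new = P - np.outer(Px, Px) / (1.0 + x_new @ Px)
    # Eq (16.190): correct the old estimate with the prediction error,
    # y_{n+1} - x_{n+1}^T theta_n, scaled by the gain P_{n+1} x_{n+1}.
    theta_new = theta + P_new @ x_new * (y_new - x_new @ theta)
    return theta_new, P_new

# Usage sketch: batch fit on 10 synthetic points, then absorb one more point.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(10), rng.uniform(0.0, 5.0, 10)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.1, 10)
P = np.linalg.inv(X.T @ X)        # Eq (16.185)
theta = P @ X.T @ y               # batch least-squares estimate
theta, P = rls_update(theta, P, np.array([1.0, 6.0]), 13.1)
```

In this form each new observation is absorbed in $O(p^2)$ operations, instead of the $O(p^3)$ cost of recomputing Eq (16.182) from scratch.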

16.5 Polynomial Regression

16.5.1 General Considerations

A special case of multiple linear regression occurs when, in Eq (16.133), the response Y depends on powers of a single predictor variable, x, i.e.,

$$Y = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_m x^m + \epsilon \quad (16.191)$$

in which case $x_i = x^i$. An example was given earlier in Eq (16.135), with Y as a quadratic function of x.
This class of regression models is important because in engineering, many unknown functional relationships, y(x), can be approximated by polynomials. Because Eq (16.191) is a special case of Eq (16.133), all the results obtained earlier for the more general problem transfer directly, and there is not much to add for this restricted class of problems. However, in terms of practical application, there are some peculiarities unique to polynomial regression analysis.

In many practical problems, the starting point in polynomial regression is often a low-order linear model; when residual analysis indicates that the simple model is inadequate, the model complexity is then increased, typically by adding the next higher power of x, until the model is deemed adequate. But one must be careful: fitting an mth order polynomial to m + 1 data points (e.g., fitting a straight line to 2 points) will produce a perfect R² = 1, but the parameter estimates will be unreliable. The primary pitfall to avoid in such an exercise is therefore overfitting, whereby the polynomial model is of an order higher than can be realistically supported by the data. Under such circumstances, the improvement in R² must be cross-checked against the corresponding R²_adj value.
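As a concrete illustration of this cross-check, the following sketch (Python with NumPy; the helper function and the synthetic, truly quadratic data are illustrative additions, not from the text) fits polynomials of increasing order and reports both statistics. R² necessarily increases with every added term, but R²_adj stops improving once the added terms are no longer warranted:

```python
import numpy as np

def fit_and_score(x, y, order):
    """Fit a polynomial of the given order; return (R^2, adjusted R^2)."""
    n = len(y)
    coeffs = np.polyfit(x, y, order)
    sse = np.sum((y - np.polyval(coeffs, x))**2)   # residual sum of squares
    syy = np.sum((y - y.mean())**2)                # total sum of squares
    r2 = 1.0 - sse / syy
    r2_adj = 1.0 - (sse / (n - order - 1)) / (syy / (n - 1))
    return r2, r2_adj

# Synthetic data generated from a quadratic plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 12)
y = 1.0 + 2.0*x - 0.5*x**2 + rng.normal(0.0, 0.2, x.size)
for m in (1, 2, 3, 4):
    print(m, fit_and_score(x, y, m))
```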
The next examples illustrate the application of polynomial regression.
Example 16.9: BOILING POINT OF HYDROCARBONS: REVISITED
In Example 16.6, a linear two-parameter model was postulated for the relationship between the number of carbon atoms in the hydrocarbon compounds listed in Table 16.1 and the respective boiling points. Upon evaluation, however, the model was found to be inadequate; specifically, the residuals indicated the potential for a left-over quadratic component. Postulate the following quadratic model,

$$Y = \theta_0 + \theta_1 x + \theta_2 x^2 + \epsilon \quad (16.192)$$

and evaluate a least-squares fit of this model to the data. Compare this model fit to the simple linear model obtained in Example 16.6.

Solution:
Once more, when we use MINITAB for this problem, we obtain the following results:

Regression Analysis: Boiling Point versus n, n2
The regression equation is
Boiling Point = - 218 + 66.7 n - 3.02 n2

Predictor    Coef       SE Coef   T        P
Constant     -218.143   8.842     -24.67   0.000
n            66.667     4.508     14.79    0.000
n2           -3.0238    0.4889    -6.18    0.002

S = 6.33734   R-Sq = 99.7%   R-Sq(adj) = 99.6%

Thus, the fitted quadratic regression equation is

$$\hat{y} = -218.14 + 66.67x - 3.02x^2 \quad (16.193)$$

where, as before, y is the hydrocarbon compound boiling point, and x is the number of carbon atoms it contains.

Note how the estimates for $\theta_0$ and $\theta_1$ are now different from the respective values obtained when these were the only two parameters in the model. This is a natural consequence of adding a new component to the model; the responsibility for capturing the variability in the data is now being shared by three parameters instead of two, and the best estimates of the model parameter set will change accordingly.

Before inspecting the model fit and the residuals, we note first that the three parameters in this case also are all significantly different from zero (the p-values are zero for the constant term and the linear term, and 0.002 for the quadratic coefficient). As expected, there is an improvement in R² for this more complicated model (99.7% versus 97.4% for the simpler linear model); furthermore, this improvement was also accompanied by a commensurate improvement in R²_adj (99.6% versus 97.0% for the simpler model). Thus, the improved model performance was not achieved at the expense of overfitting, indicating that the added quadratic term is truly warranted. The error standard deviation also shows an almost three-fold improvement, from S = 17.0142 for the linear model to S = 6.33734, again indicating that more of the variability in the data has been captured by the more complicated model.

The model fit to the data, shown in the top panel of Fig 16.12, indicates a much-improved fit compared with the one in the top panel of Fig 16.9. This is also consistent with everything we have noted so far. However, the plot of normalized residuals versus fitted values in the bottom panel of Fig 16.12 shows that there is still some left-over structure, the improved fit notwithstanding. The implication is that perhaps an additional cubic term might be necessary to capture the remaining structural information still visible in the residual plot. This suggests further revising the model as follows:

$$Y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \epsilon \quad (16.194)$$
The problem of fitting an adequate regression model to the data in Table 16.1 concludes with this next example.
Example 16.10: BOILING POINT OF HYDROCARBONS: PART III
As a follow-up to the analysis in the last example, fit the cubic equation

$$Y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \epsilon \quad (16.195)$$

to the data in Table 16.1, evaluate the model fit, and compare it to the fit obtained in Example 16.9.

Solution:
This time around, the MINITAB results are as follows:

Regression Analysis: Boiling Point versus n, n2, n3
The regression equation is
Boiling Point = - 244 + 93.2 n - 9.98 n2 + 0.515 n3

Predictor    Coef       SE Coef   T        P
Constant     -243.643   8.095     -30.10   0.000
n            93.197     7.325     12.72    0.000
n2           -9.978     1.837     -5.43    0.006
n3           0.5152     0.1348    3.82     0.019

S = 3.28531   R-Sq = 99.9%   R-Sq(adj) = 99.9%



FIGURE 16.12: Modeling the dependence of the boiling points (BP) of hydrocarbon compounds in Table 16.1 on the number of carbon atoms in the compound. Top: fitted quadratic curve of BP versus n, the number of carbon atoms; Bottom: standardized residuals versus fitted value $\hat{y}_i$. Despite the good fit, the visible systematic structure still left over in the residuals suggests adding one more term to the model.



The fitted cubic regression equation is

$$\hat{y} = -243.64 + 93.20x - 9.98x^2 + 0.515x^3 \quad (16.196)$$

Note that the estimates for $\theta_0$ and $\theta_1$ have changed once again, as has the estimate for $\theta_2$. Again, this is a natural consequence of adding the new parameter, $\theta_3$, to the model.

On the basis of the p-values, we conclude, once again, that all four parameters in this model are significantly different from zero; the R² and R²_adj values are virtually perfect and identical, indicating that the expenditure of four parameters in this model is justified.

The error standard deviation has improved further by a factor of almost 2 (from S = 6.33734 for the quadratic model to S = 3.28531 for this cubic model), and the model fit to the data shows this improvement graphically in the top panel of Fig 16.13. This time, the residual plot in the bottom panel of Fig 16.13 shows no significant left-over structure. Therefore, in light of all the factors considered above, we conclude that the cubic fit in Eq (16.196) appears to provide an adequate fit to the data, and that this has been achieved without the expenditure of an excessive number of parameters.

A final word: while polynomial models may provide adequate representations of data (as in the last example), this should not be confused with a fundamental scientific explanation of the underlying relationship between x and Y. Also, it is not advisable to extrapolate the model prediction outside of the range covered by the data used to fit the model. For example, the model in this last example should not be used to predict the boiling point of hydrocarbons with 9 or more carbon atoms.

16.5.2 Orthogonal Polynomial Regression

Let $p_i(x)$; $i = 0, 1, 2, \ldots, m$, be a sequence of ith order polynomials in the single independent variable, x, defined on an interval $[x_L, x_R]$ on the real line. For our purposes here, these polynomials take on discrete values $p_i(x_k)$ at equally spaced values $x_k$; $k = 1, 2, \ldots, n$, in the noted interval. The sequence of polynomials constitutes an orthogonal set if the following conditions hold:

$$\sum_{k=1}^{n} p_i(x_k)p_j(x_k) = \begin{cases} \gamma_i^2 & i = j \\ 0 & i \neq j \end{cases} \quad (16.197)$$



FIGURE 16.13: Modeling the dependence of the boiling points (BP) of hydrocarbon compounds in Table 16.1 on the number of carbon atoms in the compound. Top: fitted cubic curve of BP versus n, the number of carbon atoms; Bottom: standardized residuals versus fitted value $\hat{y}_i$. There appears to be little or no systematic structure left in the residuals, suggesting that the cubic model provides an adequate description of the data.


An Example: Gram Polynomials

Without loss of generality, let the independent variable x be defined on the interval [-1, 1] (for variables defined on $[x_L, x_R]$, a simple scaling transformation is all that is required to obtain a corresponding variable defined instead on [-1, 1]); furthermore, let this interval be divided into n equal discrete intervals, $k = 1, 2, \ldots, n$, to provide n values of x at $x_1, x_2, \ldots, x_n$, such that $x_1 = -1$, $x_n = 1$, and in general,

$$x_k = \frac{2(k-1)}{n-1} - 1 \quad (16.198)$$

The set of Gram polynomials defined on [-1, 1] for $x_k$ as given above is:

$$\begin{aligned}
p_0(x_k) &= 1 \\
p_1(x_k) &= x_k \\
p_2(x_k) &= x_k^2 - \frac{(n+1)}{3(n-1)} \\
p_3(x_k) &= x_k^3 - \frac{(3n^2 - 7)}{5(n-1)^2}\,x_k \\
&\;\;\vdots \\
p_{\alpha+1}(x_k) &= x_k\,p_\alpha(x_k) - \frac{\alpha^2(n^2 - \alpha^2)}{(4\alpha^2 - 1)(n-1)^2}\,p_{\alpha-1}(x_k)
\end{aligned} \quad (16.199)$$

where each polynomial in the set is generated from the recurrence relation in Eq (16.199), given the initial two, $p_0(x_k) = 1$ and $p_1(x_k) = x_k$.
Example 16.11: ORTHOGONALITY OF GRAM POLYNOMIALS
Obtain the first four Gram polynomials determined at n = 5 equally spaced values, $x_k$, of the independent variable, x, on the interval $-1 \le x \le 1$. Show that these polynomials are mutually orthogonal.

Solution:
First, from Eq (16.198), the values $x_k$ at which the polynomials are to be determined on the interval $-1 \le x \le 1$ are:

$$x_1 = -1;\; x_2 = -0.5;\; x_3 = 0;\; x_4 = 0.5;\; x_5 = 1.$$

Next, let the 5-dimensional vector, $\mathbf{p}_i$; $i = 0, 1, 2, 3$, represent the values of the polynomial $p_i(x_k)$ determined at these five $x_k$ values, i.e.,

$$\mathbf{p}_i = \begin{pmatrix} p_i(x_1) \\ p_i(x_2) \\ \vdots \\ p_i(x_5) \end{pmatrix} \quad (16.200)$$


FIGURE 16.14: Gram polynomials evaluated at 5 discrete points, k = 1, 2, 3, 4, 5: $p_0$ is the constant; $p_1$, the straight line; $p_2$, the quadratic; and $p_3$, the cubic.
Then, from the expressions given above, we obtain, using n = 5:

$$\mathbf{p}_0 = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix};\quad
\mathbf{p}_1 = \begin{pmatrix} -1 \\ -0.5 \\ 0 \\ 0.5 \\ 1 \end{pmatrix};\quad
\mathbf{p}_2 = \begin{pmatrix} 0.50 \\ -0.25 \\ -0.50 \\ -0.25 \\ 0.50 \end{pmatrix};\quad
\mathbf{p}_3 = \begin{pmatrix} -0.15 \\ 0.30 \\ 0.00 \\ -0.30 \\ 0.15 \end{pmatrix}$$

A plot of these values is shown in Fig 16.14, where we see that $p_0(x_k)$ is a constant, $p_1(x_k)$ is a straight line, $p_2(x_k)$ is a quadratic, and $p_3(x_k)$ is a cubic, each one evaluated at the indicated discrete points.

To establish orthogonality, we compute the inner products, $\mathbf{p}_i^T\mathbf{p}_j$, for all combinations of $i \neq j$. First, we note that $\mathbf{p}_0^T\mathbf{p}_j$ is simply the sum of all the elements in each vector, which is uniformly zero in all cases, i.e.,

$$\mathbf{p}_0^T\mathbf{p}_j = \sum_{k=1}^{5} p_j(x_k) = 0 \quad (16.201)$$

Next, we obtain:

$$\mathbf{p}_1^T\mathbf{p}_2 = \sum_{k=1}^{5} p_1(x_k)p_2(x_k) = -0.500 + 0.125 + 0.000 - 0.125 + 0.500 = 0$$

$$\mathbf{p}_1^T\mathbf{p}_3 = \sum_{k=1}^{5} p_1(x_k)p_3(x_k) = 0.15 - 0.15 + 0.00 - 0.15 + 0.15 = 0$$

$$\mathbf{p}_2^T\mathbf{p}_3 = \sum_{k=1}^{5} p_2(x_k)p_3(x_k) = -0.075 - 0.075 + 0.000 + 0.075 + 0.075 = 0$$

For completeness, the sums of squares, $\gamma_i^2$, are obtained below (note the monotonic decrease):

$$\gamma_0^2 = \mathbf{p}_0^T\mathbf{p}_0 = \sum_{k=1}^{5} p_0(x_k)p_0(x_k) = 5$$

$$\gamma_1^2 = \mathbf{p}_1^T\mathbf{p}_1 = \sum_{k=1}^{5} p_1(x_k)p_1(x_k) = 2.5$$

$$\gamma_2^2 = \mathbf{p}_2^T\mathbf{p}_2 = \sum_{k=1}^{5} p_2(x_k)p_2(x_k) = 0.875$$

$$\gamma_3^2 = \mathbf{p}_3^T\mathbf{p}_3 = \sum_{k=1}^{5} p_3(x_k)p_3(x_k) = 0.225$$
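The computations in this example are easy to mechanize. The following sketch (Python with NumPy; an illustrative addition, not part of the original example) generates the Gram polynomials from the recurrence in Eq (16.199) and confirms numerically that $\mathbf{P}^T\mathbf{P}$ is diagonal, with diagonal entries equal to the $\gamma_i^2$ values just computed:

```python
import numpy as np

def gram_polynomials(n, m):
    """Columns are p_0, ..., p_m of Eq (16.199), evaluated at the n
    equally spaced points x_k of Eq (16.198) on [-1, 1]; requires m >= 1."""
    k = np.arange(1, n + 1)
    x = 2.0*(k - 1)/(n - 1) - 1.0            # Eq (16.198)
    P = np.zeros((n, m + 1))
    P[:, 0] = 1.0                            # p0(x_k) = 1
    P[:, 1] = x                              # p1(x_k) = x_k
    for a in range(1, m):                    # recurrence of Eq (16.199)
        c = a**2*(n**2 - a**2)/((4*a**2 - 1)*(n - 1)**2)
        P[:, a + 1] = x*P[:, a] - c*P[:, a - 1]
    return P

P = gram_polynomials(n=5, m=3)
print(np.round(P.T @ P, 10))   # diagonal entries: 5, 2.5, 0.875, 0.225
```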

Application in Regression

Among the many attractive properties possessed by orthogonal polynomials, the following is the most relevant to the current discussion:

Orthogonal Basis Function Expansion: Any mth order polynomial, U(x), can be expanded in terms of an orthogonal polynomial set, $p_0(x), p_1(x), \ldots, p_m(x)$, as the basis functions, i.e.,

$$U(x) = \sum_{i=0}^{m} \alpha_i p_i(x) \quad (16.202)$$

This result has some significant implications for polynomial regression involving the single independent variable, x. Observe that as a consequence of this result, the original mth order polynomial regression model in Eq (16.191) can be rewritten as

$$Y(x) = \alpha_0 p_0(x) + \alpha_1 p_1(x) + \alpha_2 p_2(x) + \cdots + \alpha_m p_m(x) + \epsilon \quad (16.203)$$

where we note that, given any specific set of orthogonal polynomial basis functions, the one-to-one relationship between the original parameters, $\theta_i$, and the new set, $\alpha_i$, is easily determined. Regression analysis is now concerned with estimating the new set of parameters, $\alpha_i$, instead of the old set, $\theta_i$, a task that is rendered dramatically easier by the orthogonality of the basis set, $p_i(x)$, as we now show.
Suppose that the data $y_i$; $i = 1, 2, \ldots, n$, have been acquired at equally spaced values $x_k$; $k = 1, 2, \ldots, n$, in the range $[x_L, x_R]$ over which the orthogonal polynomial set, $p_i(x)$, is defined. In this case, from Eq (16.203), we will have:

$$\begin{aligned}
y(x_1) &= \alpha_0 p_0(x_1) + \alpha_1 p_1(x_1) + \alpha_2 p_2(x_1) + \cdots + \alpha_m p_m(x_1) + \epsilon_1 \\
y(x_2) &= \alpha_0 p_0(x_2) + \alpha_1 p_1(x_2) + \alpha_2 p_2(x_2) + \cdots + \alpha_m p_m(x_2) + \epsilon_2 \\
&\;\;\vdots \\
y(x_n) &= \alpha_0 p_0(x_n) + \alpha_1 p_1(x_n) + \alpha_2 p_2(x_n) + \cdots + \alpha_m p_m(x_n) + \epsilon_n
\end{aligned}$$

which, when written in matrix form, becomes:

$$\mathbf{y} = \mathbf{P}\boldsymbol{\alpha} + \boldsymbol{\epsilon} \quad (16.204)$$

The matrix P consists of vectors of the orthogonal polynomials computed at the discrete values $x_k$, just as we showed in Example 16.11 for the Gram polynomials. The least-squares solution to this equation,

$$\hat{\boldsymbol{\alpha}} = \left(\mathbf{P}^T\mathbf{P}\right)^{-1}\mathbf{P}^T\mathbf{y}, \quad (16.205)$$

looks very much like what we have seen before, until we recall that, as a result of the orthogonality of the constituent elements of P, the matrix $\mathbf{P}^T\mathbf{P}$ is diagonal, with elements $\gamma_i^2$, because all the off-diagonal terms vanish identically (see Eq (16.197) and Example 16.11). As a result, the expression in Eq (16.205) is nothing but a collection of m + 1 isolated algebraic equations,

$$\hat{\alpha}_i = \frac{\sum_{k=1}^{n} p_i(x_k)y(x_k)}{\gamma_i^2} \quad (16.206)$$

where

$$\gamma_i^2 = \sum_{k=1}^{n} p_i(x_k)p_i(x_k); \quad i = 0, 1, \ldots, m. \quad (16.207)$$
This approach has several additional advantages beyond the dramatically simplified computation:

1. Each parameter estimate, $\hat{\alpha}_i$, is independent of the others, and its value remains unaffected by the order chosen for the polynomial model. In other words, after obtaining the first m parameter estimates for an mth order polynomial model, should we decide to increase the polynomial order to m + 1, the new parameter estimate, $\hat{\alpha}_{m+1}$, is simply obtained as

$$\hat{\alpha}_{m+1} = \frac{\sum_{k=1}^{n} p_{m+1}(x_k)y(x_k)}{\gamma_{m+1}^2} \quad (16.208)$$

using the very same data set $y(x_k)$, and only introducing $p_{m+1}(x_k)$, a precomputed vector of the (m+1)th polynomial. All the previously obtained values for $\hat{\alpha}_i$; $i = 1, 2, \ldots, m$, remain unchanged. This is very convenient indeed. Recall that this is not the case with regular polynomial regression (see Examples 16.9 and 16.10), where a change in the order of the polynomial model mandates a change in the values estimated for the new set of parameters.


2. From earlier results, we know that the variance associated with the estimates, $\hat{\alpha}_i$, is given by:

$$\boldsymbol{\Sigma}_{\hat{\alpha}} = \left(\mathbf{P}^T\mathbf{P}\right)^{-1}\sigma^2 \quad (16.209)$$

But, by virtue of the orthogonality of the elements of the P matrix, this reduces to:

$$\sigma^2_{\hat{\alpha}_i} = \frac{\sigma^2}{\gamma_i^2} \quad (16.210)$$

and since the value of the term $\gamma_i^2$, defined as in Eq (16.207), is determined strictly by the placement of the design points, $x_k$, where the data are obtained, Eq (16.210) indicates that this approach makes it possible to select experimental points so as to influence the variance of the estimated parameters favorably, with obvious implications for strategic design of experiments.

3. Finally, it can be shown that $\gamma_i^2$ decreases monotonically with i, indicating that the precision with which the coefficients of higher order polynomials are estimated worsens with increasing order. This is also true for regular polynomial regression, but it is not as obvious there.
An example of how orthogonal polynomial regression has been used in engineering applications may be found in Kristinsson and Dumont, 1993⁵ and 1996⁶.

⁵ K. Kristinsson and G. A. Dumont, "Paper Machine Cross Directional Basis Weight Control Using Gram Polynomials," Proceedings of the Second IEEE Conference on Control Applications, pp. 235-240, September 13-16, 1993, Vancouver, B.C.
⁶ K. Kristinsson and G. A. Dumont, "Cross-directional control on paper machines using Gram polynomials," Automatica, 32 (4), 533-548 (1996).
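To see Eqs (16.206)-(16.208) at work, here is a brief sketch (Python with NumPy; the helper functions and the synthetic data are illustrative additions, not from the text). It reuses the Gram-polynomial generator from Example 16.11 and shows that raising the model order from 2 to 3 leaves the lower-order coefficient estimates completely unchanged, exactly as noted in point 1 above:

```python
import numpy as np

def gram(n, m):
    """Gram polynomials of Eq (16.199) at the points of Eq (16.198); m >= 1."""
    x = 2.0*np.arange(n)/(n - 1) - 1.0
    P = np.ones((n, m + 1))
    P[:, 1] = x
    for a in range(1, m):
        c = a**2*(n**2 - a**2)/((4*a**2 - 1)*(n - 1)**2)
        P[:, a + 1] = x*P[:, a] - c*P[:, a - 1]
    return P

def alpha_hat(y, m):
    """Eq (16.206): each alpha_i = p_i^T y / gamma_i^2, computed independently."""
    P = gram(len(y), m)
    gamma2 = np.sum(P**2, axis=0)            # Eq (16.207)
    return (P.T @ y) / gamma2

# Synthetic data at 9 equally spaced points on [-1, 1].
rng = np.random.default_rng(2)
y = 0.5 + 3.0*np.linspace(-1.0, 1.0, 9) + rng.normal(0.0, 0.1, 9)
print(alpha_hat(y, 2))   # alpha_0, alpha_1, alpha_2
print(alpha_hat(y, 3))   # same first three entries, plus alpha_3
```

The variance of each estimated coefficient follows Eq (16.210): it is simply σ² divided by the corresponding entry of gamma2.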

16.6 Summary and Conclusions

The fitting of simple empirical mathematical expressions to data is an activity with which many engineers and scientists are very familiar, and in which they have perhaps been engaged since high school. This chapter has therefore been more or less a re-introduction of the reader to regression analysis, especially to the fundamental principles behind the mechanical computations that are now routinely carried out with computer software. We have shown regression analysis to be a direct extension of estimation to cases where the mean of the random variation in the observations is no longer constant (as in our earlier discussions) but now varies as a function of one or more independent variables. The primary problem in regression analysis is therefore the determination of the unknown parameters contained in the functional relationship (the regression model equation), given appropriate experimental data. The primary method discussed in this chapter for carrying out this task is the method of least squares. However, when regression analysis is cast as the probabilistic estimation problem that it truly is fundamentally, one can also employ the method of maximum likelihood to determine the unknown parameters. This requires the explicit specification of a probability distribution for the observed random variability, something not explicitly required by the method of least squares. Still, under the normal distribution assumption, maximum likelihood estimates of the regression model parameters coincide precisely with least squares estimates (see Exercises 16.5 and 16.6).
In addition to the familiar, we have also presented some results for specialized problems: for example, when variances are not uniform across observations (weighted least squares); when the parameters are not independent but are subject to (linear) constraints (constrained least squares); when the data matrix is poorly conditioned, perhaps because of collinearity (ridge regression); and when information is available sequentially (recursive least squares). Space constraints compel us to limit the illustration and application of these techniques to a handful of end-of-chapter exercises and application problems, which are highly recommended to the reader.
It bears repeating, in conclusion, that since all the computations required for regression analysis are now routinely carried out with the aid of computers, it is all the more important to concentrate on understanding the principles behind these computations, so that computer-generated results can be interpreted appropriately. In particular, first, the well-informed engineer should understand the implications of the following on the problem at hand:

• the results of hypothesis tests on the significance of estimated parameters;

• the R² and R²_adj values as measures of how much of the information contained in the data has been adequately explained by the regression model, and with the expenditure of how many significant parameters;

• the value computed for the standard error of the residuals.


These will always be computed by any regression analysis software as a matter
of course. Next, other quantities such as condence and prediction intervals,
and especially residuals, can be generated upon request. It is highly recommended that every regression analysis be accompanied by a thorough analysis
of the residuals as a matter of routine diagnostics. The principlesand
mechanicsbehind how the assumption (explicit or implicit) of the normality
of residuals are validated systematically and rigorously is discussed in the next
chapter as part of a broader discussion of probability model validation.


REVIEW QUESTIONS
1. In regression analysis, what is an independent variable and what is a dependent
variable?
2. In regression analysis as discussed in this chapter, which variable is deterministic
and which is random?
3. In regression analysis, what is a predictor and what is a response variable?
4. Regression analysis is concerned with what tasks?
5. What is the principle of least squares?
6. In simple linear regression, what is a one-parameter model; what is a two-parameter model?
7. What are the two main assumptions underlying regression analysis?
8. In simple linear regression, what are the normal equations and how do they arise?
9. In simple linear regression, under what conditions are the least squares estimates identical to the maximum likelihood estimates?
10. In regression analysis, what are residuals?
11. What does it mean that OLS estimators are unbiased?
12. Why is the confidence interval around the regression line curved? Where is the interval narrowest?
13. What does hypothesis testing entail in linear regression? What is H0 and what is Ha in this case?
14. What is the difference between using the regression line to estimate mean responses and using it to predict a new response?
15. Why are prediction intervals consistently wider than confidence intervals?
16. What is R² and what is its role in regression?
17. What is R²_adj and what differentiates it from R²?
18. Is an R² value close to 1 always indicative of a good regression model? By the same token, is an R²_adj value close to 1 always indicative of a good regression model?
19. In the context of simple linear regression, what is an F-test used for?
20. What is the connection between R², the coefficient of determination, and the correlation coefficient?
21. If a regression model represents a data set adequately, what should we expect of the residuals?
22. What does residual analysis allow us to do?
23. What activities are involved in residual analysis?
24. What are standardized residuals?
25. Why is it recommended for residual plots to be based on standardized residuals?
26. The term "linear" in linear regression refers to what?
27. As far as regression is concerned, how does one determine whether the problem is linear or nonlinear?
28. What is an intrinsically linear model?
29. In employing variable transformations to convert nonlinear regression problems to linear ones, what important issue should be taken into consideration?
30. What is multiple linear regression?
31. What is the hat matrix and what is its role in multiple linear regression?
32. What are some reasons for using weights in regression problems?
33. What is constrained least squares, and what class of problems requires this approach?
34. What is ridge regression and under what condition is it recommended?
35. What is the principle behind recursive least squares?
36. What is polynomial regression?
37. What is special about orthogonal polynomial regression?
38. What is the "orthogonal basis function expansion" result and what are its implications for polynomial regression?


EXERCISES
16.1 Given the one-parameter model,

$$y_i = \theta x_i + \epsilon_i$$

where $\{y_i\}_{i=1}^n$ is the specific sample data set, and $\epsilon_i$, the random error component, has zero mean and variance $\sigma^2$, it was shown in Eq (19.49) that the least squares estimate of the parameter $\theta$ is

$$\hat{\theta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$

(i) Show that this estimate is unbiased for $\theta$, i.e., $E(\hat{\theta}) = \theta$.
(ii) Determine $Var(\hat{\theta})$.


16.2 Consider the random sample, $Y_1, Y_2, \ldots, Y_n$, drawn from a population characterized by a single, constant parameter, $\theta$, the population mean, so that the random variable Y may then be written as:

$$Y_i = \theta + \epsilon_i$$

Determine $\hat{\theta}^*$, the weighted least squares estimate of $\theta$, by solving the minimization problem

$$\min_{\theta} S_w(\theta) = \sum_{i=1}^{n} W_i(Y_i - \theta)^2$$

and hence establish the results in Eqs (16.11) and (16.12).


16.3 For the one-parameter model,

$$y_i = \theta x_i + \epsilon_i$$

where $\{y_i\}_{i=1}^n$ is the specific sample data set, and $\epsilon_i$, the random error component, has zero mean and variance $\sigma^2$:
(i) Determine $\hat{\theta}^*$, the weighted least squares estimate of $\theta$, by solving the minimization problem

$$\min_{\theta} S_w(\theta) = \sum_{i=1}^{n} W_i\left[y_i - \theta x_i\right]^2$$

and compare it to the ordinary least squares estimate obtained in Eq (19.49).
(ii) Show that $E(\hat{\theta}^*) = \theta$.
(iii) Determine $Var(\hat{\theta}^*)$.
16.4 For the two-parameter model,

$$y_i = \theta_0 + \theta_1 x_i + \epsilon_i$$

where, as in Exercise 16.3, $\{y_i\}_{i=1}^n$ is the sample data, and $\epsilon_i$, the random error, has zero mean and variance $\sigma^2$:
(i) Determine $\hat{\boldsymbol{\theta}}^* = (\hat{\theta}_0^*, \hat{\theta}_1^*)$, the weighted least squares estimate of $\boldsymbol{\theta} = (\theta_0, \theta_1)$, by solving the minimization problem

$$\min_{\theta_0,\theta_1} S_w(\theta_0, \theta_1) = \sum_{i=1}^{n} W_i\left[y_i - (\theta_0 + \theta_1 x_i)\right]^2$$

and compare these to the ordinary least squares estimates obtained in Eqs (16.38) and (16.37).
(ii) Show that $E(\hat{\theta}_0^*) = \theta_0$ and $E(\hat{\theta}_1^*) = \theta_1$.
(iii) Determine $Var(\hat{\theta}_0^*)$ and $Var(\hat{\theta}_1^*)$.
16.5 Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from a Gaussian distribution with mean $\eta$ and variance $\sigma^2$; i.e., $Y \sim N\left(\eta(x;\theta), \sigma^2\right)$.
(i) For the one-parameter model where

$$\eta = \theta x$$

determine the maximum likelihood estimate of the parameter $\theta$ and compare it to the least squares estimate in Eq (19.49).
(ii) When the model is

$$\eta = \theta_0 + \theta_1 x$$

(the two-parameter model), determine the maximum likelihood estimates of the parameters $\theta_0$ and $\theta_1$; compare these to the corresponding least squares estimates in Eqs (16.38) and (16.37).
16.6 Let each individual observation, $Y_i$, be an independent sample from a Gaussian distribution with mean $\eta$ and variance $\sigma_i^2$, i.e., $Y_i \sim N\left(\eta(x;\theta), \sigma_i^2\right)$; $i = 1, 2, \ldots, n$; where the variances are not necessarily equal.
(i) Determine the maximum likelihood estimate of the parameter $\theta$ in the one-parameter model,

$$\eta = \theta x$$

and show that it has the form of a weighted least squares estimate. What are the weights?
(ii) Determine the maximum likelihood estimates of the parameters $\theta_0$ and $\theta_1$ in the two-parameter model,

$$\eta = \theta_0 + \theta_1 x$$

Show that these are also similar to weighted least squares estimates. What are the weights in this case?
16.7 From the definitions of the estimators for the parameters in the two-parameter model given in Eqs (16.46) and (16.47), i.e.,

$$\hat{\theta}_1 = \frac{S_{xy}}{S_{xx}}; \quad \hat{\theta}_0 = \bar{Y} - \hat{\theta}_1\bar{x}$$

obtain expressions for the respective variances of each estimator and hence establish the results given in Eqs (16.55) and (16.56).


16.8 A fairly common mistake in simple linear regression is the use of the one-parameter model (where the intercept is implicitly set to zero) in place of the more general two-parameter model, thereby losing the flexibility of estimating an intercept which may or may not be zero. When the true intercept in the data is non-zero, such a mistake will lead to an error in the estimated value of the single parameter, the slope θ, because the least squares criterion has no option but to honor the implicit constraint forcing the intercept to be zero. This will compromise the estimate of the true slope. The resulting estimation error may be quantified explicitly as follows.

First, show that the relationship between $\hat{\theta}_1$, the estimated slope in the two-parameter model, and $\hat{\theta}$, the estimated slope in the one-parameter model, is:

$$\hat{\theta}_1 = \frac{n\sum_{i=1}^{n}x_i^2}{n\sum_{i=1}^{n}x_i^2 - n^2\bar{x}^2}\,\hat{\theta} - \frac{n^2\,\bar{x}\,\bar{y}}{n\sum_{i=1}^{n}x_i^2 - n^2\bar{x}^2}$$

so that the two slopes are the same if and only if

$$\hat{\theta} = \frac{\bar{y}}{\bar{x}} \quad (16.211)$$

which will be the case when the intercept is truly zero. Next, show from here that

$$\hat{\theta} = \lambda\left(\frac{\bar{y}}{\bar{x}}\right) + (1 - \lambda)\,\hat{\theta}_1 \quad (16.212)$$

with $\lambda = n\bar{x}^2/\sum_{i=1}^{n}x_i^2$, indicating clearly the least squares compromise: a weighted average of $\hat{\theta}_1$ (the true slope when the intercept is not zero) and $\bar{y}/\bar{x}$ (the true slope when the intercept is actually zero).

Finally, show that the estimation error, $e = \hat{\theta} - \hat{\theta}_1$, will be given by:

$$e = \hat{\theta} - \hat{\theta}_1 = \lambda\left(\frac{\bar{y}}{\bar{x}} - \hat{\theta}_1\right) \quad (16.213)$$


16.9 By defining as $S_{yy}$ the total variability present in the data, i.e., $\sum_{i=1}^{n}(y_i - \bar{y})^2$ (see Eq (16.32)), and by rearranging this as follows:

$$S_{yy} = \sum_{i=1}^{n}\left[(y_i - \hat{y}_i) - (\bar{y} - \hat{y}_i)\right]^2$$

expand and simplify this expression to establish the important result in Eq (16.89),

$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

or $S_{yy} = SSR + SSE$.

16.10 Identify which of the following models presents a linear or a nonlinear regression problem in estimating the unknown parameters, $\theta_i$:

(i) $Y = \theta_0 + \theta_2 x^3 + \epsilon$

(ii) $Y = \theta_0 + \dfrac{\theta_1}{x} + \epsilon$

(iii) $Y = \theta_0 + \theta_1 e^x + \theta_2 \sin x + \theta_3 x + \epsilon$

(iv) $Y = \theta_0\, e^{\theta_2 x} + \epsilon$

(v) $Y = \theta_0\, x_1^{\theta_1} x_2^{\theta_2} + \epsilon$

16.11 The following models, sampled from various branches of science and engineering, are nonlinear in the unknown parameters. Convert each into a linear regression model; indicate an explicit relationship between the original parameters and the transformed ones.

(i) Antoine's equation: vapor pressure, $P^{vap}$, as a function of temperature, T:

$$P^{vap} = e^{\theta_0 - \frac{\theta_1}{T + \theta_2}}$$

(ii) Cellular growth rate (exponential phase): N, the number of cells in the culture, as a function of time, t:

$$N = \theta_0\, e^{\theta_1 t}$$

(iii) Kleiber's law of bioenergetics: resting energy expenditure, $Q_0$, as a function of an animal's mass, M:

$$Q_0 = \theta_0\, M^{\theta_1}$$

(iv) Gilliland-Sherwood correlation: mass transfer in falling liquid films in terms of Sh, the Sherwood number, as a function of two other dimensionless numbers, Re, the Reynolds number, and Sc, the Schmidt number:

$$Sh = \theta_0\, Re^{\theta_1} Sc^{\theta_2}$$
16.12 Establish that the hat matrix,

$$\mathbf{H} = \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T$$

is idempotent. Establish that $(\mathbf{I} - \mathbf{H})$ is also idempotent. As a result, establish that not only is $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$, but

$$\mathbf{H}\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$$

Similarly, not only is $\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{y}$, but

$$(\mathbf{I} - \mathbf{H})\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{y}$$

where e is the residual vector defined in Eq (16.160).
16.13 The angles between three celestial bodies that aligned to form a triangle in a plane at a particular point in time have been measured as $y_1 = 91°$, $y_2 = 58°$, and $y_3 = 33°$. Since the measurement device cannot determine these angles without error, the results do not add up to the 180° that they should; but arbitrarily forcing the numbers to add up to 180° is ad hoc and undesirable. Formulate the problem instead as a constrained least squares problem,

$$y_i = \theta_i + \epsilon_i; \quad i = 1, 2, 3,$$

subject to the constraint:

$$\theta_1 + \theta_2 + \theta_3 = 180°$$

and determine the least squares estimates of these angles. Confirm that these estimates add up to 180°.
16.14 When the data matrix X in the multiple regression equation

$$\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \boldsymbol{\epsilon}$$

is poorly conditioned, it was recommended in the text that the ridge regression estimate,

$$\hat{\boldsymbol{\theta}}_{RR} = \left(\mathbf{X}^T\mathbf{X} + k^2\mathbf{I}\right)^{-1}\mathbf{X}^T\mathbf{y}$$

might provide a more reliable estimate.
(i) Given that $E(\boldsymbol{\epsilon}) = \mathbf{0}$, show that $\hat{\boldsymbol{\theta}}_{RR}$ is biased for the parameter vector $\boldsymbol{\theta}$.
(ii) Given $E(\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T) = \boldsymbol{\Sigma} = \sigma^2\mathbf{I}$ as the error covariance matrix, determine the covariance matrix for the ridge regression estimate,

$$\boldsymbol{\Sigma}_{RR} = Var\left(\hat{\boldsymbol{\theta}}_{RR}\right) = E\left\{\left[\hat{\boldsymbol{\theta}}_{RR} - E\left(\hat{\boldsymbol{\theta}}_{RR}\right)\right]\left[\hat{\boldsymbol{\theta}}_{RR} - E\left(\hat{\boldsymbol{\theta}}_{RR}\right)\right]^T\right\}$$

and compare it to the covariance matrix for the regular least squares estimate shown in Eq (16.151).
16.15 An experimental investigation over a narrow range of the independent variable x yielded the following paired data: $(x_1 = 1.9, y_1 = 4.89)$; $(x_2 = 2.0, y_2 = 4.95)$; and $(x_3 = 2.1, y_3 = 5.15)$. It is postulated that the relationship between y and x is the two-parameter model

$$y = \theta_0 + \theta_1 x + \epsilon$$

(i) First, confirm that the true values of the unknown parameters are $\theta_0 = 1$ and $\theta_1 = 2$ by computing, for each indicated value of $x_i$, the theoretical response $\eta_i$ given by:

$$\eta_i = 1 + 2x_i$$

and confirming that, within the limits of experimental error, the predicted $\eta_i$ matches the corresponding experimental response, $y_i$.
(ii) Next, obtain the data matrix X and confirm that $\mathbf{X}^T\mathbf{X}$ is close to being singular by computing the determinant of this matrix.
(iii) Determine the least squares estimates of $\theta_0$ and $\theta_1$; compare these to the true values given in (i).
(iv) Next, compute three different sets of ridge regression estimates of these parameters using the values $k^2 = 2.0, 1.0,$ and 0.5. Compare these ridge regression estimates to the true values.
(v) Plot on the same graph the data, the regular least squares fit, and the best ridge regression fit.
16.16 Consider the data in the table below.

x        y
-1.00    -1.9029
-0.75    -0.2984
-0.50     0.4047
-0.25     0.5572
 0.00     0.9662
 0.25     2.0312
 0.50     3.2286
 0.75     5.7220
 1.00    10.0952

Fit a series of polynomial models of increasing complexity as follows. First fit the linear, two-parameter model,

$$y = \theta_0 + \theta_1 x + \epsilon$$

next fit the quadratic model,

$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \epsilon$$

then fit the cubic model,

$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \epsilon$$

and finally fit the quartic model,

$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \epsilon$$

In each case, note the values of the estimates for each parameter, and check the R² and R²_adj values and the significance of each model parameter. Plot each model fit to the data and select the most appropriate model.
16.17 Refer to the data table in Exercise 16.16. This time use Gram orthogonal polynomials with n = 7 to obtain linear, quadratic, cubic, and quartic fits to this data sequentially, as in Exercise 16.16. Compare the coefficients of the orthogonal polynomial fits among themselves, and then compare them to the coefficients obtained in Exercise 16.16.

APPLICATION PROBLEMS
16.18 A predictive model is sometimes evaluated by plotting its predictions directly against the corresponding experimental data in an (x, y) plot: if the model predicts the data adequately, a regression line fit should be a 45-degree line with slope 1 and intercept 0; the residuals should appear as a random sequence of numbers that are independent, normally distributed, and with zero mean and variance close to the measurement error variance. This technique is to be used to evaluate two models of multicomponent transport as follows.

In Kerkhof and Geboers (2005)⁷, the authors presented a new approach to modeling multicomponent transport that is purported to yield more accurate predictions than previously available models. The table below shows the experimentally determined viscosity (×10⁻⁵ Pa·s) of 12 different gas mixtures and the corresponding values predicted by two models: (i) the classical Hirschfelder-Curtiss-Bird (HCB) model⁸, and (ii) their new (KG) model.

⁷ Kerkhof, P.J.A.M., and M.A.M. Geboers, (2005). "Toward a unified theory of isotropic molecular transport phenomena," AIChE Journal, 51, (1), 79-121.
⁸ Hirschfelder, J.O., C.F. Curtiss, and R.B. Bird (1964). Molecular Theory of Gases and Liquids. 2nd printing, J. Wiley, New York, NY.

Viscosity (×10⁻⁵ Pa·s)

Experimental   HCB           KG
Data           Predictions   Predictions
2.740          2.718         2.736
2.569          2.562         2.575
2.411          2.429         2.432
2.504          2.500         2.512
3.237          3.205         3.233
3.044          3.025         3.050
2.886          2.895         2.910
2.957          2.938         2.965
3.790          3.752         3.792
3.574          3.551         3.582
3.415          3.425         3.439
3.470          3.449         3.476

(i) Treating the KG model prediction as the independent variable and the experimental data as the response variable, fit a two-parameter model and thoroughly evaluate the regression results: the parameter estimates, their significance, the R² and R²_adj values, and the residuals. Plot the regression line along with a 95% confidence interval around the regression line. In light of this regression analysis, provide your opinion about the authors' claim that their model provides an excellent agreement with the data.
(ii) Repeat (i) for the HCB model. In light of your results here and in (i), comment on whether or not the KG model can truly be said to provide better predictions than the HCB model.
16.19 In an attempt to quantify a possible relationship between the amount of fire damage caused by residential fires and the distance from the residence to the closest fire station, the following data were acquired from a random sample of 12 recent fires.

Distance from             Fire Damage
Fire Station, x (miles)   y ($10³)
1.8                       17.8
4.6                       31.3
0.7                       14.1
3.4                       26.2
2.3                       23.1
2.6                       19.6
5.5                       36.0
3.0                       22.3
4.3                       31.3
1.1                       17.3
3.1                       27.5
2.1                       24.0

(i) Postulate an appropriate model, estimate the model parameters, and evaluate the model fit.
(ii) An insurance company wishes to use this model to estimate the expected fire damage to two new houses: house A, being built at a distance of 5 miles from the nearest fire station, and house B, 3 miles from the same fire station. Determine these estimates along with appropriate uncertainty intervals.
(iii) Is it safe to use this model to predict the fire damage to a house C that is being built 6 miles from the nearest fire station? Regardless of your answer, provide a prediction and an appropriate uncertainty interval.
16.20 Refer to Problem 16.19. Now consider that three new residential fires have occurred and the following additional data set has become available at the same time.

Distance from             Fire Damage
Fire Station, x (miles)   y ($10³)
3.8                       26.1
4.8                       36.4
6.1                       43.2

(i) Use recursive least squares to adjust the previously obtained set of parameter estimates in light of this new information. By how much have the parameters changed?
(ii) Recalculate the estimated expected fire damage values for houses A and B in light of the new data; compare these values to the corresponding values obtained in Problem 16.19. Have these values changed by amounts that may be considered practically important?
(iii) With this new data, is it safe to use the updated model to predict the fire damage to house C? Predict this fire damage amount.
16.21 In Ogunnaike (2006)⁹, the data in the following table were used to characterize the extent of DNA damage, η, experienced by cells exposed to radiation of dose D (Gy), with the power-law relationship

$$\eta = \theta_0\, D^{\theta_1}$$

Radiation Dose   Extent of
D (Gy)           DNA Damage, η
0.00             0.05
0.30             1.30
2.50             2.10
10.0             3.10

(i) Using an appropriate variable transformation, estimate the unknown model parameters. Are both model parameters significant at the 95% level? What do the R² and R²_adj values suggest about the model fit to the data?
(ii) The extent-of-DNA-damage parameter, η, is in fact the Poisson random variable parameter, λ; in this case, it represents what amounts to the mean number of strand breaks experienced by a cell in a population of cells exposed to gamma radiation. If a cell experiences a total of 2n strand breaks (or n double-strand breaks), n pulses of the DNA-repair protein p53 will be observed in response. Use the model obtained in (i) to determine the probability that, upon exposure to radiation of 5 Gy, a cell undergoes DNA damage that will cause the cell's DNA-damage repair system to respond with 3 or more pulses of p53.

⁹ Ogunnaike, B.A. (2006). "Elucidating the digital control mechanism for DNA damage repair with the p53-Mdm2 system: Single cell data analysis and ensemble modeling," J. Roy. Soc. Interface, 3, 175-184.
16.22 Consider a manufacturing process for making pistons from metal ingots, in which each ingot produces enough material for 1,000 pistons. Occasionally, a piston cracks while cooling after being forged. Previous research has indicated that, in a batch of 1,000 pistons, the average number of pistons that develop a crack during cooling is dependent on the purity of the ingots. Ingots of known purity were forged into pistons, and the average number of cracked pistons per batch was recorded in the table below.

Purity, x                       0.94   0.95   0.96   0.97   0.98   0.99
Ave # of Cracked Pistons, y     4.8    4.6    3.9    3.3    2.7    2.0

Over the small range of purity, the dependence of y, the average number of cracked pistons per batch, on purity, x, may be assumed to be linear, i.e.,

$$y = \theta_1 x + \theta_0 \quad (16.214)$$

(i) Estimate the parameters $\theta_0$ and $\theta_1$.
(ii) A steel mill claims to have produced 100 ingots with a purity of 96% (i.e., x = 0.96), 5 of which are used to manufacture 1,000 pistons each. The number of cracked pistons from each ingot/batch is recorded in the table below.

Ingot #                 1   2   3   4   5
# of Cracked Pistons    4   6   3   6   6

Knowing the average number of cracked pistons per batch expected for ingots of 96% purity via Eq (16.214) and the parameters found in part (i), determine whether the steel mill's purity claim is reasonable, based on the sample of 5 ingots.
(iii) Repeat part (ii) assuming that 20 ingots from the mill were tested instead of 5, but that the mean and variance for the 20 ingots are the same as calculated for the sample of 5 ingots in part (ii).
16.23 The table below, first introduced in Chapter 12, shows city and highway gasoline mileage ratings, in miles per gallon (mpg), for 20 types of two-seater automobiles, complete with engine characteristics: capacity (in liters) and number of cylinders.
(i) Obtain appropriate regression models relating the number of cylinders (as x) to highway gas mileage and to city gas mileage (as y). At the 95% significance level, is there a difference between the parameters of these different models?
(ii) By analyzing the residuals, do these models provide reasonable explanations of how the number of cylinders a car engine has affects the gas mileage, either in the city or on the highway?

     Car                        Eng Capacity                City   Highway
     Type and Model             (Liters)       # Cylinders  mpg    mpg
1    Aston Martin V8 Vantage    4.3            8            13     20
2    Audi R8                    4.2            8            13     19
3    Audi TT Roadster           2.0            4            22     29
4    BMW Z4 3.0i                3.0            6            19     28
5    BMW Z4 Roadster            3.2            6            15     23
6    Bugatti Veyron             8.0            16           8      14
7    Cadillac XLR               4.4            8            14     21
8    Chevrolet Corvette         7.0            8            15     24
9    Dodge Viper                8.4            10           13     22
10   Ferrari 599 GTB            5.9            12           11     15
11   Honda S2000                2.2            4            18     25
12   Lamborghini Murcielago     6.5            12           8      13
13   Lotus Elise/Exige          1.8            4            21     27
14   Mazda MX5                  2.0            4            21     28
15   Mercedes Benz SL65 AMG     6.0            12           11     18
16   Nissan 350Z Roadster       3.5            6            17     24
17   Pontiac Solstice           2.4            4            19     25
18   Porsche Boxster-S          3.4            6            18     25
19   Porsche Cayman             2.7            6            19     28
20   Saturn SKY                 2.0            4            19     28

16.24 Refer to the data set in Problem 16.23. Repeat the analysis, this time for engine capacity. Examine the residuals and comment on any unusual observations.
16.25 The data in the table below show the size of a random sample of 10 homes (in square feet) located in the Mid-Atlantic region of the US, and the corresponding amount of electricity used (kW-hr) monthly in each home.

From a scatter plot of the data, postulate an appropriate regression model and estimate the parameters. Comment on the significance of the estimated parameters. Investigate the residuals and comment on the model fit to the data. What do the model parameters signify about how the size of a home in this region of the US influences the amount of electricity used?

Home Size     Electricity
x (sq. ft)    Usage y (kW-hr)
1290          1182
1350          1172
1470          1264
1600          1493
1710          1571
1840          1711
1980          1804
2230          1840
2400          1956
2930          1954

16.26 Some of the earliest contributions to chemical engineering science occurred in the form of correlations that shed light on mass and heat transfer in liquids and gases. These correlations were usually presented in the form of dimensionless numbers: combinations of physical variables that enable general characterization of transport and other phenomena in a manner that is independent of the specific physical dimensions of the equipment in question. One such correlation regarding mass transfer in falling liquid films is due to Gilliland and Sherwood (1934)¹⁰. It relates the Sherwood number, Sh (a dimensionless number representing the ratio of convective to diffusive mass transport), to two other dimensionless numbers: the Reynolds number, Re (a dimensionless number that gives a measure of the ratio of inertial forces to viscous forces), and the Schmidt number, Sc (the ratio of momentum diffusivity (viscosity) to mass diffusivity; it represents the relative ease of molecular momentum and mass transfer).

A sample of data from the original large set (of almost 400 data points!) is shown in the table below. From the postulated model,

$$Sh = \theta_0\, Re^{\theta_1} Sc^{\theta_2}$$

(i) Obtain estimates of the parameters $\theta_0$, $\theta_1$, and $\theta_2$, and compare them to the results in the original publication: respectively, 0.023, 0.830, and 0.440.
(ii) Using the parameters estimated in (i), plot the data in terms of $\log(Sh/Sc^{\theta_2})$ vs $\log Re$, along with your regression model, and comment on the observed fit.
Sh      Re       Sc
43.7    10800    0.600
21.5    5290     0.600
42.9    7700     1.610
19.8    2330     1.610
24.2    3120     1.800
88.0    14400    1.800
93.0    16300    1.830
70.6    13000    1.830
32.3    4250     1.860
56.0    8570     1.860
51.6    6620     1.875
50.7    8700     1.875
26.1    2900     2.160
41.3    4950     2.160
92.8    14800    2.170
54.2    7480     2.170
65.5    9170     2.260
38.2    4720     2.260

16.27 For efficient and profitable operation (especially during the summer months), electrical power companies need to predict, as precisely as possible, the peak power load, P, defined as the daily maximum amount of power required to meet demand. The inability to predict P accurately, and hence to provide sufficient power to meet the indicated demand, is responsible in part for many blackouts/brownouts.

The data shown in the table below are a random sample of 30 daily high temperatures (T, °F) and corresponding P values (in megawatts) acquired between the months of May and August in a medium-sized city.

¹⁰ Gilliland, E.R., and T.K. Sherwood (1934). "Diffusion of vapors into air streams," Ind. Eng. Chem., 26, 516-523.

Temp (°F)   P (megawatts)      Temp (°F)   P (megawatts)
95          140.7              79          106.2
88          116.4              76          100.2
84          113.4              87          114.7
106         178.2              92          135.1
94          136.0              68          96.3
108         189.3              85          111.4
90          132.0              100         143.6
100         151.9              74          103.9
71          92.5               89          116.5
96          131.7              86          105.1
67          96.5               75          99.6
98          150.1              70          97.7
97          153.2              69          97.6
67          101.6              82          107.3
89          118.5              101         157.6

(i) From the supplied data, obtain an equation expressing P as a function of the daily high temperature, T (°F), and comment on your fit.
(ii) Use the equation obtained in (i) to predict P for three different temperatures: T = 65, T = 85, and T = 110. Is it safe to use the model equation for these predictions? Why, or why not?
(iii) Predict the range of P corresponding to the following 2°F ranges in temperature: 68 ≤ T₁ ≤ 70; 83 ≤ T₂ ≤ 85; 102 ≤ T₃ ≤ 104.
(iv) If it is desired to predict P to within 2% of its nominal value, find the maximum range of uncertainty in daily high temperature forecasts that can be tolerated for each of the following three cases: $T_l$ = 69; $T_m$ = 84; $T_h$ = 103.
16.28 The data in the table below, from Fern (1983)¹¹, are to be used to calibrate a near-infrared instrument so that its reflectance measurements (at a specified wavelength) can be used to infer the protein content in wheat.

Protein      Observed       Protein      Observed
Content %    Reflectance    Content %    Reflectance
y            x              y            x
9.23         386            10.57        443
8.01         383            10.23        450
10.95        353            11.87        467
11.67        340            8.09         451
10.41        371            12.55        524
9.51         433            8.38         407
8.67         377            9.64         374
7.75         353            11.35        391
8.05         377            9.70         353
11.39        398            10.75        445
9.95         378            10.75        383
8.25         365            11.47        404

¹¹ Fern, A.T., (1983). "A misuse of ridge regression in the calibration of a near-infrared reflectance instrument," Applied Statistics, 32, 73-79.


Obtain an expression for the calibration line relating the protein content to the reflectance measurement. Determine the significance of the parameters. Plot the data, the model line, and the 95% confidence and prediction intervals. Comment objectively on how useful you expect the calibration line to be.
16.29 The following data set, obtained in an undergraduate fluid mechanics lab experiment, shows actual air flow rate measurements, determined at room temperature (25°C) and 1 atmosphere of pressure, along with the corresponding rotameter readings.

Rotameter    Air Flow
Reading      Rate (cc/sec)
x            y
20           15.5
40           38.3
60           50.2
80           72.0
100          111.1
120          115.4
140          139.0

(i) Determine an appropriate equation that can be used to calibrate the rotameter and from which actual air flow rates can be determined for any given rotameter reading. From the significance of the parameters and the R² and R²_adj values, is this a reliable expression to use as a calibration equation? From an appropriate analysis of the residuals, comment on how carefully the experimental data were determined.
(ii) Plot the data and the regression equation along with 95% confidence interval and prediction interval bands.
(iii) Determine, along with 95% confidence intervals, the expected values of the air flow rates for rotameter readings of 70, 75, 85, 90, and 95.
16.30 The data shown in the table below, from Beck and Arnold (1977)¹², show five samples of the thermal conductivity of a steel alloy as a function of temperature. The standard deviation, $\sigma_i$, associated with each measurement varies as indicated.

Sample   Temperature   Thermal Conductivity   Standard Deviation
i        xᵢ (°C)       kᵢ (W/m-°C)            σᵢ
1        100           36.3                   0.2
2        200           36.3                   0.3
3        300           34.6                   0.5
4        400           32.9                   0.7
5        600           31.2                   1.0

Over the indicated temperature range, the thermal conductivity varies linearly with temperature; the two-parameter model is therefore deemed appropriate, i.e.,

$$k_i = \theta_0 + \theta_1 x_i + \epsilon_i$$

¹² J.V. Beck and K.J. Arnold, (1977). Parameter Estimation in Engineering and Science, J. Wiley, NY, p. 209.

(i) Determine the weighted least squares estimates of the parameters $\theta_0$ and $\theta_1$, using as weights $w_i = 1/\sigma_i$.
(ii) Determine the ordinary least squares estimates of the same parameters using no weights. Compare the two sets of estimates.
(iii) Plot the data along with the two regression equations obtained above. Which one fits the data better?
16.31 The data table below is typical of the standard tables of thermophysical properties of liquids and gases used widely in chemical engineering practice, especially in process simulation. This specific data set shows the temperature dependence of the heat capacity, Cp, of methylcyclohexane.

Temperature   Heat Capacity    Temperature   Heat Capacity
(Kelvin)      Cp (kJ/kg·K)     (Kelvin)      Cp (kJ/kg·K)
150           1.426            230           1.627
160           1.447            240           1.661
170           1.469            250           1.696
180           1.492            260           1.732
190           1.516            270           1.770
200           1.541            280           1.801
210           1.567            290           1.848
220           1.596            300           1.888

First fit the linear model,

$$C_p = \theta_0 + \theta_1 T$$

and check the significance of the parameters, the residuals, and the R² and R²_adj values. Then fit the quadratic model,

$$C_p = \theta_0 + \theta_1 T + \theta_2 T^2$$

Again, check the significance of the parameters, the residuals, and the R² and R²_adj values. Finally, fit the cubic equation,

$$C_p = \theta_0 + \theta_1 T + \theta_2 T^2 + \theta_3 T^3$$

Once more, check the significance of the parameters, the residuals, and the R² and R²_adj values.

Which of the three models is the most appropriate to use as an empirical relationship representing how the heat capacity of methylcyclohexane changes with temperature? Plot the data and the regression curve of the selected model, along with the 95% confidence and prediction intervals.
16.32 The change in the bottoms temperature of a binary distillation column in response to a pulse input in the steam flow rate to the reboiler is represented by the following equation:

$$y = T - T_0 = \frac{AK}{\tau}\,e^{-t/\tau} \quad (16.215)$$

where $T_0$ is the initial (steady-state) temperature before the perturbation; A is the magnitude of the pulse input, idealized as a perfect delta function; t is time; and K and τ are, respectively, the process gain and time constant, unknown parameters of the process when the dynamics are approximated as a first order system¹³. A process identification experiment performed to estimate K and τ yielded the data in the following table, starting from an initial temperature of 185°C, and using an impulse input of magnitude A = 10.

Time      Bottoms Temperature
t (min)   T (°C)
0         189.02
1         188.28
2         187.66
3         187.24
5         186.54
10        185.46
15        185.20

Even though Eq (16.215) presents a nonlinear regression problem, it is possible, via an appropriate variable transformation, to convert it to a linear regression equation. Use an appropriate transformation and obtain estimates of the process parameters from the provided data.

Prior to performing the experiment, an experienced plant operator stated that, historically, in the operating range in question, the process parameters have been characterized as K ≈ 2 and τ ≈ 5. How close are these "guesstimates" to the actual estimated values?

¹³ B.A. Ogunnaike and W.H. Ray, (1994). Process Dynamics, Modeling and Control, Oxford University Press, NY, Chapter 5.
16.33 Fit Antoine's equation,

$$P^{vap} = e^{\theta_0 - \frac{\theta_1}{T + \theta_2}}$$

to the data in the table below, which shows the temperature dependence of the vapor pressure of toluene.

P^vap (mm Hg)   T (°C)
5               -4.4
10              6.4
20              18.4
40              31.8
60              40.3
100             51.9
200             69.5
400             89.5
760             110.6
1520            136.5

Use the fitted model to interpolate and obtain expected values for the vapor pressure of toluene at the following temperatures: 0, 25, 50, 75, 100, and 125 (°C). Since using linear regression to fit the equation to the data will require a variable transformation, obtain 95% confidence intervals for these expected values first in the transformed variables, and then convert these to approximate confidence intervals in the original variables.
16.34 In September 2007, two graduate students¹⁴ studying at the African Institute of Mathematical Sciences (AIMS) in Muizenberg, South Africa, took the following measurements of wingspan (the fingertip-to-fingertip length of outstretched hands) and height for 36 of their classmates.
Wingspan (cm)   Height (cm)     Wingspan (cm)   Height (cm)
182.50          171.00          165.50          158.00
167.50          161.50          193.00          189.50
175.00          170.00          198.00          183.00
163.00          164.00          181.50          181.00
186.50          180.00          154.00          157.00
168.50          162.50          168.00          165.00
166.50          158.00          174.00          166.50
156.00          157.00          180.00          172.00
153.00          156.00          173.00          171.50
170.50          162.00          188.00          179.00
164.50          157.50          188.00          176.00
170.50          165.50          180.00          178.00
173.00          164.00          160.00          163.00
189.00          182.00          200.00          184.00
179.50          174.00          177.00          180.00
174.50          165.00          179.00          169.00
186.00          175.00          197.00          183.00
192.00          188.00          168.50          165.00

(i) Obtain a two-parameter model relating height (as y) to wingspan (as x). Are the two parameters significant? Inspect the residuals and comment on what they imply about the data, the regression model, and how predictable height is from a measurement of wingspan.
(ii) If two new students, one short (156.5 cm) and one tall (187 cm), arrive in class later in the school year, estimate the respective expected wingspans for these new students, along with 95% confidence intervals.
16.35 Heusner (1991)¹⁵ compiled a collection of basal metabolism rate (BMR) values, $Q_0$ (in Watts), and body mass, M (g), for 391 mammalian species. The table below is a sample from this collection. The relationship between these variables, known as Kleiber's law, is:

$$Q_0 = \theta_0\, M^{\theta_1}$$

From this sample data, use a logarithmic transformation and estimate the unknown parameters, along with 95% confidence intervals. Use the unit of kg for M in the regression equation.

In the same article, Heusner presented some theoretical arguments for why the exponent should be $\theta_1 = 2/3$ across species. Compare your estimate with this theoretical value.

¹⁴ Ms. Tabitha Gathoni Mundia from Kenya and Mr. Simon Peter Johnstone-Robertson from South Africa.
¹⁵ Heusner, A.A. (1991). "Size and Power in Mammals," J. Exp. Biol., 160, 25-54.

Species                     Body Mass M (g)   BMR Q₀ (Watts)
Camelus dromedarius         407000.00         229.18
Sus scrofa                  135000.00         104.15
Tragulus javanicus          1613.00           4.90
Ailurus fulgens             5740.00           5.11
Arctitis binturong          14280.00          12.54
Canis latrans               10000.00          14.98
Herpestes auropunctatus     611.00            2.27
Meles meles                 11050.00          16.80
Mustela frenata             225.00            1.39
Anoura caudifer             11.50             0.24
Chrotopterus auritus        96.10             0.80
Eptesicus fuscus            16.90             0.11
Macroderma gigas            148.00            0.78
Noctilio leporinus          61.00             0.40
Myrmecophaga tridactyla     30600.00          14.65
Priodontes maximus          45190.00          17.05
Crocidura suaveolens        7.50              0.12
Didelphis marsupialis       1329.00           3.44
Lasiorhinus latifrons       25000.00          14.08
Elephas maximus             3672000.00        2336.50

Chapter 17
Probability Model Validation

17.1 Introduction ....................................................... 732
17.2 Probability Plots .................................................. 732
     17.2.1 Basic Principles ............................................ 733
     17.2.2 Transformations and Specialized Graph Papers ................ 734
     17.2.3 Modern Probability Plots .................................... 736
     17.2.4 Applications ................................................ 736
            Safety Data ................................................. 737
            Yield Data .................................................. 737
            Residual Analysis for Regression Model ...................... 737
            Others ...................................................... 737
17.3 Chi-Squared Goodness-of-fit Test ................................... 739
     17.3.1 Basic Principles ............................................ 739
     17.3.2 Properties and Application .................................. 741
            Poisson Model Validation .................................... 742
            Binomial Special Case ....................................... 743
17.4 Summary and Conclusions ............................................ 745
REVIEW QUESTIONS ........................................................ 746
EXERCISES ............................................................... 747
APPLICATION PROBLEMS .................................................... 750

An abstract analysis which is accepted
without any synthetic examination
of the question under discussion
is likely to surprise rather than enlighten us.
Daniel Bernoulli (1700-1782)

In his pithy statement, "All models are wrong but some are useful," the legendary George E. P. Box of Wisconsin was employing hyperbole to make a subtle but important point. The point, well-known to engineers, is that perfection is not a prerequisite for usefulness in modeling (in fact, it can be an impediment). If complex, real-world problems are to become tractable, idealizing assumptions are inevitable. But what is thus given up in perfection is more than made up for in usefulness, so long as the assumptions can be validated as reasonable. As a result, assessing the reasonableness of inevitable assumptions is, or ought to be, an important part of the modeling exercise; and this chapter is concerned with presenting some techniques for doing just that: validating distributional assumptions. We focus specifically on probability plots and the chi-squared goodness-of-fit test, two time-tested techniques that also happen to complement each other perfectly in such a way that, with one or the other, we are able to deal with both discrete and continuous probability models.

17.1 Introduction

When confronted with a problem involving a randomly varying phenomenon, the approach we have advocated thus far involves first characterizing the random phenomenon in question with an ideal probability model in the form of the pdf, $f(x; \theta)$, and then using the model to solve the problem at hand. As we have seen in the preceding chapters, fully characterizing the random phenomenon itself involves first (i) postulating a candidate probability model (for example, the Poisson model for the glass inclusions data of Chapter 1) based on an understanding of the underlying phenomenon; and then, (ii) using statistical inference techniques to obtain (point and interval) estimates of the unknown parameter vector, $\theta$, based on sample data, $X_1, X_2, \ldots, X_n$. However, before proceeding to use the postulated model to solve the problem at hand, it is always advisable to check to be sure that the model and the implied underlying assumptions are reasonable. The question of interest is therefore as follows:

Given sample data, $X_1, X_2, \ldots, X_n$, obtained from a random variable, $X$, with postulated pdf, $f(x)$ (and a corresponding cumulative distribution function (cdf), $F(x)$), is the postulated probability model reasonable?

The issue of probability model validation is considered in this chapter in its broadest sense: from checking the reasonableness of any postulated probability model for a free-standing random variable, $X$, to the validation of the near-universal normality assumption for the residuals in regression analysis discussed in Chapter 16. We focus on two approaches: (i) probability plotting and (ii) the chi-squared goodness-of-fit test. The first technique is a classic staple that has evolved from its strictly visual (and hence subjective), old-fashioned "probability paper" roots, to the more modern, computer-based incarnation, supported with more rigorous statistical analysis. It is better suited to continuous probability models. The second technique is a standard hypothesis test based on a chi-squared test statistic. While more versatile, being applicable to all random variable types, it is nevertheless applicable more directly to discrete probability models.

17.2 Probability Plots

17.2.1 Basic Principles

Consider that specific experimental data, $x_1, x_2, \ldots, x_n$, have been obtained from a population whose pdf is postulated as $f(x)$ (so that the corresponding cdf, $F(x)$, is also known). To use this data set to check the validity of such a distributional assumption, we start by ordering the data from the smallest to the largest, as $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, such that: $x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}$, i.e., $x_{(1)}$ is the smallest of the set, followed by $x_{(2)}$, etc., with $x_{(n)}$ as the largest. For example, one of the data sets on the waiting time (in days) until the occurrence of a recordable safety incident in a certain company's manufacturing site was given in Example 14.3 as $S_1 = \{16, 1, 9, 34, 63, 44, 1, 63, 41, 29\}$, a sample of size $n = 10$. This was postulated to be from an exponential distribution. When rank ordered, this data set will be $S_{1r} = \{1, 1, 9, 16, 29, 34, 41, 44, 63, 63\}$.

Now, observe that because of random variability, $X_{(i)n}$, the $i$th ranked observation of an $n$-sample set, will be a random variable whose observed value will change from one sample to the next. For example, we recall from the same Example 14.3 that a second sample of size 10 obtained from the same company the following year was given as: $S_2 = \{35, 26, 16, 23, 54, 13, 100, 1, 30, 31\}$; it is also considered to be from the same exponential distribution. When rank ordered, this data set yields: $S_{2r} = \{1, 13, 16, 23, 26, 30, 31, 35, 54, 100\}$. Note that the value for $x_{(5)}$, the fifth ranked from below, is 29 for $S_1$, but 26 for $S_2$.
Now, define the expected value of $X_{(i)n}$ as:

$$E\left[X_{(i)n}\right] = \mu_{(i)n} \quad (17.1)$$

This quantity can be computed for any given $f(x)$, in much the same way that we are able to compute the expectation, $E(X)$, of the regular, unranked random variable. The fundamental principle behind probability plots is that if the sample data set truly came from a population with the postulated pdf, then a plot of the ordered sample observations, $x_{(i)}$, against their respective expected values, $\mu_{(i)n}$, will lie on a straight line, with deviations due only to random variability. Any significant departure from this straight line will indicate that the distributional assumption is not true.

However, with the exception of the very simplest of pdfs, obtaining an exact closed form expression for $E\left[X_{(i)n}\right] = \mu_{(i)n}$ is not a trivial exercise. Nevertheless, using techniques that lie outside the intended scope of this book, it is possible to show that the expression

$$E\left[X_{(i)n}\right] = \mu_{(i)n} = F^{-1}\left(\frac{i - \alpha}{n - 2\alpha + 1}\right); \; i = 1, 2, \ldots, n \quad (17.2)$$

is a very good approximation that is valid for all pdfs (the constant $\alpha$ is defined later). Here $F^{-1}(\cdot)$ represents the inverse cumulative distribution function, i.e., the value of $x$ such that:

$$F(x) = \int_{-\infty}^{x} f(\xi)\,d\xi = \frac{i - \alpha}{n - 2\alpha + 1} \quad (17.3)$$

in which case

$$P\left\{X \leq E\left[X_{(i)n}\right]\right\} = \frac{i - \alpha}{n - 2\alpha + 1} \quad (17.4)$$

For our purposes here, the most important implication of this result is that if $i$ is the rank of the rank ordered observation, $x_{(i)}$, from a population with a postulated pdf $f(x)$, then the associated theoretical cumulative probability is $(i - \alpha)/(n - 2\alpha + 1)$. In other words, the $\frac{(i-\alpha)}{(n-2\alpha+1)} \times 100\%$ percentile determined from the theoretical cdf is $E\left[X_{(i)n}\right]$.

The constant, $\alpha$, depends on sample size, $n$, and on $f(x)$. However, for all practical purposes, the value $\alpha = 0.5$ has been found to work quite well for a wide variety of distributions, the exception being the uniform distribution, for which a closed form expression is easily obtained as $E\left[X_{(i)n}\right] = \mu_{(i)n} = i/(n+1)$, so that in this case, the appropriate value is $\alpha = 0$.

Observe in summary, therefore, that the principle behind the probability plot calls for rank ordering the data, then plotting the rank ordered data, $x_{(i)}$, versus its expected value, $\mu_{(i)}$ (where for convenience, we have dropped the indicator of sample size). From Eq (17.3), obtaining $\mu_{(i)}$ requires computing the value of the $(i - \alpha)/(n - 2\alpha + 1)$ quantile from the theoretical cdf, $F(x)$. A plot of $\mu_{(i)}$ on a regular, uniform scale cartesian y-axis, against $x_{(i)}$ on the similarly scaled cartesian x-axis, will show a straight line relationship when the underlying assumptions are reasonable.

17.2.2 Transformations and Specialized Graph Papers

The amount of computational effort required in determining $\mu_{(i)}$ from Eqns (17.2) and (17.3) can be substantial. However, observe from (17.2) that, instead of plotting $\mu_{(i)}$ on a regular scale, scaling the y-axis appropriately by $F(\cdot)$, the cdf in question, allows us to plot $x_{(i)}$ directly versus the cumulative probability itself,

$$q_i = \frac{(i - \alpha)}{(n - 2\alpha + 1)} \quad (17.5)$$

a much easier proposition that avoids the need to compute $E\left[X_{(i)n}\right]$ first.

This is the fundamental concept behind the old-fashioned "probability papers" whose scales are tailored for specific cdfs. The most popular of these specialized graph papers is the normal probability paper where the scale is wider at the low end ($q_i \to 0$), narrowing towards the middle ($q_i \approx 0.5$) and then widening out again symmetrically towards the high end ($q_i \to 1$). Most of these pre-lined graph sheets were constructed for $q_i \times 100\%$ (for percentile) on the x-axis, along with a regular, uniform y-axis for the rank ordered data. The resulting probability plots used to test for normality are routinely referred to as "normal plots" or "normal probability plots." Corresponding graph papers exist for the exponential, gamma, and a few other distributions.


TABLE 17.1: Table of values for safety data probability plot

Original   Rank Ordered   Rank   Cumulative Probability
Data, x    Data, x(i)     i      qi = (i - 0.5)/10
16         1              1      0.05
1          1              2      0.15
9          9              3      0.25
34         16             4      0.35
63         29             5      0.45
44         34             6      0.55
1          41             7      0.65
63         44             8      0.75
41         63             9      0.85
29         63             10     0.95

The advent of modern computer software packages has not only made these graphs obsolete; it has also made it possible for the technique to be more objective. But before proceeding to discuss the modern approach, we first illustrate the mechanics of the traditional approach to probability plotting with the following example.
Example 17.1: PROBABILITY PLOTTING FOR SAFETY INCIDENT DATA
Given the data set $S_1 = \{16, 1, 9, 34, 63, 44, 1, 63, 41, 29\}$ for the waiting time (in days) until the occurrence of a recordable safety incident, generate the data table needed for constructing a probability plot for this putative exponentially distributed random variable. Use $\alpha = 0.5$.

Solution:
Upon rank-ordering the data, we are able to generate Table 17.1 using $q_i$ defined as:

$$q_i = \frac{i - 0.5}{10} \quad (17.6)$$

A plot of $x_{(i)}$ directly versus $q_i$ on exponential probability paper should yield a straight line if the underlying distribution is truly exponential. Note that without the probability paper, it will be necessary to obtain $\mu_{(i)}$ first from

$$F(x) = 1 - e^{-\lambda x} \quad (17.7)$$

the cdf for the $E(\lambda)$ random variable, i.e.,

$$\mu_{(i)} = -\frac{1}{\lambda} \ln\left(1 - q_i\right) \quad (17.8)$$

and $x_{(i)}$ may then be plotted against $\mu_{(i)}$ on regular graph paper using regular uniform scales on each axis.
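For readers working in a scripting environment rather than with probability paper, the computations of Eqs (17.6)-(17.8) take only a few lines. The following is a minimal Python sketch (an added illustration, not part of the original example); the exponential rate is estimated from the sample mean, as in earlier chapters:

```python
import numpy as np

S1 = np.array([16, 1, 9, 34, 63, 44, 1, 63, 41, 29])

x_ord = np.sort(S1)                 # rank-ordered data, x_(i)
n = len(x_ord)
i = np.arange(1, n + 1)
q = (i - 0.5) / n                   # cumulative probabilities, Eq (17.6)

lam = 1.0 / S1.mean()               # exponential rate estimated from the data (mean = 30.1)
mu = -np.log(1.0 - q) / lam         # expected order statistics, Eq (17.8)

# If the exponential postulate is reasonable, the points (mu, x_ord)
# should scatter about a straight line.
for m, x in zip(mu, x_ord):
    print(f"{m:8.2f} {x:6d}")
```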


With the old-fashioned probability paper, determining how straight the plotted data truly were required subjective judgement. The advantage of the visual presentation afforded by the technique therefore had to be tempered by the subjectivity of the "eyeball" assessment. This drawback has now been overcome by the more precise characterization afforded by modern computer programs.

17.2.3 Modern Probability Plots

The advent of the computer, and the availability of statistical software packages, has transformed the probability plot from a subjective, purely graphical technique into a much more rigorous and effective probability model validation tool. With programs such as MINITAB, a theoretical distribution line representing the cdf corresponding to the postulated theoretical pdf is obtained, along with a 95% confidence interval, using the probability model parameters estimated from the supplied data (or as provided directly by the user, if available independently). In addition, an Anderson-Darling statistic (a numerical value associated with a test of the same name concerning the distribution postulated for the population from which a sample was obtained; see Chapter 18) is computed along with a p-value associated with the hypotheses:

H0: Data follow postulated distribution
Ha: Data do not follow postulated distribution

The displayed probability plot then consists of:

1. The rank ordered data, on the x-axis, versus the cumulative probability (as percentiles) on the appropriately scaled y-axis, but labeled in the original untransformed values;

2. The theoretical cdf straight line fit along with (approximate) 95% confidence intervals; and

3. A list of basic descriptive statistics and a p-value associated with the Anderson-Darling (AD) test.

As with other tests, if the p-value is lower than the chosen significance level (typically 0.05), then the null hypothesis is rejected, and we conclude that the data do not follow the hypothesized distribution. In general, we are able to come to one of three conclusions:

1. The model appears to be sufficiently adequate (if p > 0.05);

2. The model adequacy is indeterminable (if 0.01 < p < 0.05);

3. The model appears inadequate (if p < 0.01).
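The Anderson-Darling computation itself is also available outside MINITAB. As a minimal Python sketch (an added illustration, not the book's procedure), scipy's version of the test can be applied to the safety data set S2 for both the exponential postulate and the (deliberately wrong) normal postulate; note that scipy.stats.anderson reports the AD statistic with critical values rather than a p-value:

```python
import numpy as np
from scipy import stats

S2 = np.array([35, 26, 16, 23, 54, 13, 100, 1, 30, 31])

for dist in ("expon", "norm"):
    res = stats.anderson(S2, dist=dist)
    # Reject H0 at a given significance level when the statistic
    # exceeds the corresponding critical value.
    print(dist, res.statistic, res.critical_values, res.significance_level)
```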

17.2.4 Applications

To illustrate probability plotting using MINITAB, we now consider some example data sets encountered in earlier chapters.
Safety Data

The probability plots for the safety data sets S1 and S2 given above are obtained as follows: after entering the data into respective columns labeled X1 and X2, the sequence Graph > Probability Plot > opens a self-explanatory dialog box; and upon entering all the required information, the plots are generated as shown in Fig 17.1.

If we had mistakenly postulated the data set S2 to be normally distributed, for the sake of illustration, the resulting normal probability plot is shown in Fig 17.2. Even though the departure from the straight line fit does not appear to be too severe, the p-value of 0.045 means that we must reject the null hypothesis at the 0.05 significance level, thereby objectively putting the adequacy of this (clearly wrong) postulate into question.
Yield Data

For this illustration, we return to the data sets introduced in Chapter 1 (and subsequently analyzed extensively in ensuing chapters, especially Chapters 14 and 15) on the yields YA and YB obtained from two competing chemical processes A and B. Each data set, we may recall, was assumed to have come from Gaussian distributions. The probability plots for these data sets are shown in Fig 17.3. These normal plots, the 95% confidence intervals that completely envelope the entire data sets, and the indicated p-values, all strongly suggest that these distributional assumptions appear valid.
Residual Analysis for Regression Model

The MINITAB regression analysis feature offers, as one of its many options, a whole series of residual plots. One of these plots is a "Normal Plot" of residuals which, as the name implies, is a probability plot for the residuals based on the postulate that they are normally distributed. The specific plot shown in Fig 17.4 is obtained directly from within the regression analysis carried out for the data in Example 16.5, where the dependence of thermal conductivity, k, on Temperature was postulated as a two-parameter regression model with normally distributed random errors. Because it was generated as part of the regression analysis, this plot does not contain the usual additional features of the other probability plots. The residuals can be saved into a data column and separately subjected to the sort of analysis to which the other data sets in this section have been subjected. This is left to the reader as an exercise (See Exercise 17.1).
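As a rough indication of what that exercise involves, here is a minimal Python sketch (an added illustration, using the Example 16.5 data that is reproduced in Exercise 17.1) that fits the two-parameter model and extracts normal probability plot coordinates for its residuals:

```python
import numpy as np
from scipy import stats

# Thermal conductivity k versus temperature T, from Example 16.5
T = np.array([100, 150, 200, 250, 300, 350, 400, 450])
k = np.array([93.228, 92.563, 99.409, 101.590,
              111.535, 115.874, 119.390, 126.615])

# Two-parameter (straight-line) least-squares fit and its residuals
fit = stats.linregress(T, k)
residuals = k - (fit.intercept + fit.slope * T)

# Normal probability plot coordinates: theoretical quantiles (osm)
# versus ordered residuals (osr); r close to 1 supports normality
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(r)
```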


[Figure: exponential probability plots, with 95% CI bands, of X1 (Mean 30.10, N 10, AD 0.638, P-Value 0.298) and X2 (Mean 32.9, N 10, AD 0.536, P-Value 0.412)]

FIGURE 17.1: Probability plots for safety data postulated to be exponentially distributed, each showing (a) rank ordered data; (b) theoretical fitted cumulative probability distribution line along with associated 95% confidence intervals; (c) a list of summary statistics, including the p-value associated with a formal goodness-of-fit test. The indication from the p-values is that there is no evidence to reject H0; therefore the model appears to be adequate.


[Figure: normal probability plot, with 95% CI band, of X2 (Mean 32.9, StDev 27.51, N 10, AD 0.701, P-Value 0.045)]

FIGURE 17.2: Probability plot for safety data S2 wrongly postulated to be normally distributed. The departure from the linear fit does not appear too severe, but the low/borderline p-value (0.045) objectively compels us to reject H0 at the 0.05 significance level and conclude that the Gaussian model is inadequate for this data.

Others

In addition to the probability plots illustrated above for exponential and Gaussian distributions, MINITAB can also generate probability plots for several other distributions, including the lognormal, gamma, and Weibull distributions, all continuous distributions.

Probability plots are not used for discrete probability models in part because the associated cdfs consist of a series of discontinuous step functions, not smooth curves like continuous random variable cdfs. To check the validity of discrete distributions such as the binomial and Poisson, it is necessary to use the more versatile technique discussed next.

17.3 Chi-Squared Goodness-of-Fit Test

17.3.1 Basic Principles

While the probability plot is fundamentally a graphically-based approach, the chi-squared goodness-of-fit test is fundamentally computational. We begin by classifying the sample data, $x_1, x_2, \ldots, x_n$, into $m$ groups (or "bins"), and obtaining from there a frequency distribution with $f_i^o$ as the resulting observed frequency associated with the $i$th group, precisely how histograms are generated (see Chapter 12).


[Figure: normal probability plots, with 95% CI bands, of YA (Mean 75.52, StDev 1.432, N 50, AD 0.566, P-Value 0.135) and YB (Mean 72.47, StDev 2.764, N 50, AD 0.392, P-Value 0.366)]

FIGURE 17.3: Probability plots for yield data sets YA and YB postulated to be normally distributed. The 95% confidence intervals around the fitted line, along with the indicated p-values, strongly suggest that the distributional assumptions appear to be valid.


[Figure: normal probability plot of the standardized residuals (response is k)]

FIGURE 17.4: Normal probability plot for the residuals of the regression analysis of the dependence of thermal conductivity, k, on Temperature in Example 16.5. The postulated model, a two-parameter regression model with Gaussian distributed zero mean errors, appears valid.

how histograms are generated (see Chapter 12). From the postulated probability model, and its p parameters estimated from the sample data, the
theoretical (i.e., expected) frequency associated with each of the m groups,
i ; i = 1, 2, . . . , m, is then computed. If the postulated model is correct, the
observed and expected frequencies should be close. Because the observed frequencies are subject to random variability, their closeness to the corresponding theoretical expectations, quantied by,
C2 =

m
2

(f o i )
i

i=1

(17.9)

is a statistic that can be shown to have an approximate 2 () distribution


with = m p 1 degrees of freedom an approximation that improves
rapidly with increasing n.
The Chi-squared goodness-of-t test is a hypothesis test based on this
test statistic; the null hypothesis, H0 , that the data follow the postulated
probability model, is tested, at the signicance level, against the alternative
that the data do not follow the model. H0 is rejected if
C 2 > 2 ()

(17.10)


17.3.2 Properties and Application

The chi-squared goodness-of-fit test is versatile in the sense that it can be applied to both discrete and continuous random variables. With the former, the data already occur naturally in discrete groups; with the latter, theoretical frequencies must be computed by discretizing the continuous intervals. The test is also transparent, logical (as evident from Eq (17.9)) and relatively easy to perform. However, it has some important weaknesses also:

1. To be valid, the test requires that the expected frequency associated with each bin must be at least 5. Where this is not possible, it is recommended that adjacent bins be combined appropriately. This has the drawback that the test will not be very sensitive to tails of postulated models where, by definition, expected observations are few.

2. In general, the test lacks sensitivity in detecting inadequate models when $n$ is small.

3. Even though recommendations are available for how best to construct discrete intervals for continuous random variables, both the number as well as the nature of these discretized intervals are largely arbitrary and can (and often do) affect the outcome of the test. Therefore, even though applicable in principle, this test is not considered the best option for continuous random variables.

Poisson Model Validation

The following example illustrates the application of the chi-squared test to the glass manufacturing data presented in Chapter 1 and revisited in various chapters, including Chapter 8 (Example 8.8), Chapter 14 (Example 14.13) and Chapter 15 (Example 15.15).

Example 17.2: VALIDATING THE POISSON MODEL FOR INCLUSIONS DATA
The number of inclusions found in each of 60 square-meter sheets of manufactured glass, presented in Table 1.2 in Chapter 1, was postulated to be a Poisson random variable with the single parameter, $\lambda$. Perform a chi-squared goodness-of-fit test on this data to evaluate the reasonableness of this postulate.

Solution:
Recall from our various encounters with this data set that the Poisson model parameter, estimated from the data mean, is $\hat{\lambda} = 1.017$. If the data are now arranged into frequency groups for 0, 1, 2, and 3+ inclusions, we obtain the following table:

Data Group      Observed    Poisson                 Expected
(Inclusions)    Frequency   $f(x|\hat{\lambda})$    Frequency
0               22          0.3618                  21.708
1               23          0.3678                  22.070
2               11          0.1870                  11.219
3+              4           0.0834                  5.004

with the expected frequency obtained from $60 \times f(x|\hat{\lambda})$. We may now compute the desired $C^2$ statistic as:

$$C^2 = \frac{(22 - 21.708)^2}{21.708} + \frac{(23 - 22.070)^2}{22.070} + \frac{(11 - 11.219)^2}{11.219} + \frac{(4 - 5.004)^2}{5.004} = 0.249 \quad (17.11)$$

The associated degrees of freedom is $\nu = 4 - 1 - 1 = 2$, so that from the $\chi^2(2)$ distribution, we obtain

$$P\left(\chi^2(2) > 0.249\right) = 0.883 \quad (17.12)$$

As a result, we have no evidence to reject the null hypothesis, and hence conclude that the Poisson model for this data set appears adequate.

Of course, it is unnecessary to carry out any of the indicated computations by hand, even the frequency grouping. Programs such as MINITAB have chi-squared test features that can be used for problems of this kind. When MINITAB is used on this last example, upon entering the raw data into a column labeled "Inclusions," the sequence Stat > Basic Stats > Chi-Sq Goodness-of-Fit-Test for Poisson opens a self-explanatory dialog box; and making the required selections produces the following results, just as we had obtained earlier:
Goodness-of-Fit Test for Poisson Distribution

Data column: Inclusions
Poisson mean for Inclusions = 1.01667

                          Poisson                Contribution
Inclusions  Observed  Probability  Expected  to Chi-Sq
0                 22     0.361799   21.7079   0.003930
1                 23     0.367829   22.0697   0.039212
2                 11     0.186980   11.2188   0.004267
>=3                4     0.083392    5.0035   0.201279

 N  DF    Chi-Sq  P-Value
60   2  0.248687    0.883
The rightmost column in the MINITAB output shows the individual contributions from each group to the chi-squared statistic, an indication of how the lack-of-fit is distributed among the groups. For example, the group of 3 or more inclusions contributed by far the largest share of the discrepancy between observation and model prediction; but even this is not sufficient to jeopardize the model adequacy. MINITAB also produces graphical representations of these results, as shown in Fig 17.5.
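The same computation is also easily reproduced outside MINITAB. The following minimal Python sketch (an added illustration, not part of the original text) regenerates the expected frequencies from the Poisson pmf and recovers the $C^2$ statistic and p-value of Eqs (17.11) and (17.12):

```python
import numpy as np
from scipy import stats

lam = 1.01667                          # Poisson mean estimated from the data
f_obs = np.array([22, 23, 11, 4])      # observed frequencies: 0, 1, 2, 3+ inclusions

# Expected frequencies: 60 * f(x|lam), with the last group as the tail P(X >= 3)
p = stats.poisson.pmf([0, 1, 2], lam)
probs = np.append(p, 1.0 - p.sum())
f_exp = 60 * probs

c2 = np.sum((f_obs - f_exp) ** 2 / f_exp)   # Eq (17.9): approx 0.2487
p_value = stats.chi2.sf(c2, df=2)           # nu = m - p - 1 = 2: approx 0.883
print(c2, p_value)
```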


[Figure: bar charts of observed versus expected frequencies by category, and of each category's contribution to the chi-squared statistic]

FIGURE 17.5: Chi-squared test results for inclusions data and a postulated Poisson model. Top panel: Bar chart of "Expected" and "Observed" frequencies, which shows how well the model prediction matches observed data; Bottom panel: Bar chart of contributions to the chi-squared statistic, showing that the group of 3 or more inclusions is responsible for the largest model-observation discrepancy, by a wide margin.


Binomial Special Case

For the binomial case, where there are only two categories ($x$ "successes" and $n - x$ "failures" observed in $n$ independent trials being the observed frequencies in each respective category), for a postulated $Bi(n, p)$ model, the chi-squared statistic reduces to:

$$C^2 = \frac{(x - np)^2}{np} + \frac{[(n - x) - nq]^2}{nq} \quad (17.13)$$

where $q = 1 - p$, as usual. When this expression is consolidated to

$$C^2 = \frac{q(x - np)^2 + p[(n - x) - nq]^2}{npq}$$

upon introducing $q = 1 - p$ for the first term in the numerator and taking advantage of the difference of two squares result in algebra, the right hand side of the equation rearranges easily to give the result:

$$C^2 = \frac{(x - np)^2}{npq} \quad (17.14)$$

which, if we take the positive square root, reduces to:

$$Z = \frac{x - np}{\sqrt{npq}} \quad (17.15)$$

This, of course, is immediately recognized as the z-statistic for the (large sample) Gaussian approximation to the binomial random variable used to carry out the z-test of the observed mean against a postulated mean $np$, as discussed in Chapter 15. Thus, the chi-squared test for the binomial model is identical to the standard z-test when the population parameter $p$ is specified independently.
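A quick numerical check of this equivalence is straightforward. The Python sketch below (an added illustration, using hypothetical numbers: 62 successes in n = 100 trials under H0: p = 0.5) confirms that the upper-tail chi-squared probability of $C^2$ with one degree of freedom equals the two-sided z-test p-value:

```python
import numpy as np
from scipy import stats

n, x, p = 100, 62, 0.5            # hypothetical data and postulated p
q = 1 - p

c2 = (x - n * p) ** 2 / (n * p * q)        # Eq (17.14)
z = (x - n * p) / np.sqrt(n * p * q)       # Eq (17.15)

p_chi2 = stats.chi2.sf(c2, df=1)           # chi-squared test p-value
p_z = 2 * stats.norm.sf(abs(z))            # two-sided z-test p-value
print(p_chi2, p_z)                         # identical (approx 0.0164)
```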

17.4 Summary and Conclusions

This chapter has been primarily concerned with examining two methods for validating probability models: modern probability plots and the chi-squared goodness-of-fit test. While we presented the principles behind these methods, we concentrated more on applying them, particularly with the aid of computer programs. With some perspective, we may now observe the following as the main points of the chapter:

• Probability plots augmented with theoretical model fits and p-values are most appropriate for continuous models;

• Chi-squared tests, on the other hand, are more naturally suited to discrete models (although they can also be applied to continuous models after appropriate discretization).

As a practical matter, it is important to keep in mind that, just as with other hypothesis tests, a postulated probability model can never be completely proven adequate by these tests (on the basis of finite sample data), but inadequate models can be successfully identified as such. Still, it can be difficult to identify inadequate models with these tests when sample sizes are small; our chances of identifying inadequate models correctly as inadequate improve significantly as $n \to \infty$. Therefore, as much sample data as possible should be used to validate probability models; and wherever possible, the data set used to validate a model should be collected independently of that used to estimate the parameters. Some of the end-of-chapter exercises and applications problems are used to reinforce these points.

Finally, it must be kept in mind always that no model is (or can ever be) perfect. The final decision about the validity of the model assumptions rests with the practitioner, the person who will ultimately use these models for problem solving, and these tests should be considered properly only as objective guides, not as final and absolute arbiters.

REVIEW QUESTIONS
1. What is the primary question of interest in probability model validation?
2. What are the two approaches discussed in this chapter for validating probability
models?
3. Which approach is better suited to continuous probability models and which one
is applicable most directly to discrete probability models?
4. What is the fundamental principle behind probability plots?
5. What is the fundamental concept behind old-fashioned probability plots?
6. What hypothesis test accompanies modern probability plots?
7. What does a modern probability plot consist of?
8. Why are probability plots not used for discrete probability models?
9. What is a chi-squared goodness-of-fit test?

10. What are some weaknesses of the chi-squared goodness-of-fit test?


11. The chi-squared goodness-of-fit test for the binomial model is identical to what familiar hypothesis test?

EXERCISES
17.1 In Example 16.5, a two-parameter regression model of how a metal's thermal conductivity varies with temperature was developed from the data shown here again for ease of reference.

k (W/m-°C)   Temperature (°C)
93.228       100
92.563       150
99.409       200
101.590      250
111.535      300
115.874      350
119.390      400
126.615      450

The two-parameter model was postulated for the relationship between k and T with the implicit assumption that the errors are normally distributed. Obtain the residuals from the least-squares fit and generate a normal probability plot (and ancillary analysis) of these residuals. Comment on the validity of the normality assumption for the regression model errors.
17.2 In Problem 16.28, the data in the table below¹ was presented as the basis for calibrating a near-infrared instrument to be used to determine protein content in wheat from reflectance measurements.

For this data set to produce a useful calibration curve, the regression model must be adequate; and an important aspect of regression model adequacy is the nature of its residuals. In this particular case, the residuals are required to be random and approximately normally distributed. By analyzing the residuals from the regression exercise appropriately, comment on whether or not the resulting regression model should be considered as adequate.

¹Fearn, T., (1983). "A misuse of ridge regression in the calibration of a near-infrared reflectance instrument," Applied Statistics, 32, 73-79.

Protein        Observed         Protein        Observed
Content %, y   Reflectance, x   Content %, y   Reflectance, x
9.23           386              10.57          443
8.01           383              10.23          450
10.95          353              11.87          467
11.67          340              8.09           451
10.41          371              12.55          524
9.51           433              8.38           407
8.67           377              9.64           374
7.75           353              11.35          391
8.05           377              9.70           353
11.39          398              10.75          445
9.95           378              10.75          383
8.25           365              11.47          404

17.3 The following data is postulated to have been sampled from an exponential E(4) population. Validate the postulated model appropriately. Repeat the validation exercise as if the population parameter was unknown and hence must be estimated from the sample data. Does knowing the population parameter independently make any difference in this particular case?

6.99    2.84    0.41    3.75    2.16
0.52    0.67    2.72    5.22    16.65
10.36   1.66    3.26    1.78    1.31
5.75    0.12    6.51    4.05    1.52

17.4 The data below are random samples from two independent lognormal distributions; specifically, XL1 ~ L(0, 0.25) and XL2 ~ L(0.25, 0.25).

XL1       XL2
0.81693   1.61889
0.96201   1.15897
1.03327   1.17163
0.84046   1.09065
1.06731   1.27686
1.34118   0.91838
0.77619   1.45123
1.14027   1.47800
1.27021   2.16068
1.69466   1.46116

(i) Test the validity of these statements directly from the data as presented.
(ii) Test the validity of these statements indirectly by taking a logarithmic transformation of the data, and carrying out an appropriate analysis of the resulting log-transformed data. Compare the results with those obtained in (i).
17.5 If $X_1$ is a lognormal random variable with parameters $(\alpha, \beta_1)$, and $X_2$ is a lognormal random variable with parameters $(\alpha, \beta_2)$, it has been postulated that the product:

$$Y = X_1 X_2$$

has a lognormal distribution with parameters $(\gamma, \beta_y)$ where:

$$\beta_y = \beta_1 + \beta_2$$

(i) Confirm this result theoretically. (Hint: use results from Chapter 6 regarding the sum of Gaussian random variables.)
(ii) In the data table below, X1 is a random sample drawn from a distribution purported to be L(0.25, 0.50); and X2 is a random sample drawn from a distribution purported to be L(0.25, 0.25).

X1        X2
1.16741   1.61889
1.58631   1.15897
2.00530   1.17163
1.67186   1.09065
1.63146   1.27686
1.61738   0.91838
0.74154   1.45123
2.96673   1.47800
1.50267   2.16068
1.99272   1.46116

From this data set, obtain the corresponding values for Y defined as the product Y = X1 X2. According to the result stated and proved in (i), what is the theoretical distribution of Y? Confirm that the computed sample data set for Y agrees with this postulate.
17.6 The data in the following table (Exercise 12.12) shows samples of size n = 20 drawn from four different populations postulated to be normal, N, lognormal, L, gamma, G, and inverse gamma, I, respectively.

XN        XL        XG        XI
9.3745    7.9128    10.0896   0.084029
8.8632    5.9166    15.7336   0.174586
11.4943   4.5327    15.0422   0.130492
9.5733    33.2631   5.5482    0.115567
9.1542    24.1327   18.0393   0.187260
9.0992    5.4151    17.9543   0.100054
10.2631   16.9556   12.5549   0.101405
9.8737    3.9345    9.6640    0.100835
7.8192    35.0376   14.2975   0.097173
10.4691   25.1182   4.2599    0.141233
9.6981    1.1804    19.1084   0.060470
10.5911   2.3503    7.0735    0.127663
11.6526   15.6894   7.6392    0.074183
10.4502   5.8929    14.1899   0.086606
10.0772   8.0254    13.8996   0.084915
10.2932   16.1482   9.7680    0.242657
11.7755   0.6848    8.5779    0.052291
9.3790    6.6974    7.5486    0.116172
9.9202    3.6909    10.4043   0.084339
10.9067   34.2152   14.8254   0.205748

(i) Validate these postulates using the full data sets. Note that the population parameters have not been specified.
(ii) Using only the top half of each data set, repeat (i). For this particular example, what effect, if any, does sample size have on the probability plots approach to probability model validation?
17.7 The data in the table below was presented in Exercise 15.18 as a random sample of 15 observations each from two normal populations with unknown means and variances. Test the validity of the normality assumption for each data set. Interpret your results.

Sample   X       Y
1        12.03   13.74
2        13.01   13.59
3        9.75    10.75
4        11.03   12.95
5        5.81    7.12
6        9.28    11.38
7        7.63    8.69
8        5.70    6.39
9        11.75   12.01
10       6.28    7.15
11       12.53   13.47
12       10.22   11.57
13       7.17    8.81
14       11.36   13.10
15       9.16    11.32

APPLICATION PROBLEMS
17.8 The following data set, from a study by Lucas (1985)², shows the number of accidents occurring per quarter (three months) at a DuPont company facility, over a 10-year period. The data set has been partitioned into two periods: Period I is the first five-year period of the study; Period II, the second five-year period.

Period I          Period II
5   5  10   8     3   4   2   0
4   5   7   3     1   3   2   2
2   8   6   9     7   7   1   4
5   6   5  10     1   2   2   1
6   3   3  10     4   4   4   4

When the data set was presented in Problem 13.22, it was simply stated as a matter of fact that a Poisson pdf was a reasonable model for representing this data.

²Lucas, J. M., (1985). "Counted Data CUSUMs," Technometrics, 27, 129-144.

Check the validity of this assumption for Period I data and for Period II data separately.

17.9 Refer to Problem 17.8. One means by which one can test if two samples are truly from the same population is to compare the empirical distributions obtained from each data set against each other directly; if the two samples are truly from the same distribution, within the limits of experimental error, there should be no significant difference between the two empirical distributions. State an appropriate null hypothesis and the alternative hypothesis and carry out a chi-squared goodness-of-fit test of the empirical distribution of the data for Period I versus that of Period II. State your conclusions clearly.
17.10 The table below (see Problem 9.40) shows frequency data on distances between DNA replication origins (inter-origin distances), measured in vivo in Chinese Hamster Ovary (CHO) cells by Li et al., (2003)³, as reported in Chapter 7 of Birtwistle (2008)⁴. Phenomenologically, inter-origin distance should be a gamma distributed random variable, and this data set has been analyzed in Birtwistle (2008) on this basis. Carry out a formal test to validate the gamma model assumption. Interpret your results.

Inter-Origin       Relative
Distance (kb), x   Frequency, fr(x)
0                  0.00
15                 0.02
30                 0.20
45                 0.32
60                 0.16
75                 0.08
90                 0.11
105                0.03
120                0.02
135                0.01
150                0.00
165                0.01
17.11 The time in months between occurrences of safety violations in a toll manufacturing facility is shown in the table below for three operators, A, B, C.

A   1.31   0.15   3.02   3.17   4.84   0.71   0.70   1.41   2.68   0.68
B   1.94   3.21   2.91   1.66   1.51   0.30   0.05   1.62   6.75   1.29
C   0.79   1.22   0.65   3.90   0.18   0.57   7.26   0.43   0.96   3.76

It is customary to postulate an exponential probability model for this phenomenon. Is this a reasonable postulate for each data set in this collection? Support your answer adequately.
³Li, F., Chen, J., Solessio, E. and Gilbert, D. M. (2003). "Spatial distribution and specification of mammalian replication origins during G1 phase," J. Cell Biol., 161, 257-266.
⁴Birtwistle, M. R., (2008). Modeling and Analysis of the ErbB Signaling Network: From Single Cells to Tumorigenesis, PhD Dissertation, University of Delaware.


17.12 The data table below (also presented in Problem 8.26) shows x, a count of the number of species, x = 1, 2, . . . , 24, and the associated number of Malayan butterflies that have x number of species. When the data was first published and analyzed in Fisher et al., (1943)⁵, the logarithmic series distribution (see Exercise 8.13), with the pdf,

$$f(x) = \frac{\alpha p^x}{x}; \; 0 < p < 1; \; x = 1, 2, \ldots,$$

where

$$\alpha = \frac{-1}{\ln(1 - p)}$$

was proposed as the appropriate model for the phenomenon in question. This pdf has since become the model of choice for data involving this phenomenon.
x (No of species)      1    2    3    4    5    6    7    8
Observed Frequency   118   74   44   24   29   22   20   19

x (No of species)      9   10   11   12   13   14   15   16
Observed Frequency    20   15   12   14    6   12    6    9

x (No of species)     17   18   19   20   21   22   23   24
Observed Frequency     9    6   10   10   11    5    3    3

Formally validate this model for this specific data set. What is the p-value associated with the test? What does it indicate about the validity of this model?
17.13 The data in the table below is the time-to-publication of 85 papers published
in the January 2004 issue of a leading chemical engineering research journal. (See
Problem 1.13).

⁵Fisher, R. A., S. Corbet, and C. B. Williams. (1943). "The relation between the number of species and the number of individuals in a random sample of an animal population," Journal of Animal Ecology, 12, 42-58.

19.2   15.1    9.6    4.2    5.4
 9.0    5.3   12.9    4.2   15.2
17.2   12.0   17.3    7.8    8.0
 8.2    3.0    6.0    9.5   11.7
 4.5   18.5   24.3    3.9   17.2
13.5    5.8   21.3    8.7    4.0
20.7    6.8   19.3    5.9    3.8
 7.9   14.5    2.5    5.3    7.4
19.5    3.3    9.1    1.8    5.3
 8.8   11.1    8.1   10.1   10.6
18.7   16.4    9.8   10.0   15.2
 7.4    7.3   15.4   18.7   11.5
 9.7    7.4   15.7    5.6    5.9
13.7    7.3    8.2    3.3   20.1
 8.1    5.2    8.8    7.3   12.2
 8.4   10.2    7.2   11.3   12.0
10.8    3.1   12.8    2.9    8.8

Given the nature of the phenomenon in question, postulate an appropriate model and validate it. If your first postulate is invalid, postulate another one until you obtain a valid model.

17.14 Refer to Problem 17.13. This time, obtain a histogram and represent the data in the form of a frequency table. Using this discretized version of the data, perform an appropriate test to validate the model proposed and validated in Problem 17.13. Is there a difference between the results obtained here and the ones obtained in Problem 17.13? If yes, offer an explanation for what may be responsible for this difference.
17.15 The distribution of income of families in the US in 1979 (in actual dollars uncorrected for inflation) is shown in the table below (Prob 4.28):

Income level, x    Percent of Population
(x $10^3)          with income level, x
0-5                4
5-10               13
10-15              17
15-20              20
20-25              16
25-30              12
30-35              7
35-40              4
40-45              3
45-50              2
50-55              1
> 55               1

It has been postulated that the lognormal distribution is a reasonable model for this phenomenon. Carry out an appropriate test to confirm or refute this postulate. Keep in mind that the data is not in "raw" form, but has already been processed into a frequency table. Interpret your results.

17.16 The appropriate analysis of over-dispersed Poisson phenomena with the negative binomial distribution was pioneered with the classic data and analysis of Greenwood and Yule (1920)⁶. The data in question, shown in the table below (see Problem 8.28), is the frequency of accidents occurring, over a five-week period, to 647 women making high explosives during World War I.

Number of Accidents   Observed Frequency
0                     447
1                     132
2                     42
3                     21
4                     3
5+                    2

First, determine from the data, parameter estimates for a Poisson model and then determine k and p for a negative binomial model (see Problem 14.40). Next, conduct formal chi-squared goodness-of-fit tests for the Poisson model and then for the negative binomial. Interpret your test results. From your analysis, which model is more appropriate for this data set?

⁶Greenwood, M. and Yule, G. U. (1920). "An enquiry into the nature of frequency distributions representative of multiple happenings with particular reference of multiple attacks of disease or of repeated accidents," Journal of the Royal Statistical Society, 83, 255-279.
17.17 Mee (1990)⁷ presented the following data on the wall thickness (in inches) of cast aluminum cylinder heads used in aircraft engine cooling jackets. When presented in Problem 14.44 (and following the maximum entropy arguments of Problem 10.15), the data was assumed to be a random sample from a normal population. Validate this normality assumption and comment on whether or not this is a reasonable assumption.

0.223   0.228   0.214   0.193   0.223   0.213   0.218   0.215   0.233
0.201   0.223   0.224   0.231   0.237   0.217   0.204   0.226   0.219

⁷Mee, R. W., (1990). "An improved procedure for screening based on a correlated, normally distributed variable," Technometrics, 32, 331-337.

17.18 A sample of 20 silicon wafers selected and examined for flaws produced the result (the number of flaws found on each wafer) shown in the following table.

3   0   0   2   3   0   3   2   1   2
4   1   2   3   2   1   2   4   0   1

When this data set was first presented in Problem 12.20, it was suggested that the Poisson model is reasonable for problems of this type. For this particular problem, however, is this a reasonable model? Interpret your results.
17.19 According to census records, the age distribution of the inhabitants of the United States in 1960 and in 1980 is as shown in the table below.

Age Group   1960     1980
< 5         20,321   16,348
5-9         18,692   16,700
10-14       16,773   18,242
15-19       13,219   21,168
20-24       10,801   21,319
25-29       10,869   19,521
30-34       11,949   17,561
35-39       12,481   13,965
40-44       11,600   11,669
45-49       10,879   11,090
50-54       9,606    11,710
55-59       8,430    11,615
60-64       7,142    10,088
>= 65       16,560   25,550

(i) It is typical to assume that such data are normally distributed. Is this a reasonable assumption in each case?
(ii) Visually, the two distributions appear different. But are they significantly so? Carry out an appropriate test to check the validity of any assumption of equality of these two age distributions.
17.20 In Problem 13.34 and in Example 15.1, it was assumed that the following data, two sets of random samples of trainee scores from large groups of trainees instructed by "Method A" and "Method B," are both normally distributed.

Method A   71   75   65   69   73   66   68   71   74   68
Method B   72   77   84   78   69   70   77   73   65   75

Carry out an appropriate test and confirm whether or not such an assumption is justified.
17.21 The data below is the computed fractional intensity, $\eta = I_{test}/(I_{test} + I_{ref})$, for a collection of special genes (known as "housekeeping genes"), where $I_{test}$ is the measured fluorescence intensity under test conditions, and $I_{ref}$, the intensity under reference conditions. If these 10 genes are true housekeeping genes, then within the limits of measurement noise, the computed values of $\eta$ should come from a symmetric Beta distribution with mean value 0.5. Use the method of moments to estimate parameter values for the postulated Beta $B(\alpha, \beta)$ distribution. Carry out an appropriate test to validate the Beta model hypothesis.

$\eta_i$
0.585978
0.504057
0.182831
0.426575
0.455191
0.804720
0.741598
0.332909
0.532131
0.610620


17.22 Padgett and Spurrier (1990)⁸ obtained the following data set for the breaking strengths (in GPa) of carbon fibers used in making composite materials.

1.4   3.7   3.0   1.4   1.0   2.8   4.9   3.7   1.8   1.6
3.2   1.6   0.8   5.6   1.7   1.6   2.0   1.2   1.1   1.7
2.2   1.2   5.1   2.5   1.2   3.5   2.2   1.7   1.3   4.4
1.8   0.4   3.7   2.5   0.9   1.6   2.8   4.7   2.0   1.8
1.6   1.1   2.0   1.6   2.1   1.9   2.9   2.8   2.1   3.7

(i) In their analysis, Padgett and Spurrier postulated a Weibull $W(\zeta, \beta)$ distribution model with parameters $\zeta = 2.0$ and $\beta = 2.5$ for the phenomenon in question. Validate this model assumption by carrying out an appropriate test.
(ii) Had the model parameters not been given, so that their values must be determined from the data, repeat the test in (i) and compare the results. What does this imply about the importance of obtaining independent parameter estimates before carrying out probability model validation tests?
17.23 The data set below, from Holmes and Mergen (1992)⁹, is a sample of viscosity measurements taken from ten consecutive, but independent, batches of a product made in a batch chemical process.

S10 = {13.3, 14.5, 15.3, 15.3, 14.3, 14.8, 15.2, 14.9, 14.6, 14.1}

Part of the assumption in the application noted in the reference is that this data constitutes a random sample from a normal population with unknown mean and unknown variance. Confirm whether or not this is a reasonable assumption.

⁸Padgett, W. J. and J. D. Spurrier, (1990). "Shewhart-type charts for percentiles of strength distributions," J. of Quality Tech., 22, 283-288.
⁹Holmes, D. S., and A. E. Mergen, (1992). "Parabolic control limits for the exponentially weighted moving average control charts," Qual. Eng., 4, 487-495.

Chapter 18
Nonparametric Methods

18.1  Introduction ........................................................ 758
18.2  Single Population ................................................... 760
      18.2.1  One-Sample Sign Test ........................................ 760
              Basic Test Characteristics .................................. 760
              Comparison with Parametric Alternatives ..................... 763
      18.2.2  One-Sample Wilcoxon Signed Rank Test ........................ 763
              Basic Test Characteristics .................................. 763
              Comparison with Parametric Alternatives ..................... 765
18.3  Two Populations ..................................................... 765
      18.3.1  Two-Sample Paired Test ...................................... 766
      18.3.2  Mann-Whitney-Wilcoxon Test .................................. 766
              Basic Test Characteristics .................................. 766
              Comparison with Parametric Alternatives ..................... 769
18.4  Probability Model Validation ........................................ 769
      18.4.1  The Kolmogorov-Smirnov Test ................................. 770
              Basic Test Characteristics .................................. 770
              Key Features ................................................ 771
      18.4.2  The Anderson-Darling Test ................................... 771
              Key Features ................................................ 772
18.5  A Comprehensive Illustrative Example ................................ 772
      18.5.1  Probability Model Postulate and Validation .................. 772
      18.5.2  Mann-Whitney-Wilcoxon Test .................................. 775
18.6  Summary and Conclusions ............................................. 775
REVIEW QUESTIONS .......................................................... 778
EXERCISES ................................................................. 781
APPLICATION PROBLEMS ...................................................... 784

Just as some women are said to be handsome
though without adornment,
so this subtle manner of speech,
though lacking in artificial graces, delights us.
Cicero (106-43 BC)

Models of randomly varying phenomena, even in idealized form, have been undeniably useful in providing solutions to many important practical problems: problems involving free-standing random variables from identifiable populations, as well as regression modeling for relating one variable to another. After making a case in the last chapter for taking more seriously the (oft-neglected) exercise of validating these models, and then presenting the means for carrying out these validation exercises, we are now faced with a very important question: what happens when the assumptions that are supposed to make our data analysis lives easier are invalid? In particular, what happens when real life does not cooperate with the Gaussian distributional assumptions required for carrying out the t-tests and the F-tests, and other similar tests on which important statistical decisions rest?

Many tiptoe nervously around such issues, in the hope that the repercussions of invalid assumptions will be minimal; some stubbornly refuse to believe that violating any of these assumptions can really have any meaningful impact on their analyses; still others naively ignore such issues, primarily out of a genuine lack of awareness of the distributional requirements at the foundation of these analysis techniques. But none of this is an acceptable option for the well-trained engineer or scientist.

The objective in this chapter is to present some viable alternatives to consider when distributional assumptions are invalid. These techniques, which make little or no demand on the specific distributional structure of the population from whence the data came, are sometimes known as distribution-free methods. And precisely because they do not involve population parameters, in contrast to the distribution-based techniques, they are also known as nonparametric methods. Inasmuch as entire textbooks have been written on the subject of nonparametric statistics, complete treatises on statistical analysis without the support (and, some would say, encumbrance) of hard probability distribution models, the discussion here will necessarily be limited to only the few most commonly used techniques. And to put the techniques in proper context, we will compare and contrast these nonparametric alternatives with the corresponding parametric methods, where possible.

18.1 Introduction

There are at least two broad classes of problems for which the classical hypothesis tests discussed in Chapter 15 are unsuitable:

1. When the underlying distributional assumptions (especially the Gaussian assumptions) are seriously violated;

2. When the data in question is ordinal only, not measured on a quantitative scale in which the distance between succeeding entities is uniform or even meaningful (see Chapter 12).

In each of these cases, even in the absence of any knowledge of the mathematical characteristics of the underlying distributions, the sample data can always be rank ordered by magnitude. The data ranks can then be used to analyze such data with little or no assumptions about the probability distributions of the populations.


TABLE 18.1: A professor's teaching evaluation scores organized by student type

Graduate   Undergraduate
Students   Students
3          4
4          3
4          4
2          2
3          3
4          5
4          3
5          3
2          4
4          2
4          2
4          3
           4
           3
           4

Such nonparametric (or "distribution-free") techniques have the obvious advantage that they are versatile: they can be used in a wide variety of cases, even when distributional assumptions are valid. By the same token, they are also quite robust (there are fewer or no assumptions to violate). However, for the very same reasons, they are not always the most efficient. For the same sample size, the power (the probability of correctly rejecting a false H0) is always higher for the parametric tests discussed in Chapter 15 when the assumptions are valid, compared to the power of a corresponding nonparametric test. Thus, if distributional assumptions are reasonably valid, parametric methods are preferred; when the assumptions are seriously in doubt, nonparametric methods provide a viable (perhaps even the only) alternative.

Finally, consider the case where a professor who taught an intermediate level statistics course to a class that included both undergraduate and graduate students is evaluated on a scale from 1 to 5, where 5 is the highest rating. The evaluation scores from a class of 12 graduate and 15 undergraduate students is shown in Table 18.1. If the desire is to test whether or not the professor received more favorable ratings from graduate students, observe that this data set is ordinal only; it is not the usual quantitative data that is amenable to the usual parametric methods. But this ordinal data can be ranked since, regardless of the "distance" between each assigned number, we know that 5 is "better" than 4, which is in turn "better" than 3, etc. The question of whether graduate students rated the professor higher than

undergraduates can only be answered, therefore, by the nonparametric techniques we will discuss in this chapter.
The discussion to follow will focus on the following techniques:

• For single populations, the one-sample sign test and the one-sample Wilcoxon signed rank test;

• For two populations, the Mann-Whitney-Wilcoxon (MWW) test; and

• For probability model validation, the Kolmogorov-Smirnov and Anderson-Darling tests.

Nonparametric tests for comparing more than two populations will be discussed at the appropriate places in the next chapter. Our presentation here will focus on the underlying principles, with some simple illustrations of the mechanics involved in the computations; much of the computational details will be left to, and illustrated with, computer programs that have facilitated the modern application of these techniques.

18.2 Single Population

18.2.1 One-Sample Sign Test

The one-sample sign test is a test of hypotheses about the median, $\eta$, of a continuous distribution. Recall that the median is defined as that value for which $P(X \leq \eta) = P(X \geq \eta) = 0.5$. The null hypothesis in this case is:

$$H_0: \eta = \eta_0 \quad (18.1)$$

to be tested against the usual array of alternatives:

$$H_{aL}: \eta < \eta_0$$
$$H_{aR}: \eta > \eta_0$$
$$H_{a2}: \eta \neq \eta_0 \quad (18.2)$$

Basic Test Characteristics


Given a random sample, $X_1, X_2, \ldots, X_n$, as with other tests, first we need an appropriate test statistic to be used in determining rejection regions from such sample data. In this regard, consider the sample deviation from the postulated median, $D_{mi}$, defined by:

$$D_{mi} = X_i - \eta_0 \quad (18.3)$$


Furthermore, suppose that we are concerned for the moment only with the sign of this quantity, the magnitude being of no interest for now; i.e., all we care about is whether $X_i$ shows a positive deviation from the postulated median (when $X_i > \eta_0$) or it shows a negative deviation (when $X_i < \eta_0$). Then,

$$D_{mi} = \begin{cases} + & X_i > \eta_0 \\ - & X_i < \eta_0 \end{cases} \quad (18.4)$$

(Note that the requirement that $X$ must be a continuous random variable rules out, in principle, but not necessarily in practice, the potential for a "tie" where $X_i$ exactly matches the value for the postulated median, since the probability of this event occurring is theoretically zero. However, if by chance $X_i - \eta_0 = 0$, this data is simply left out of the analysis.)
Observe that as dened in Eq (18.4), there are only two possible outcomes
for the quantity, Dmi , making it a classic Bernoulli random variable. Now, if
H0 is true, the sequence of Dmi observations should contain about an equal
number of + and entries. If T + is the total number of + signs (representing
the total number of positive deviations from the median, arising from observations greater than the median), then this random variable has a binomial
distribution with binomial probability of success pB = 0.5 if H0 is true.
Observe therefore that T + has all the properties of a useful test statistic.
The following are therefore the primary characteristics of this test:
1. The test statistic is T + , the total number of plus signs;
2. Its sampling distribution is binomial; specically, if H0 is true, T +
Bi(n, 0.5)
For any specific experimental data, the observed total number of plus signs, t+, can then be used to compute the rejection region or, alternatively, to test directly for significance by computing p-values as follows. For the one-sided lower-tailed alternative, HaL,

p = P(T+ ≤ t+ | pB = 0.5)    (18.5)

the probability of observing a total of t+ plus signs or fewer, out of a sample of n, given equiprobable + or − outcomes. For the one-sided upper-tailed alternative, HaR, the required p-value is obtained from

p = P(T+ ≥ t+ | pB = 0.5)    (18.6)

Finally, for the two-tailed test, Ha2,

p = 2P(T+ ≤ t+ | pB = 0.5)  for t+ < n/2
p = 2P(T+ ≥ t+ | pB = 0.5)  for t+ > n/2    (18.7)

Let us illustrate this procedure with an example.

Example 18.1: MEDIAN OF EXPONENTIAL DISTRIBUTION
The data set S1 = {16, 1, 9, 34, 63, 44, 1, 63, 41, 29}, first presented in Example 14.3 (and later in Example 17.1), is the waiting time (in days) until the occurrence of a recordable safety incident at a certain company's manufacturing site. This was postulated to be from an exponential distribution. Use the one-sample sign test to test the null hypothesis that the median, η = 21, versus the two-sided alternative that it is not.
Solution:
From the given data and the postulated median, we easily generate the following table:

Time Data
(S1 )

Deviation
from 0

16
1
9
34
63
44
1
63
41
29

5
20
12
13
42
23
20
42
20
8

Sign
Dmi

+
+
+

+
+
+

This shows six total plus signs, so that t+ = 6. Because of the two-sided alternative hypothesis, and since t+ > n/2 = 5, we need to compute P(x ≥ 6) for a Bi(10, 0.5) random variable. This is obtained as:

P(x ≥ 6) = 1 − P(x ≤ 5) = 1 − 0.623 = 0.377    (18.8)

The p-value associated with this sign test is therefore 2 × 0.377 = 0.754. Therefore, there is no evidence to reject the null hypothesis that the median is 21. (The sample median is 31.50, obtained as the average of the 5th and 6th ranked sample data, 29 and 34.)

It is highly recommended, of course, to use statistical software packages for carrying out such tests. To use MINITAB, upon entering the data into a column labeled S1, the sequence Stat > Nonparametrics > 1-Sample Sign opens the usual dialog box where the test characteristics are entered. The MINITAB output is as follows:

Sign Test for Median: S1
Sign test of median = 21.00 versus not = 21.00
      N  Below  Equal  Above       P  Median
S1   10      4      0      6  0.7539   31.50
For large samples, the sampling distribution for T+, Bi(n, 0.5), tends to N(μ, σ²), with mean μ = npB = 0.5n, and standard deviation σ = √(npB(1 − pB)). Thus, the test statistic

Z = (T+ − 0.5n) / (0.5√n)    (18.9)

can be used to carry out the sign test for large samples (typically n > 10-15), exactly like the standard z-test.
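A quick numerical sketch (assuming SciPy; the values are from Example 18.1, where n = 10 is really below this recommended range) shows why the approximation should be reserved for larger samples:

import math
from scipy.stats import binom, norm

n, t_plus = 10, 6
z = (t_plus - 0.5 * n) / (0.5 * math.sqrt(n))   # Eq (18.9)
p_approx = 2 * norm.sf(z)                        # two-sided normal p-value
p_exact = 2 * binom.sf(t_plus - 1, n, 0.5)       # exact binomial p-value
print(p_approx, p_exact)   # ~0.527 versus 0.754
# A continuity correction (using t_plus - 0.5 in the numerator)
# brings the approximation much closer to the exact value.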
Comparison with Parametric Alternatives
The sign test is the nonparametric alternative to the one-sample z-test and
the one-sample t-test. These parametric tests are for hypotheses concerning
the means of normal populations; the sign test on the other hand is for the
median of any general population with a continuous probability distribution.
If the data are susceptible to outliers or the distribution is long-tailed (i.e., skewed), the sign test is more appropriate; if the distribution is close to normal, the parametric t- and z-tests will perform better.

18.2.2 One-Sample Wilcoxon Signed Rank Test

The one-sample Wilcoxon signed rank test is also a test of hypotheses about the median of continuous distributions. It goes one step further than the sign test by taking into account the magnitude of the deviation from the median, Dmi, in addition to the sign. However, it is restricted to symmetric distributions and is therefore not applicable to skewed distributions. (The recommended option for skewed distributions is the more general, but less powerful, sign test discussed above.)
Basic Test Characteristics

The test concerns the median, η, of a continuous and symmetric distribution, with the null and alternative hypotheses as stated in Eqs (18.1) and (18.2). From the usual sample data, obtain, as before, Dmi = Xi − η0, the deviation from the median. The test calls first for ranking the absolute values |Dmi| in ascending order, and subsequently attaching to these ranks the signs corresponding to the original Dmi values, to obtain the "signed ranks."
Let W+ be the sum of the ranks with positive signs (i.e., where Dmi > 0), and W−, the sum of the ranks where Dmi < 0. (For obvious reasons, W+ is known as the positive rank sum.) These are the basic statistics used for the Wilcoxon signed rank test. Different variations of the test use different versions of these test statistics. For example, some use one or the other of W+ or W−, some use max(W+, W−), and others min(W+, W−). MINITAB and several other software packages base the test on W+, as will the discussion in this section.
The statistic W+ has some distinctive characteristics: for example, for a sample size of n, the largest value it can attain is (1 + 2 + ... + n) = n(n + 1)/2, when the entire sample exceeds the postulated value for the median, η0. It can attain a minimum value of 0. Between these extremes, W+ takes other values that are easily determined via combinatorial computations. And now, if H0 is true, the sampling distribution for W+ can be computed numerically and used to determine significance levels. The computations involved in determining the sampling distribution of W+ under H0 are cumbersome analytically, but relatively easy to execute with a computer program; the test is thus best carried out with statistical software packages.
As usual, large sample approximations exist. In particular, it can be shown that the sampling distribution for W+ tends to N(μ, σ²) for large n, with μ = n(n + 1)/4 and σ² = n(n + 1)(2n + 1)/24. But it is not recommended to use the normal approximation because it is not sufficiently precise; besides, the computations involved in the exact test are trivial for computer programs.
We use the next example to illustrate the mechanics involved in this test before completing the test itself with MINITAB.
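As an aside before the example, the numerical computation mentioned above is easy to sketch by direct enumeration (a Python sketch, not from the text): under H0, each rank 1, ..., n independently carries a + or − sign with probability 1/2.

from itertools import product

def wplus_null_pmf(n):
    # Enumerate all 2^n equally likely sign assignments to ranks 1..n
    # and tabulate the resulting values of W+ (the positive rank sum).
    counts = {}
    for signs in product((0, 1), repeat=n):
        w = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        counts[w] = counts.get(w, 0) + 1
    total = 2 ** n
    return {w: c / total for w, c in sorted(counts.items())}

pmf = wplus_null_pmf(10)                           # exact distribution, n = 10
print(sum(p for w, p in pmf.items() if w >= 31))   # P(W+ >= 31) under H0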
Example 18.2: CHARACTERIZING SOME HOUSEKEEPING GENES
In genomic studies, genes that are involved in basic functions needed to keep the cell alive are always turned on (i.e., they are "constitutively expressed"). Such genes are known colloquially as "housekeeping genes." Because their gene expression status hardly changes, they are sometimes used to calibrate experimental systems for measuring changes in gene expression. Data on 10 such putative housekeeping genes has been selected from a larger set of microarray data and presented as φ, the fractional intensity, Itest/(Itest + Iref), where Itest is the measured fluorescence intensity under test conditions, and Iref, the intensity under reference conditions. Within the limits of random variability, the values of φ determined for housekeeping genes should come from a symmetric distribution scaled between 0 and 1. If these 10 genes are true housekeeping genes, the median of the data population for φ should be 0.5. To illustrate the mechanics involved in using the one-sample Wilcoxon signed rank test to test this hypothesis against the alternative that the median is not 0.5, the following table is a summary of the raw data and the subsequent analysis required for carrying out this test.

φi         Dmi = φi − 0.5   |Dmi|      Rank   Signed Rank
0.585978    0.085978        0.085978    5       5
0.504057    0.004057        0.004057    1       1
0.182831   −0.317169        0.317169   10     −10
0.426575   −0.073425        0.073425    4      −4
0.455191   −0.044809        0.044809    3      −3
0.804720    0.304720        0.304720    9       9
0.741598    0.241598        0.241598    8       8
0.332909   −0.167091        0.167091    7      −7
0.532131    0.032131        0.032131    2       2
0.610620    0.110620        0.110620    6       6


The first column is the raw fractional intensity data; the second column is the deviation from the median, whose absolute value is shown in column 3, and ranked in column 4. The last column shows the signed rank. Note that in this case, w+ = 31, the sum of all the ranks carrying the plus sign, i.e., (5 + 1 + 9 + 8 + 2 + 6).

When MINITAB is used to carry out this test, the required sequence is Stat > Nonparametrics > 1-Sample Wilcoxon; the result is shown below:

Wilcoxon Signed Rank Test: Phi
Test of median = 0.5000 versus median not = 0.5000
          N for    Wilcoxon           Estimated
       N   Test   Statistic       P      Median
Phi   10     10        31.0   0.760      0.5186

The high p-value indicates that there is no evidence to support rejecting the null hypothesis. We therefore conclude that the median is likely to be 0.5 and that the selected genes appear to be true housekeeping genes.
As with the sign test, occasionally Xi exactly equals the postulated median, η0, creating a "tie," even though the theoretical probability that this will occur is zero. Under these circumstances, most software packages simply set such data aside, and the output will reflect this. For example, in MINITAB, the "N for Test" entry indicates how many of the original sample of size N actually survived to be used for the test. In the gene expression example above there were no ties, since φ was computed to a large enough number of decimal places.
Comparison with Parametric Alternatives

The symmetric distribution requirement for the one-sample Wilcoxon signed rank test might make it appear to be a direct nonparametric alternative to the one-sample z- or t-tests, but this is not quite true. For normally distributed populations, the Wilcoxon signed rank test is slightly less powerful than the t-test. In any event, there is no reason to abandon the classical parametric tests when the Gaussian assumption is valid. It is for other symmetric distributions, such as the uniform or the symmetric beta, where the Gaussian assumption does not readily hold, that the Wilcoxon signed rank test is particularly useful.

18.3 Two Populations

The general problem involves two separate and mutually independent populations, with respective unknown medians η1 and η2. As with the parametric tests, we are typically concerned with testing hypotheses about the difference between these two medians, i.e.,

δ = η1 − η2.    (18.10)

Depending on the nature of the data sets, we can identify two categories: (i) the special case of paired data; and (ii) the more general case of unpaired samples.

18.3.1 Two-Sample Paired Test

In this case, once the differences between the pairs, Di = X1i − X2i, have been obtained, the problem can be treated exactly like the one-sample case presented above. The hypothesis will be tested on a postulated value for the median of Di, say δ0, which need not be zero. When δ0 = 0, the test reduces to a test of the equality of the two population medians. In any event, the one-sample tests discussed above, either the sign test or the Wilcoxon signed rank test (if the distribution of the difference, Di, can be considered reasonably symmetric), can now be applied. No additional considerations are needed.
Thus, the two-sample paired test is exactly the same as the one-sample test when it is applied to the paired differences.
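As a short illustrative sketch (with made-up paired measurements, not data from the text), the reduction to a one-sample problem is a single subtraction:

import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired observations (e.g., before/after measurements).
x1 = np.array([5.1, 4.8, 6.0, 5.6, 5.3, 4.9, 5.7, 5.2])
x2 = np.array([4.9, 4.5, 5.8, 5.7, 5.0, 4.6, 5.5, 5.1])
d = x1 - x2             # test the median of D against delta0 = 0
stat, p = wilcoxon(d)   # valid if D can be considered symmetric
print(stat, p)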

18.3.2 Mann-Whitney-Wilcoxon Test

The Mann-Whitney-Wilcoxon (MWW) test (also known as the two-sample Wilcoxon rank sum test) is a test of hypothesis regarding the medians of two independent populations with identical, continuous distributions that are possibly shifted apart. The test is also applicable to discrete distributions provided the scale is ordinal. The typical null hypothesis, written in general form, is

H0: η1 − η2 = δ0    (18.11)

where ηi is the median for distribution i. This is tested against the usual triplet of alternatives. For tests of equality, δ0 = 0.
Basic Test Characteristics

The test uses a random sample of size n1 from population 1, X11, X12, ..., X1n1, and another of size n2 from population 2, X21, X22, ..., X2n2, where n1 need not equal n2. First, the entire nT = n1 + n2 observations are combined and ranked in ascending order as though they were from the same population. (Identical observations are assigned equal ranks, determined as the average of the individual ranks they would have been assigned had they been different.)
Two related statistics can now be identified: W1, the sum of the ranks in sample 1, and W2, the equivalent sum for sample 2. Again, we note that

W1 + W2 = nT(nT + 1)/2    (18.12)

the sum of the first nT integers. And now, because this sum is fixed, observe that a small value for one (adjusted for the possibility that n1 ≠ n2) automatically implies a large value for the other, and hence a large difference between the two. Also note that the minimum attainable value for W1 is n1(n1 + 1)/2, and for W2, n2(n2 + 1)/2. The Mann-Whitney U test statistic used to determine significance levels is defined as follows:

U1 = W1 − n1(n1 + 1)/2
U2 = W2 − n2(n2 + 1)/2
U = min(U1, U2)    (18.13)

The original Wilcoxon test statistic is slightly different but leads to equivalent results; it derives from Eq (18.12), and makes use of the larger of the two rank sums, i.e.,

W1 = nT(nT + 1)/2 − W2    (18.14)

if W1 > W2, and reversed if W1 < W2.
The sampling distributions for U or for W1 (or W2) are all somewhat like that for W+ above; they can be determined computationally for any given n1 and n2 and used to compute significance levels. As such, the entire test procedure is best carried out with the computer.
We now use the next example to illustrate first the mechanics involved in carrying out this test, and then complete the test with MINITAB.
Example 18.3: TESTING FOR SAFETY PERFORMANCE IMPROVEMENT
Revisit the problem first presented in Example 14.3 regarding a company's attempt to improve its safety performance. Specifically, we recall that to improve its safety record, the company began tracking the time in between recordable safety incidents. During the first year of the program, the waiting time (in days) until the occurrence of a recordable safety incident was obtained as

S1 = {16, 1, 9, 34, 63, 44, 1, 63, 41, 29}

The data record for the second year of the program is

S2 = {35, 26, 16, 23, 54, 13, 100, 1, 30, 31}

On the basis of this data, which shows a mean time to incident of 30.1 days for the first year and 32.9 days for the second year, the company's safety coordinator is preparing to make the argument to upper management that there has been a noticeable improvement in the company's safety record from Year 1 to Year 2. Is there evidence to support this claim?

Solution:
First, from phenomenological reasoning, we expect the data to follow the exponential distribution. This was confirmed as reasonable in Chapter 17. When this fact is combined with a sample size that is relatively small, this becomes a clear case where the t-test is inapplicable. We could use distribution-appropriate interval estimation procedures to answer this question (as we did in Example 14.14 of Chapter 14). But such procedures are too specialized and involve custom computations.
Problems of this type are ideal for the Mann-Whitney-Wilcoxon test. A table of the data and a summary of the subsequent analysis required for this test is shown here.

Year 1, S1   Rank   Year 2, S2   Rank
16            6.5   35           14.0
1             2.0   26            9.0
9             4.0   16            6.5
34           13.0   23            8.0
63           18.5   54           17.0
44           16.0   13            5.0
1             2.0   100          20.0
63           18.5   1             2.0
41           15.0   30           11.0
29           10.0   31           12.0

First, note how the complete set of 20 total observations has been combined and ranked. The lowest entry, 1, occurs in 3 places: twice in sample 1, and once in sample 2. Each is assigned the rank of 2, which is the average of the ranks 1, 2, and 3 they would otherwise have occupied. The other ties, two entries of 16 and two of 63, are given respective ranks of 6.5 and 18.5.
Next, the sum of ranks for sample 1 is obtained as w1 = 105.5; the corresponding sum for sample 2 is w2 = 104.5. Note that these sum to 210, which is exactly nT(nT + 1)/2 with nT = 20; the larger, w1, is used to determine the significance level. But even before the formal test is carried out, note the closeness of these two rank sums (the sample sizes are the same for each data set). This already gives us a clue that we are unlikely to find evidence of significant differences between the two medians.

To carry out the MWW test for this last example using MINITAB, the required sequence is Stat > Nonparametrics > Mann-Whitney; the ensuing results are shown below.

Mann-Whitney Test and CI: S1, S2
     N  Median
S1  10    31.5
S2  10    28.0
Point estimate for ETA1-ETA2 is -0.00
95.5 Percent CI for ETA1-ETA2 is (-25.01,24.99)
W = 105.5
Test of ETA1 = ETA2 vs ETA1 not = ETA2 is significant at 1.0000
The test is significant at 1.0000 (adjusted for ties)

From here, we see that the p-value is 1.000, implying that there is no evidence to reject the null hypothesis, H0: η1 = η2. Notice that the W value reported for the test is the same as w1 obtained above, the larger of the two rank sums.
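Equivalent results can be obtained outside MINITAB; here is a minimal sketch (assuming SciPy), which reports the Mann-Whitney U statistic of Eq (18.13) rather than the rank sum W, with an equivalent p-value:

from scipy.stats import mannwhitneyu

S1 = [16, 1, 9, 34, 63, 44, 1, 63, 41, 29]      # Year 1 waiting times (days)
S2 = [35, 26, 16, 23, 54, 13, 100, 1, 30, 31]   # Year 2 waiting times (days)
u, p = mannwhitneyu(S1, S2, alternative='two-sided')
# Depending on SciPy version, u is U1 = w1 - n1*(n1+1)/2 = 50.5,
# or min(U1, U2); the p-value is equivalent either way.
print(u, p)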
The next example illustrates the MWW test for discrete ordinal data.
The primary point to watch out for in such cases is that because the data is
discrete, there is a much higher chance that there will be ties.
Example 18.4: MWW TEST FOR DISCRETE ORDINAL DATA
From the data in Table 18.1, use MINITAB to test the hypothesis that the professor in question received equal ratings from both undergraduate and graduate students.
Solution:
The results of the MINITAB analysis are shown below:

Mann-Whitney Test and CI: Grad, Undergrad
            N  Median
Grad       12   4.000
Undergrad  15   3.000
Point estimate for ETA1-ETA2 is -0.000
95.2 Percent CI for ETA1-ETA2 is (0.000,1.000)
W = 188.0
Test of ETA1 = ETA2 vs ETA1 not = ETA2 is significant at 0.3413
The test is significant at 0.3106 (adjusted for ties)
Even though the median rating is 4.0 for the graduate students and 3.0 for the undergraduates, the result, with p = 0.34 (p = 0.31 when adjusted for the ties), indicates that there is no evidence of a difference between the ratings given by the undergraduate students and those given by the graduate students.
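A hedged sketch of the same analysis in Python (using the Table 18.1 values as reconstructed above; with this many ties, SciPy falls back on the tie-corrected normal approximation):

from scipy.stats import mannwhitneyu

grad      = [3, 4, 4, 2, 3, 4, 4, 5, 2, 4, 4, 4]
undergrad = [4, 3, 4, 2, 3, 5, 3, 3, 4, 2, 2, 3, 4, 3, 4]
u, p = mannwhitneyu(grad, undergrad, alternative='two-sided')
print(u, p)   # p should be close to MINITAB's ties-adjusted 0.31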

Comparison with Parametric Alternatives

The MWW test is generally considered a direct nonparametric alternative to the two-sample t-test. When the populations are normal, the former is somewhat less powerful than the latter with pooled sample variance; in most other cases, the MWW test is quite appreciably more powerful. However, if the two populations being compared are different (in shape, and/or have different standard deviations, etc.), the parametric two-sample t-test without pooling variances may be the better option.

18.4 Probability Model Validation

When the hypothesis test to be performed involves not just the mean, median, or variance of a distribution but the entire distribution itself, the truly distribution-free approach is the Kolmogorov-Smirnov (K-S) test. The critical values of the test statistic (that determine the rejection region) are entirely independent of the specific distribution being tested. This makes the K-S test truly nonparametric; it also makes it less sensitive, especially to discrepancies between the observed data and the tail area characteristics of many distributions. The Anderson-Darling (A-D) test, a modified version of the K-S test designed to improve the K-S test's sensitivity, achieves its improved sensitivity by making explicit use of the specific distribution to be tested in computing the critical values. This introduces the primary disadvantage that critical values must be custom-computed for each distribution. But what is lost in generality is more than made up for in overall improved performance. We now review, briefly, the key characteristics of these two tests.

18.4.1 The Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov (K-S) test is a test concerning the distribution of a population from which a sample was obtained. It is based on the empirical cumulative distribution function determined from the sample data, and applies only to continuous distributions.

Basic Test Characteristics

Let X1, X2, ..., Xn be a random sample drawn from a population with a postulated pdf f(x), from which the corresponding cdf, F(x), follows. If the random sample is ordered from the smallest to the largest as follows: X(1) ≤ X(2) ≤ ... ≤ X(n), then, by definition, the empirical (data-based) probability that X ≤ X(i) is the ratio of the number of observations less than X(i) (say n(i)) to the total number of observations, n. Thus, the empirical cumulative distribution function is defined as:

FE(i) = n(i)/n    (18.15)

The null and alternative hypotheses are:

H0: The sample data follow the specified distribution
Ha: The sample data do not follow the specified distribution

Let F(x(i)) represent the theoretical cumulative probability P(X ≤ x(i)) computed for a specific observation, x(i), using the theoretical cdf postulated for the population from which the data is purported to have come. The K-S test is based on the difference between this theoretical cdf and the empirical cdf as in Eq (18.15). The test statistic is:

D = max over 1 ≤ i ≤ n of { F(X(i)) − (i − 1)/n , i/n − F(X(i)) }    (18.16)

As with other hypothesis tests, the null hypothesis is rejected at the significance level α if D > Dα, where the critical value Dα is typically obtained from computations easily carried out in software packages.
Key Features

The primary distinguishing feature of the K-S test is that the sampling distribution of its test statistic, D, is entirely independent of the postulated distribution that is being tested. This makes the test truly "distribution-free." It is a nonparametric alternative to the chi-squared goodness-of-fit test discussed in Chapter 17. Unlike that test, which requires large sample sizes for the χ² distribution approximation for the test statistic to be valid, the K-S test is an exact test.
A primary limitation is its restriction to continuous distributions that must be completely specified, i.e., the distribution parameters cannot be estimated from sample data. Also, because it is based on a single point with the "worst" distance between the theoretical and empirical distributions, the test is prone to ignoring less prominent but still important mismatches at the tails, in favor of more influential mismatches at the center of the distribution. The K-S test is therefore likely to have less power than a test that employs a more broadly based test statistic that is evenly distributed over the entire variable space. This is the raison d'être for the Anderson-Darling test, which has all but supplanted the K-S test in many applications.
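As a concrete sketch (the sample here is synthetic, not from the text), SciPy's kstest implements this procedure for any fully specified continuous cdf:

import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
x = rng.exponential(scale=30, size=50)     # synthetic sample
# The postulated cdf must be completely specified (loc = 0, scale = 30),
# not estimated from the sample, per the limitation noted above.
result = kstest(x, 'expon', args=(0, 30))
print(result.statistic, result.pvalue)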

18.4.2 The Anderson-Darling Test

The Anderson-Darling test¹ is a modification of the K-S test that (a) uses the entire cumulative distribution (not just the single worst point of departure from the empirical cdf), and hence, (b) gives more weight to the distribution tails than is possible with the K-S test.
The test statistic is:

A² = −n − (1/n) Σ (from i = 1 to n) (2i − 1)[ ln F(X(i)) + ln(1 − F(X(n+1−i))) ]    (18.17)

However, the sampling distribution of A2 depends on the postulated distribution function; critical values therefore must be computed individually for
each distribution under consideration. Nevertheless, these critical values are
1 Anderson, T. W.; Darling, D. A. (1952). Asymptotic theory of certain goodness-of-t
criteria based on stochastic processes. Annals of Mathematical Statistics 23: 193212

772

Random Phenomena

available for many important distributions, including (but not limited to) the
normal, lognormal, exponential, and Weibull distributions. The test is usually
carried out using statistical software packages.
Key Features
The primary distinguishing feature of the A-D test is that it is more sensitive than the K-S test, but for this advantage, the critical values for A2
must be custom-calculated for each distribution. The test is applicable even
with small sample sizes, n < 20. It is therefore generally considered a better
alternative to the chi-square and K-S goodness-of-t tests.
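A minimal sketch (with a synthetic sample) using SciPy, whose anderson function returns distribution-specific critical values in place of a p-value:

import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(1)
x = rng.normal(loc=10, scale=2, size=40)   # synthetic sample
res = anderson(x, dist='norm')   # also supports 'expon', 'logistic', ...
print(res.statistic)             # the A^2 statistic
print(res.significance_level)    # [15. 10. 5. 2.5 1.] (percent)
print(res.critical_values)       # reject H0 at a level if A^2 exceeds these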

18.5 A Comprehensive Illustrative Example

In experimental and theoretical neurobiology, "action potentials," the spike trains generated by neurons, are used extensively to study nerve-cell activity. Because these spike trains and the dynamic processes that cause them are random, probability and statistics have been used to great benefit in such studies. For example, it is known that the distribution of interspike intervals (ISI), the elapsed time between the appearance of two consecutive spikes in the spike train, can reveal something about synaptic mechanism².
Table 18.2 contains 100 ISI measurements (in milliseconds) for the pyramidal tract cell of a monkey when awake (PT-W), and when asleep (PT-S), extracted from a more comprehensive set involving a few other cell types. The objective is to determine whether the activity of the pyramidal tract cell of a monkey is the same whether asleep or awake.

18.5.1 Probability Model Postulate and Validation

First, from phenomenological considerations, these interspike intervals represent times to the occurrence of several Poisson events, suggesting that they might follow the gamma(α, β) distribution, where α will represent the effective number of Poisson events leading to each observed action potential, and β, the mean rate of occurrence of the Poisson events. A histogram of both data sets, with superimposed gamma distribution fits, is shown in Fig 18.1. Visually, the fits appear quite good, but we can make more quantitative statements by carrying out formal probability model validation.
Because the postulated gamma model is continuous, we choose the probability plot approach and the Anderson-Darling test. The null hypothesis is that both sets of ISI data follow the gamma distribution; the alternative is that they do not.

²Braitenberg, V. (1965). "What can be learned from spike interval histograms about synaptic mechanism?" J. Theor. Biol., 8, 419-425.


TABLE 18.2: Interspike intervals data for the pyramidal tract cell of a monkey when awake (PT-W), and when asleep (PT-S)

PT-W ISI (milliseconds)      PT-S ISI (milliseconds)
 25   28   53  108            74  135  132   94
 49   52   94   65           221  137  125  146
 56   59  107   65           228   80  173  179
 52   35   23   79            71  116   44  195
 48   41  100   46           145  119   99  156
113   81   52   74            73   55  150  192
 50   27   70   35            94   88  236  143
 76   52   33   55            80  143  157   78
 68   69   40   51            78  121   49  199
108   65   47   50            79  130  162   98
 45   34   97   75           132   68  139  138
 73   97   16   47            73   97  152  187
105   66   74   79           129   66  202   63
 60   93   61   63           119  111  151   86
 70   57   74   95            79   85   56  104
 64   31   85   37            89  136  105  116
 39  103   71   61            66  105   60  133
 92   39   59   60           209  175   96  137
 49   51   75   48            84  157   81   89
 94   73  125   44            86  133  178  116
 71   47   69   71           110  103   73   85
 95   60   77   49           122   89  145   90
 54   43  140   51           119   94   98  222
 54   42  100   64           119  141  102   81
 56   33   41   71            91   95   75   63


[Figure 18.1 appears here: histograms with fitted gamma densities. Top: Histogram of PT-W (Shape 7.002, Scale 9.094, N 100); Bottom: Histogram of PT-S (Shape 7.548, Scale 15.61, N 100); Frequency plotted against the ISI data.]

FIGURE 18.1: Histograms of interspike intervals data with gamma model fit for the pyramidal tract cell of a monkey. Top panel: when awake (PT-W); Bottom panel: when asleep (PT-S). Note the similarities in the estimated values for the shape parameter α for both sets of data, and the difference between the estimates for β, the scale parameter.


The results of this test (carried out in MINITAB) are shown in Fig 18.2. The p-value for the A-D test in each case is higher than the typical significance level of 0.05, leading us to conclude that it is reasonable to consider the data as coming from gamma-distributed populations. The probability plots also show, for both cases, the entire data sets falling within the 95% confidence intervals of the theoretical model line fits. The implication therefore is that we cannot use the two-sample t-test for this problem, since the Gaussian assumption is invalid for confirmed gamma-distributed data. An examination of the histograms in fact shows two skewed distributions that have essentially the same shape, but the one for PT-S appears shifted to the right. The recommendation therefore is to use the Mann-Whitney-Wilcoxon test.

18.5.2 Mann-Whitney-Wilcoxon Test

Putting the data into two columns PT-W and PT-S in a MINITAB worksheet, and carrying out the test as illustrated earlier, yields the following results:
Mann-Whitney Test and CI: PT-W, PT-S
        N  Median
PT-W  100   60.01
PT-S  100  110.56
Point estimate for ETA1-ETA2 is -48.32
95.0 Percent CI for ETA1-ETA2 is (-59.24,-39.05)
W = 6304.0
Test of ETA1 = ETA2 vs ETA1 not = ETA2 is significant at 0.0000
Since the significance level for the test (the p-value) is zero to four decimal places, the conclusion is that there is in fact a significant difference between the activities of these neurons.
A two-sample t-test may have led us to the same conclusion, but this would have been due more to the fortunate circumstance that the differences between the two activities are quite pronounced.
It is left as an exercise to the reader to provide some interpretations of what the estimated gamma model parameters might mean in terms of the phenomena underlying the generation of action potentials in these pyramidal tract neurons of the monkey.


[Figure 18.2 appears here: gamma probability plots with 95% CI. Top: Probability Plot of PT-W (Shape 7.002, Scale 9.094, N 100, AD 0.255, P-Value >0.250); Bottom: Probability Plot of PT-S (Shape 7.548, Scale 15.61, N 100, AD 0.477, P-Value 0.245); Percent plotted against the ISI data.]

FIGURE 18.2: Probability plots of interspike intervals data with postulated gamma model and Anderson-Darling test for the pyramidal tract cell of a monkey. Top panel: when awake (PT-W); Bottom panel: when asleep (PT-S). The p-values for the A-D tests indicate no evidence to reject the null hypothesis.

18.6 Summary and Conclusions

We have presented in this chapter some of the important nonparametric alternatives to consider when the assumptions underlying standard (parametric) hypothesis tests are not valid. These nonparametric techniques are applicable generally because they impose few, if any, restrictions on the data. They are known as distribution-free methods because they do not rely on any specific distributional characterization of the underlying population. As a result, the primary variable of nonparametric statistical analysis is the rank sum, and for good reason: regardless of the underlying population from which they arose, all data, even qualitative data (so long as it is ordinal), can be ranked, and the appropriate sums of such ranks provide valuable information about how the data set is distributed, without having to assume a specific functional form for the population's distribution.
As summarized in Table 18.3, we have focused specifically on:
• The sign test: for comparing a single continuous population median, η, to a postulated value, η0. A nonparametric alternative to the one-sample z- or t-test, it is based on the sign of the deviation of the sample data from the postulated median. The test statistic, T+, the total number of plus signs, is a binomially distributed Bi(n, 0.5) random variable if the null hypothesis is true.

• The Wilcoxon signed rank test: a more powerful version of the sign test, because it takes both the magnitude and the sign of the deviation from the postulated median into consideration; but it is restricted to symmetric distributions. The test statistic, W+, is the positive rank sum; its sampling distribution is best obtained numerically.

• The Mann-Whitney-Wilcoxon (MWW) test: for comparing the medians of two independent populations with identical (but potentially shifted apart) continuous distributions; it is the nonparametric equivalent of the two-sample t-test. The test statistic, U, is based on W1 and W2, the sums of the ranks in each sample.
The Kolmogorov-Smirnov (K-S) and Anderson-Darling (A-D) tests were also presented briefly for probability model validation, the latter being an improved version of the former. There are other nonparametric methods, of course, especially methods for nonparametric regression (e.g., Spearman's rank correlation), but space limitations prevent us from discussing them all. The interested reader is referred, for example, to Gibbons and Chakraborti (2003)³. The Kruskal-Wallis test, a nonparametric analog to the F-test, will be mentioned briefly at the appropriate place in Chapter 19.
It should be kept in mind, finally, that what makes the nonparametric tests versatile can also be a weakness. When underlying distributional information is available, especially when the Gaussian assumption is valid, it can be shown that nonparametric tests are not nearly as powerful: in order to draw conclusions with the same degree of confidence, larger samples will be required. Thus, whenever populations are reasonably normal, parametric tests are preferred. Some of the end-of-chapter exercises and problems illustrate this point.
This concludes the discussion of hypothesis testing begun in Chapter 15; and, in conjunction with the discussion of estimation (Chapters 13, 14, and 16) and of model validation (here and in Chapter 17), our treatment of what is generally considered statistical inference is complete. However, there remains one more important issue: how to collect data such that the sample upon which statistical inference is to be based is as informative as possible. This is the focus of Chapter 19.

REVIEW QUESTIONS
1. What is the objective of this chapter?
2. Why are the techniques discussed in this chapter known as distribution-free methods?
3. What are the two broad classes of problems for which classical hypothesis tests
of Chapter 15 are not applicable?
4. What are some of the advantages and disadvantages of non-parametric techniques?
5. What is the one-sample sign test?
6. What does it mean that a tie has occurred in a sign test, and what is done
under such circumstances?
7. What is the test statistic used for the one-sample sign test, and what is its sampling distribution?
8. What is the large sample limiting test statistic for the one-sample sign test?
³Gibbons, J. D., and S. Chakraborti (2003). Nonparametric Statistical Inference, 4th Ed. CRC Press.

TABLE 18.3: Summary of Selected Nonparametric Tests and their Characteristics

Population Parameter        Test               Restrictions     Test Statistic             Parametric
(Null Hypothesis, H0)                                                                      Alternative
η; (H0: η = η0)             Sign test          None             T+, total # of positive    One-sample z-test;
Median                                                          deviations from η0         one-sample t-test
η; (H0: η = η0)             Wilcoxon signed    Continuous       W+, positive rank sum      Same as above
Median                      rank (WSR) test    symmetric
                                               distributions
δ = η1 − η2; (H0: δ = δ0)   Sign test or       Same as above    Same as above              Same as above
(Paired)                    WSR test
δ = η1 − η2; (H0: δ = δ0)   Mann-Whitney-      Independent      W1, rank sum 1;            Two-sample t-test
(General)                   Wilcoxon (MWW)     distributions    W2, rank sum 2;
                            test                                U = f(W1, W2),
                                                                see Eq (18.13)


9. The one-sample sign test is the nonparametric alternative to what parametric test?
10. Apart from being distribution-free, what is another important distinction between the one-sample sign test and its parametric counterparts?
11. When is the one-sample sign test more appropriate than the parametric alternatives? When will the parametric alternatives perform better?
12. What is the one-sample Wilcoxon signed rank test, and what differentiates it from the one-sample sign test?
13. What is the restriction on the Wilcoxon signed rank test?
14. What is a signed rank and how is it obtained?
15. What are the various test statistics that are used for the Wilcoxon signed rank
test? In particular, what is W + , the test statistic used in MINITAB and in this
chapter?
16. For a sample of size n, what are the minimum and maximum attainable values
for the test statistic, W + ?
17. What is the large sample approximation to the sampling distribution of W + ?
Why is this approximation not recommended?
18. As a result of the symmetric distribution restriction, can the Wilcoxon signed
rank test be considered as a direct nonparametric alternative to the z- or t-test?
19. For what symmetric distributions is the Wilcoxon signed rank test particularly
useful?
20. Why are no additional considerations needed for nonparametric two-sample
paired tests?
21. What is the Mann-Whitney-Wilcoxon test?
22. What test statistics are involved in the Mann-Whitney-Wilcoxon test?
23. Why is it especially important to rely on the computer for the Mann-Whitney-Wilcoxon test?
24. The Mann-Whitney-Wilcoxon test is the nonparametric alternative to which
parametric test? Under what conditions will one be better than the other?
25. What is the Kolmogorov-Smirnov test used for?
26. On what is the Kolmogorov-Smirnov test based?


27. What are the null and alternative hypotheses in the Kolmogorov-Smirnov test?
28. What is the primary distinguishing feature of the sampling distribution of the
Kolmogorov-Smirnov test statistic?
29. The Kolmogorov-Smirnov test is a non-parametric alternative to which parametric test?
30. What are two primary limitations of the Kolmogorov-Smirnov test?
31. How is the Anderson-Darling test related to the Kolmogorov-Smirnov test?
32. The Anderson-Darling test is generally considered to be a better alternative to
which tests?

EXERCISES
18.1 In the table below, X1 is a random sample of 20 observations from an exponential population with parameter β = 1.44, so that the median, η ≈ 1. X2 is the same data set plus a constant, 0.6, and random Gaussian noise with mean μ = 0 and standard deviation σ = 0.15.
X1        X2        X1        X2
1.26968   1.91282   1.52232   2.17989
0.28875   1.13591   1.45313   2.11117
0.07812   0.72515   0.65984   1.45181
0.45664   1.19141   1.60555   2.45986
0.68026   1.34322   0.08525   0.43390
2.64165   3.18219   0.03254   0.76736
0.21319   0.88740   0.75033   1.16390
2.11448   2.68491   1.34203   2.01198
1.43462   2.16498   1.25397   1.80569
2.29095   2.84725   3.16319   3.77947

(i) Consider the postulate that the median for both data sets is η0 = 1 (which, of course, is not true). Generate a table of signs, Dmi, of deviations from this postulated median.
(ii) For each data set, determine the test statistic, T+. What percentage of the observations in each data set has plus signs? Informally, what does this indicate about the possibility that the true median in each case is as postulated?
18.2 Refer to Exercise 18.1 and the supplied data.
(i) For each data set, X1 and X2, carry out a formal sign test of the hypothesis H0: η = 1 against Ha: η ≠ 1. Interpret your results.
(ii) Now use only the first 10 observations (on the left) in each data set. Repeat (i) above. Interpret your results. Since we know that the two data sets are different, and that the median for X2 is higher by 0.6 (on average), comment on the effect of sample size on this particular sign test.


18.3 Refer to Exercise 18.1 and the supplied data.
(i) It is known that the distribution of the difference of two exponentially distributed random variables is the symmetric Laplace distribution (see Exercise 9.3). This suggests that the one-sample Wilcoxon signed rank test, which is not applicable to either X1 or X2, being samples from non-symmetric distributions, is applicable to D = X1 − X2. Carry out a one-sample Wilcoxon signed rank test on D to test the null hypothesis H0: ηD = 0, where ηD is the median of D, versus the alternative Ha: ηD ≠ 0. Interpret your result in light of what we know about how the two data sets were generated.
(ii) Carry out a Mann-Whitney-Wilcoxon (MWW) two-sample test directly on the two samples X1 and X2. Discuss the difference between the results obtained here and the ones obtained in (i).
(iii) How are the results obtained in this exercise reminiscent of the difference between the parametric standard two-sample t-test and a paired t-test discussed in Chapter 15?
18.4 In Exercise 17.4, the two data sets in the table below were presented as random samples from two independent lognormal distributions; specifically, XL1 ~ L(0, 0.25) and XL2 ~ L(0.25, 0.25).
(i) From these postulated theoretical distribution characteristics, determine the median of each population: η01 for XL1 and η02 for XL2. Carry out a one-sample sign test that the median of the sample data XL1 is η01, versus the alternative that it is not. Also carry out a one-sample sign test, this time that the median of the sample data XL2 is also η01, versus the alternative that it is not. Is this test able to detect that XL2 does not have the same median as XL1?
(ii) Carry out an appropriate log transformation of the data to obtain Y1 and Y2, respectively, from XL1 and XL2. Determine the theoretical means and variances for the postulated populations of the log-transformed data, respectively (μ1, σ1²) and (μ2, σ2²). Carry out a one-sample z-test that the mean of Y1 is the just-determined μ1 value, versus the alternative that it is not. Also carry out a second one-sample z-test, this time that the mean of Y2 is the same as the just-determined μ1 value, versus the alternative that it is not.
(iii) Comment on any differences observed between the results of the sign test on the raw data and the z-test on the transformed data.
XL1       XL2
0.81693   1.61889
0.96201   1.15897
1.03327   1.17163
0.84046   1.09065
1.06731   1.27686
1.34118   0.91838
0.77619   1.45123
1.14027   1.47800
1.27021   2.16068
1.69466   1.46116

18.5 Refer to Exercise 18.4.
(i) Carry out a MWW test on the equality of the medians of XL1 and XL2, versus the alternative that the medians are different. Interpret your result.
(ii) Carry out a two-sample z-test concerning the equality of the means of the log-transformed variables Y1 and Y2, against the alternative that the means are different. Interpret your results.
(iii) Comment on any differences observed between these two sets of results.
18.6 The data in the table below are from two populations that may or may not be
the same.
XU     YU
0.65   1.01
2.01   1.75
1.80   1.27
1.13   2.48
1.74   2.91
1.36   2.38
1.55   2.79
1.05   1.94
1.55
1.63

(i) Carry out a two-sample t-test to compare the population means. What assumptions are required for this to be a valid test? At the α = 0.05 significance level, what does this result imply about the null hypothesis?
(ii) Carry out a MWW test to determine whether or not the two populations have the same medians. At the α = 0.05 significance level, what does this result imply about the null hypothesis?
(iii) It turns out that the data were generated from two distinct uniform distributions, where the Y distribution is slightly different. Which test, the parametric or the nonparametric, is more effective in this case? Offer some reasons for the observed performance of one test versus the other.
(iv) In light of what was specified about the two populations in (iii), and the p-values associated with each test, comment on the use of α = 0.05 as an absolute arbiter of significance.
18.7 In an opinion survey on a particular political referendum, fifteen randomly sampled individuals were asked to give their individual opinions using the following options:
1 = Strongly agree; 2 = Agree; 3 = Indifferent; 4 = Disagree; 5 = Strongly disagree.
The result is the data set, S15:

S15 = {1, 4, 5, 3, 2, 3, 3, 2, 2, 2, 1, 3, 3, 3, 3}

Carry out a one-sample sign test to confirm or refute the allegation that the population from which the 15 respondents were sampled has a median response of 3, i.e., is indifferent to the referendum in question.
18.8 The following is a set of residuals from a two-parameter regression model:

0.97   0.12   0.27   0.43   0.17   0.10   0.58   0.96   0.04   0.54
0.68   0.04   0.33   0.04   0.34   0.85   0.12   0.14   0.49   0.09

(i) Carry out a K-S test of normality and compare it with an A-D test of normality. What are the associated p-values, and what do these imply about the normality of these residuals?
(ii) It turns out that the residuals shown above had been filtered by removing four observations that appeared to be outliers: {1.30, 1.75, 2.10, 1.55}. Reinstate these residuals and repeat (i). Does this change any of the conclusions about the normality of the residuals?
18.9 The following data is from a population that may or may not be normally
distributed.
0.88   2.06   6.86   1.42   2.42   0.29   0.74   0.91   2.30   0.32
0.74   0.30   1.06   3.08   1.12   0.63   0.58   0.48   3.15   6.70

Carry out a K-S test of normality and compare the results with an A-D test of normality. At the α = 0.05 significance level, what does each of these tests imply about the normality of the data set? Examine the data set carefully and comment on which of these tests is more likely to lead to the correct decision.

APPLICATION PROBLEMS
18.10 In Problem 15.59, hypothesis tests were devised to ascertain whether or not, out of three operators, A, B, and C, working in a toll manufacturing facility, operator A could be deemed to be more safety conscious. The data below, showing the time in months between occurrences of safety violations for each operator, was to be used to test these hypotheses.
A   1.31   0.15   3.02   3.17   4.84   0.71   0.70   1.41   2.68   0.68
B   1.94   3.21   2.91   1.66   1.51   0.30   0.05   1.62   6.75   1.29
C   0.79   1.22   0.65   3.90   0.18   0.57   7.26   0.43   0.96   3.76

Unfortunately, the random variable in question is exponentially distributed; the sample size of 10 is considerably smaller than is required for a normal approximation to be valid for the sampling distribution of the sample mean; and therefore no standard hypothesis test could be used.
Convert the data to a form that will be appropriate for carrying out a one-sample Wilcoxon signed rank test to compare operator A to operator B, and a second one-sample Wilcoxon signed rank test to compare operator A to operator C. Interpret your results. Can operator A truly be said to be more safety conscious than either B or C? If yes, at what significance level?
18.11 Refer to Problem 18.10.
(i) Carry out a two-sample t-test to compare the mean time between occurrences of safety violations for operator A to that for operator B; carry out a second two-sample t-test comparing the mean for operator A to that for operator C. Are these valid tests?
(ii) Carry out a MWW test to compare the safety performance of operator A to that of operator B, and a second MWW test to compare operator A to operator C. Are these valid tests? Interpret your results and compare them to the results in (i). What conclusions can you reach about these operators and how safety conscious B and C are in comparison to A?
18.12 The Philadelphia Eagles, like every other team in the National Football League (NFL) in the US, play 8 games at home and 8 games away in each 16-game season. The table below shows the total number of points scored by this NFL team at home and away during the 2008/2009 season.

Total Points Scored, 2008/2009 Season
Home  38  15  17  27  31  48  30  44
Away  37  20  40  26  13   7  20   3

(i) Generate side-by-side box-plots of the two data sets and comment on what these plots show about the potential difference between the number of points scored at home and away.
(ii) Use the two-sample t-test to compare the mean offensive productivity at home versus away. What assumptions are necessary for this to be a valid test? Are these assumptions reasonable? Interpret your result.
(iii) Next, carry out a MWW test. Interpret your result.
(iv) Allowing for the fact that there are only 8 observations for each category, discuss your personal opinion of what the data set implies about the difference in offensive output at home versus away, vis-à-vis the results of the formal tests.
18.13 Refer to Problem 18.12. This time, the table below shows the point differential (points scored by the Philadelphia Eagles minus points scored by the opponent) at home and away during the 2008/2009 season. Some consider this metric a better measure of ultimate team performance (obviously, a negative point differential corresponds to a loss, a positive differential to a win, and a zero to a tie).

Point Differential, 2008/2009 Season
Home   35    9   −6   13   −5   28   20   38
Away   −4   −4   14   19    0  −29    6   −7

(i) Generate side-by-side box-plots of the two data sets and comment on what this plot shows about the potential difference between the team's performance at home and away.
(ii) Carry out a two-sample t-test to compare the team's performance at home versus away. What assumptions are necessary for this to be a valid test? Are these assumptions reasonable? Interpret your results.
(iii) Next, carry out a MWW test. Interpret your result.
(iv) Again, allowing for the fact that there are only 8 observations in each case, discuss your personal opinion about the difference in team performance at home versus away, vis-à-vis the results of the formal tests.
18.14 The table below shows the result of a market survey involving 15 women and 15 men who were asked to taste a name-brand diet cola drink and compare the taste to that of a generic brand that is cheaper but whose taste, as claimed by the manufacturer, is preferred by a majority of tasters. The options given to the participants are as follows: 1 = Strongly prefer generic cola; 2 = Prefer generic cola; 3 = Indifferent; 4 = Prefer name-brand cola; 5 = Strongly prefer name-brand cola.
Perform appropriate tests to validate the claims that (i) men are mostly indifferent, showing no preference one way or another (which is a positive result for the generic cola manufacturer); and (ii) there is no difference between women and men in their preferences for the diet cola brands. Is there evidence in the data to support one, both, or none of these claims?
Women   Men
4       1
3       3
5       3
5       2
2       4
3       5
1       1
4       2
4       4
5       3
4       3
3       2
4       1
3       3
4       3

18.15 Random samples of size 10 each are taken from large groups of trainees instructed by Method A and Method B, and each trainee's score on an appropriate achievement test is shown below.
Method A   71   75   65   69   73   66   68   71   74   68
Method B   72   77   84   78   69   70   77   73   65   75
In Example 15.6, the data sets were assumed to come from normal populations, and a two-sample t-test was conducted to test the claim that Method B is more efficient. As an alternative to that parametric test, carry out a corresponding MWW test. Interpret your result and compare it to that in Example 15.6. Is there a difference in these results? Comment on which test you would consider more reliable, and why.
18.16 Tanaka-Yamawaki (2003)⁴ presented models of high-resolution financial time series which showed, among other things, that price fluctuations tend to follow the Cauchy distribution, not the Gaussian distribution as usually presumed. The following table shows a particular sequence of similar price fluctuations.

⁴Tanaka-Yamawaki, M. (2003). "Two-phase oscillatory patterns in a positive feedback agent model." Physica A, 324, 380-387.

0.003322   0.000637   0.003569   0.032565
0.000856   0.002606   0.003797   0.001522
0.010382   0.061254   0.261127   0.032485
0.011494   0.004949   0.005694   0.034964
0.012165   0.008889   0.023339   0.009220

(i) Obtain a histogram and a box plot of this data. Interpret these plots vis-à-vis the usual normality assumption for data of this sort.
(ii) Carry out a K-S test and also an A-D test of normality on this data. What can you conclude about the normality of this data set?
(iii) The data entry 0.261127, although real, is what most might refer to as an "outlier," which will then be removed. Remove this data point and repeat (i) and (ii). Comment on the influence, if any, of this point on your analysis.
18.17 The number of accidents occurring per quarter (three months) at a DuPont company facility, over a 10-year period, is shown in the table below⁵, partitioned into two periods: Period I for the first five-year period of the study; Period II, the second five-year period.

Period I           Period II
5   5  10   8      3   4   2   0
4   5   7   3      1   3   2   2
2   8   6   9      7   7   1   4
5   6   5  10      1   2   2   1
6   3   3  10      4   4   4   4

The phenomenon in question is clearly Poisson, so that it is not exactly valid to consider the data as samples from a normal population (even though, with a sample size of 20 each, the distribution of the sample mean may be approximately Gaussian).
Carry out a MWW test to determine whether, by Period II, there had been a significant improvement in the number of accidents occurring at this facility. Interpret your results adequately. Strictly speaking, is this a valid application of the MWW test? Why or why not?

⁵Lucas, J. M. (1985). "Counted Data CUSUMs," Technometrics, 27, 129-144.
18.18 A limousine company that operates in a large metropolitan city in the Northeast of the United States has traditionally used in its fleet only two brands of American-made luxury cars. (To protect the manufacturers, we shall refer to these simply as brands A and B.) For various reasons, the company wishes to consolidate its operations and use only one brand. The decision is to be based on which brand experienced the fewest number of breakdowns over the immediately preceding five-year period. Seven cars per brand were selected randomly from the fleet, and from their maintenance records over this period, the following information was gathered on the number of breakdowns experienced by each car. Carry out an appropriate analysis of the data and recommend, objectively, which car brand, A or B, the company should select.

Total # of Breakdowns
Brand A   Brand B
11         9
 9        12
12        10
14         8
13         4
12        10
11         8

18.19 In Problem 15.52 (see also Problems 1.13 and 14.42), the data in the following table was analyzed to test a hypothesis about the mean time-to-publication for papers sent to a particular leading chemical engineering research journal. The data shows the time (in months) from receipt to publication of 85 papers published in the January 2004 issue of this journal.
(i) On phenomenological grounds, and from past experience, a gamma distribution has been postulated for the population from which the data was sampled. The population parameters are unavailable. Carry out an A-D test to validate this probability model postulate.
(ii) In recognition of the fact that the underlying distribution is obviously skewed, the Editor-in-Chief has modified his hypothesis about the characteristic time-to-publication, and now proposes that the median time-to-publication is 8 months. Carry out an appropriate test to assess the validity of this statement against the alternative that the median time-to-publication is not 8 months. Interpret your results.
(iii) Repeat the test in (ii), this time against the alternative that the median time-to-publication is longer than 8 months. Reconcile this result with the one in (ii).
19.2   15.1    9.6    4.2    5.4
 9.0    5.3   12.9    4.2   15.2
17.2   12.0   17.3    7.8    8.0
 8.2    3.0    6.0    9.5   11.7
 4.5   18.5   24.3    3.9   17.2
13.5    5.8   21.3    8.7    4.0
20.7    6.8   19.3    5.9    3.8
 7.9   14.5    2.5    5.3    7.4
19.5    3.3    9.1    1.8    5.3
 8.8   11.1    8.1   10.1   10.6
18.7   16.4    9.8   10.0   15.2
 7.4    7.3   15.4   18.7   11.5
 9.7    7.4   15.7    5.6    5.9
13.7    7.3    8.2    3.3   20.1
 8.1    5.2    8.8    7.3   12.2
 8.4   10.2    7.2   11.3   12.0
10.8    3.1   12.8    2.9    8.8

18.20 Many intrinsic characteristics of single crystals affect the intensities of X-rays diffracted from them in different directions. The statistical distribution of these intensity measurements therefore provides a means by which to characterize these crystals. Unfortunately, because of their distinctive "heavy tails," these distributions are not adequately described by traditional default Gaussian distributions, which is why Cauchy distributions have been finding increasing application in crystallographic statistics. (See, for example, Mitra and Sabita (1989)⁶ and (1992)⁷.)
The table below is extracted from a larger sample of X-ray intensity measurements (arbitrary units) for two crystals that are very similar in structure: one, labeled A, is natural; the other, B, synthetic. The synthetic crystal is being touted as a replacement for the natural one.
X-ray intensities (AU)
XRDA        XRDB
104.653     132.973
106.115     114.505
104.716     115.735
104.040     114.209
105.631     114.440
104.825     100.344
117.075     116.067
105.143     114.786
105.417     115.015
 98.574      99.537
105.327     115.025
104.877     115.120
105.637     107.612
105.305     116.595
104.291     114.828
100.873     114.643
106.760     113.945
105.594     113.974
105.600     113.898
105.211     125.926
105.559     114.952
105.583     117.101
105.357     113.825
104.530     117.748
101.097     116.669
105.381     114.547
104.528     113.829
105.699     115.264
105.291     116.897
105.460     113.026
(i) Carry out A-D normality tests on each data set and confirm that these distributions, even though apparently symmetric, are not Gaussian. Their heavy tails imply that the Cauchy distribution may be more appropriate.
(ii) The signature characteristic of the Cauchy distribution, that none of its moments exists, has a serious implication for statistical inference: unlike other non-pathological distributions, for which the sampling distribution of the mean narrows with increasing sample size, the distribution of the mean of a Cauchy sample is precisely the same as the original mother distribution. As a result, none of the standard parametric tests can be used for samples from Cauchy (and Cauchy-like) distributions.
From the sample data provided here, carry out an appropriate nonparametric test to ascertain the validity of the proposition that the synthetic crystal B is the same as the natural crystal A, strictly on the basis of the X-ray intensity measurement.

6 Mitra, G.B. and D. Sabita (1989). Cauchy Distribution, Intensity Statistics and Phases of Reflections from Crystal Planes. Acta Crystallographica A, 45, 314-319.
7 Mitra, G.B. and D. Sabita (1992). Cauchy distribution of X-ray intensities: a note on hypercentric probability distributions of normalized structure factors. Indian Journal of Physics, 66 A (3), 375-378.
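The cited property of the Cauchy distribution is easy to visualize by simulation. The following minimal sketch (Python with NumPy assumed; illustrative only) shows that the spread of the sample mean of Cauchy data does not shrink as the sample size grows:

    # Hedged sketch: the sample mean of Cauchy data is itself Cauchy, so
    # its spread (here, the interquartile range across 2000 replicate
    # samples) does not shrink with sample size n.
    import numpy as np

    rng = np.random.default_rng(1)
    for n in (10, 100, 10_000):
        means = rng.standard_cauchy(size=(2_000, n)).mean(axis=1)
        q1, q3 = np.percentile(means, [25, 75])
        print(f"n = {n:6d}: IQR of sample means = {q3 - q1:.2f}")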

Chapter 19
Design of Experiments

19.1 Introductory Concepts
  19.1.1 Experimental Studies and Design
  19.1.2 Phases of Efficient Experimental Studies
  19.1.3 Problem Definition and Terminology
19.2 Analysis of Variance
19.3 Single Factor Experiments
  19.3.1 One-Way Classification
    Postulated Model and Hypotheses
    Data Layout and Experimental Design
    Analysis
    Fixed, Random, and Mixed Effects
    Summary Characteristics
  19.3.2 Kruskal-Wallis Nonparametric Test
  19.3.3 Two-Way Classification
    The Randomized Complete Block Design
    Postulated Model and Hypotheses
    Data Layout and Experimental Design
    Analysis
  19.3.4 Other Extensions
19.4 Two-Factor Experiments
    Postulated Model and Hypotheses
    Data Layout and Experimental Design
    Analysis
19.5 General Multi-factor Experiments
19.6 2^k Factorial Experiments and Design
  19.6.1 Overview
    Notation and Terminology
    Characteristics
  19.6.2 Design and Analysis
  19.6.3 Procedure
    Sample Size Considerations
  19.6.4 Closing Remarks
19.7 Screening Designs: Fractional Factorial
  19.7.1 Rationale
  19.7.2 Illustrating the Mechanics
  19.7.3 General Characteristics
    Notation and Alias Structure
    Design Resolution
  19.7.4 Design and Analysis
    Basic Principles
    Projection and Folding
  19.7.5 A Practical Illustrative Example
    Problem Statement
    Design and Data Collection
    Analysis Part I
    Analysis Part II: Projection
19.8 Screening Designs: Plackett-Burman
  19.8.1 Primary Characteristics
  19.8.2 Design and Analysis
19.9 Response Surface Designs
  19.9.1 Characteristics
  19.9.2 Response Surface Designs
  19.9.3 Design and Analysis
19.10 Introduction to Optimal Designs
  19.10.1 Background
  19.10.2 Alphabetic Optimal Designs
19.11 Summary and Conclusions
REVIEW QUESTIONS
EXERCISES
APPLICATION PROBLEMS

We may have three main objects in the study of truth:
first, to find it when we are seeking it;
second, to demonstrate it after we have found it;
third, to distinguish it from error by examining it.
Blaise Pascal (1623-1662)

Every experimental investigation presupposes that the sought-after information is contained in the acquired data sets. Objectively, however, such a presupposition is often unjustifiable. Without giving purposeful, deliberate and careful consideration to data collection, the information content of the acquired data set cannot be guaranteed. Because experimental data will always be finite samples drawn from the much larger population of interest, conclusions drawn about the population will be valid only if the sample is representative. And the sample will be representative only if it encapsulates relevant population information appropriately. These issues have serious consequences. If the sought-after information is not contained in the data, no analysis technique, no matter how sophisticated, will be able to extract it. Therefore, how data sets are obtained will affect not only the information they contain, but also our ability to extract and use this information.
This chapter focusses on how experiments can be designed to ensure that acquired data sets are as informative as possible, and on the analysis procedures that are jointly calibrated with the experimental designs to facilitate the extraction of such information. In recognition of the fact that several excellent book-length treatises exist on the subject matter of design of experiments, this chapter is designed to be only an introduction to some of the most commonly applied techniques, emphasizing principles and, where possible, illustrating practice with relevant examples. Because of the role now played by computer software, our discussion deemphasizes the old-fashioned mechanics of manual computations. This allows us to focus on the essentials of experimental designs and on interpreting the results produced by computer programs.

19.1 Introductory Concepts

19.1.1 Experimental Studies and Design

In this chapter, the term experimental studies is used in a restricted sense to describe those investigations involving the deliberate manipulation of independent variables, x1, x2, ..., xk, and observing, by measurement, the response of the dependent variables, Y1, Y2, ..., Yn. This is in direct contrast to observational studies, where the experimenter is a passive observer not directly involved in selecting the values of the independent variables which may (or may not) have been responsible for the observed responses. For example, the earliest studies that led to the hypothesis that smoking may cause lung cancer were based on the observation of higher incidences of lung cancer in smokers compared to non-smokers; the hypothesis was not (and, ethically, could not have been) based on experimental studies where, over a pre-specified period of time, a select group of participants was assigned cigarettes to smoke and the rest were not to smoke. The data were simply observed, not controlled. On the other hand, in modern clinical studies, the number of groups involved in the study, the assignment of treatment protocols to each group, the collection of data, and every other aspect of the study are firmly under the control of the experimenters. Observational studies and how to analyze their results are important; but these will not be discussed in this chapter.
To reiterate, the key distinguishing characteristic of the experimental studies of concern in this chapter is that the experimenter is assumed to have control over setting the values of the x variables at which the experiments are performed. The nature of these studies therefore involves:
1. Setting the values of x where data on Y is to be acquired; and then
2. Transforming the acquired data (via appropriate analysis and interpretation of the results) to understanding, as captured via a mathematical representation of the effect of the x variables on the Y variables.
3. Such understanding and mathematical representations are then typically used for various applications, ranging from general analysis and prediction to engineered system design, improved system operation, control, and optimization.
The two basic tasks involved in these experimental studies are, therefore:
1. Systematic data acquisition (to ensure informative data sets); and
2. Data transformation to understanding via efficient information extraction.
But carrying out these tasks effectively is complicated by random variation because, associated with every observation, yi, is an unavoidable and unpredictable fluctuation, εi, masking the true value, η; i.e.,

    yi = η + εi                                                    (19.1)

As far as the first task is concerned, the key consequence of Eq (19.1) is that for acquired data to be informative, one must find a way to maximize the effect of the independent variables on y and minimize the influence of the random component. With the second task, the repercussion of Eq (19.1) is that efficient data analysis requires using the appropriate concepts of probability and statistics presented in earlier chapters. Most revealing, however, is how much Task 1 influences the success of experimental studies: information not contained in the data cannot be extracted even by the most sophisticated analysis.
Statistical Design of Experiments enables the efficient conduct of experimental studies by providing a formal strategy of experimentation and the corresponding analysis technique, jointly calibrated for optimal extraction of information in the face of unavoidable random variability. This is especially important when resources, financial and otherwise, are limited. The techniques to be discussed in this chapter allow the acquisition of richly informative data with the fewest possible experimental runs. Depending on the goals of the investigation (whether it is screening for which independent variables matter the most, developing a quantitative model, finding optimum operating conditions, confirming model predictions, etc.), there are appropriate experimental designs specifically calibrated to the task at hand.

19.1.2 Phases of Efficient Experimental Studies

Experimental studies that deliver the most benefit for the expended effort are carried out in distinct and identifiable phases:

1. Planning: where the scope and goals of the experimental study are clearly defined;
2. Design: where the strategy of experimentation that best fits the goals is determined, and the explicit procedure for data gathering is specified;
3. Implementation: which involves mostly disciplined execution of the design strategy and the careful collection of the data;
4. Analysis: where the appropriate statistics are computed and the results are interpreted.

This chapter is concerned primarily with design and analysis, with the assumption that the reader/practitioner has been diligent in planning and in implementation.
The explicit layout of specific designs, including the generation of complex designs and the specification of such characteristics as alias structures (see later), is now routinely carried out by computer software packages. This is in addition to the computer's traditional role in relieving the practitioner of the burden of the tedious computations involved in data analysis, and in carrying out diagnostic tests. However, while the computer will assist with the design of the experiments and with the analysis of the acquired data (it might even record the data directly, in some instances), it will not absolve the practitioner of the responsibility to think systematically and to evaluate the results judiciously.

19.1.3 Problem Definition and Terminology

The general problem at hand involves testing for the existence of real effects of k independent variables, x1, x2, ..., xk, on possibly several dependent variables. Here are some examples: investigations into the effect of pH (x1), salt concentration (x2), salt type (x3), and temperature (x4), on Y, the solubility of a particular globular protein; or whether or not different kinds of fertilizers (x1: Type I or II), farm location (x2: A, B or C), and grades of seed (x3: Premium, P; Regular, R; or Hybrid, H), show any significant effects on Y, the yield (in bushels per acre) of soybeans.
The following terminology is standard.
1. Response, Y: the dependent variable; the objective of the experiment is to measure this response and determine what contributes to its observed value;
2. Factors: the independent variables whose effects on the response are to be determined;
3. Level: the (possibly qualitative) value of the factor employed in the experimental determination of a response; e.g., fertilizer type above has two levels, I and II; farm location has three levels, A, B or C; seed grade has three levels, P, R, or H.
4. Treatment: the various factor-level combinations employed in the experiment; e.g., in the example given above, (Fertilizer I, Farm Location A, Seed Grade P) constitutes one treatment. This example has 2 x 3 x 3 = 18 total possible treatments (see the enumeration sketch below). For a single factor experiment, the levels of this factor are the same as the treatments, by default.
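The treatment count for the fertilizer example is easy to verify by direct enumeration; a minimal sketch (Python assumed; the variable names are illustrative only):

    # Hedged sketch: enumerating the factor-level combinations (treatments)
    # of the fertilizer example; there are 2 x 3 x 3 = 18 of them.
    from itertools import product

    fertilizer = ['I', 'II']            # 2 levels
    location = ['A', 'B', 'C']          # 3 levels
    seed_grade = ['P', 'R', 'H']        # 3 levels

    treatments = list(product(fertilizer, location, seed_grade))
    print(len(treatments))              # -> 18
    print(treatments[0])                # -> ('I', 'A', 'P')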

19.2 Analysis of Variance

Let τ1, τ2, ..., τk represent the individual effects of the k factors on the response variable, Y. Then, at one level, the objective of most experimental studies, from the simplest to the most complicated, can be summed up as an attempt to test the hypothesis that τ1 = τ2 = ... = τk = 0 simultaneously; i.e., we presume that these k factors are all alike in having no effect whatsoever on the response Y, ascribing non-zero effects only if there is sufficient evidence to disprove this null hypothesis.
When k > 2, this problem is an extension of the testing of the equality of two means discussed in Chapter 15. Because of the application of the two-sample t-test to those problems discussed in Chapter 15, one might be tempted to consider multiple pair-wise t-tests as a way to handle this current problem of comparing k means simultaneously. However, such multiple pair-wise comparisons are subject to too high an α-risk. This is because, even when sampling from the same distribution, it can be shown that with multiple pairs there is a high probability of obtaining, by pure chance alone, differences that are too large.
One effective method for testing hypotheses about the equality of k population means simultaneously is ANOVA, the analysis of variance technique that made a cameo appearance, in a restricted sense, in Chapter 16 while we were investigating the significance of the regression model. As we saw briefly during that discussion, ANOVA is predicated on the orthogonal decomposition of the total variation observed in Y into various constituent components. In the more general treatment, the constituent components are dictated by the problem at hand, which is why with regression the constituent parts were SSR, due to regression, and SSE, the left-over residual error sum of squares, as dictated by the regression problem. The contributions of the various components to the total variation in Y are then tested for significance. Any indication of significance will then suggest non-equality of the means.
The two central assumptions in ANOVA are as follows:

1. Normality and Equal Variance: for each treatment, each response, Y, is a random variable that has a normal distribution with the same (but unknown) variance σ². Consequently, factors affect only the mean of Y, not its variance.
2. Independence: the observations of Y are independent for each treatment.

The principles behind ANOVA lie at the heart of all the computations involved in experimental designs, with the nature of the problem at hand naturally determining the design and how the data is analyzed.
The rest of the chapter is now devoted to discussing various experiments and appropriate designs, beginning with the simplest, involving a single factor, and building up to more complex experiments involving multiple factors and complex objectives.

19.3 Single Factor Experiments

19.3.1 One-Way Classification

The simplest possible experiment is one involving a single factor and k levels, giving rise to a total of k treatments. In this case, the application of treatment j gives rise to a response Y with mean μj and variance σ², for all j = 1, 2, 3, ..., k.
The data collection is k sets of random samples of size n1, n2, ..., nj, ..., nk drawn respectively from populations 1, 2, ..., j, ..., k. The primary question of interest is as follows:

Do all treatments have the same effect on the response? i.e., is μ1 = μ2 = ... = μj = ... = μk?

As a concrete example, consider standard size Raisin Bran boxes filled on 5 different processing machines, where we are interested in the total number of raisins in a box. More specifically, we are interested in answering the question: Is there any systematic difference in the number of raisins dispensed per box by the machines? Does the single factor, Machine (of which there are five), have an effect on the response, the number of raisins per box?
Let us now suppose that to answer this question, we choose 6 sample boxes from each machine and count the number of raisins in each box. The following is a breakdown of the characteristics of this problem:
Response: Number of raisins per box;
Factor: 1 (Processing Machine);
Levels: 5 (5 different processing machines);
Treatments: 5;
Result: yij, the number of raisins found in box i filled on machine j; this is a realization of the random variable Yij; i = 1, 2, 3, ..., 6; j = 1, 2, ..., 5;
Assumption: Yij ~ N(μj, σ²).

Postulated Model and Hypotheses

For problems of this type, the postulated model is:

    Yij = μj + εij;   i = 1, 2, ..., nj;  j = 1, 2, ..., k          (19.2)

along with the distributional assumption, Yij ~ N(μj, σ²); μj is the mean associated with the jth treatment and εij is the random error. In words, this states that, but for the random component, all observations from the jth treatment are characterized by a constant mean, μj. The associated hypotheses are:

    H0: μ1 = μ2 = ... = μk
    Ha: μℓ ≠ μm for some ℓ and m                                    (19.3)

so that all treatments are hypothesized to be identical, unless there is evidence to the contrary.
If we represent by μ the grand mean of the complete data set, and express the jth treatment mean as:

    μj = μ + τj;   j = 1, 2, ..., k                                 (19.4)

then τj represents the jth treatment effect, the quantity that distinguishes the jth treatment mean from the grand mean. Observe from Eq (19.4) that, by the definition of the grand mean,

    Σ_{j=1}^{k} τj = 0                                              (19.5)

(See Exercise 19.2.) Furthermore, observe that if the jth treatment has no effect on Y, τj = 0. As a result, the expression in Eq (19.2) may be rewritten as:

    Yij = μ + τj + εij;   i = 1, 2, ..., nj;  j = 1, 2, ..., k      (19.6)

and the hypotheses in Eq (19.3) as:

    H0: τ1 = τ2 = ... = τk = 0
    Ha: τℓ ≠ 0 for at least one ℓ                                   (19.7)

Data Layout and Experimental Design

The layout of the data table from such an experiment is shown in Table 19.1. The appropriate experimental design is the completely randomized design, where each observation, yij, is determined in random order. This is the central feature of this design. Conceptually, it ensures that each yij is truly independent of the others, as required by the independence assumption; practically, it has the net impact that any extraneous effects are broken up

TABLE 19.1: Data table for typical single-factor experiment

Treatment
(Factor level)   1       2       3      ...   j       ...   k
                y11     y12     y13     ...  y1j      ...  y1k
                y21     y22     y23     ...  y2j      ...  y2k
                ...     ...     ...          ...           ...
                yn1,1   yn2,2   yn3,3   ...  ynj,j    ...  ynk,k
Total           T1      T2      T3      ...  Tj       ...  Tk
Means           ȳ.1     ȳ.2     ȳ.3     ...  ȳ.j      ...  ȳ.k

and distributed evenly among the observations, not allowed to propagate systematically.
Each treatment is repeated nj times, with the repeated observations known as replicates. The design is said to be balanced if n1 = n2 = ... = nj = ... = nk; otherwise, it is unbalanced. It can be shown that for a fixed total number of experimental observations, N = n1 + n2 + ... + nj + ... + nk, the power of the hypothesis test is maximized for the balanced design.
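The randomized run order that is the central feature of this design can be generated mechanically; here is a minimal sketch (Python with NumPy assumed; k and n are illustrative values) of a balanced, completely randomized run sequence:

    # Hedged sketch: a balanced completely randomized design with k
    # treatments and n replicates each, executed in random order.
    import numpy as np

    k, n = 5, 6                                  # illustrative values
    runs = np.repeat(np.arange(1, k + 1), n)     # treatment label per run
    rng = np.random.default_rng(42)
    print(rng.permutation(runs))                 # randomized run order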
Analysis

We begin with the following definitions of some averages: the treatment average,

    Ȳ.j = (1/nj) Σ_{i=1}^{nj} Yij                                   (19.8)

and the grand average,

    Ȳ.. = (1/N) Σ_{j=1}^{k} Σ_{i=1}^{nj} Yij;   N = Σ_{j=1}^{k} nj  (19.9)

Analyzing the data set in Table 19.1 is predicated on the following data decomposition:

    Yij = Ȳ.. + (Ȳ.j − Ȳ..) + (Yij − Ȳ.j)                           (19.10)

The first term, Ȳ.., is the grand average; the next term, (Ȳ.j − Ȳ..), is the deviation from the grand average due to any treatment effect; and the final term is the purely random, within-treatment deviation. (Compare with Eq (19.6).) The expression in Eq (19.10) is easily rearranged to yield:

    (Yij − Ȳ..) = (Ȳ.j − Ȳ..) + (Yij − Ȳ.j)
        EY     =     ET      +     EE                               (19.11)

its error decomposition version, with the following implications: EY, the vector of deviations of the observations Yij from the grand average, consists of two

components: ET, the component due to any treatment effect, and EE, the pure random error component.

FIGURE 19.1: Graphic illustration of the orthogonal vector decomposition of Eq (19.11).

And now, it turns out that upon taking sums of squares in Eq (19.11), the result is the following sums-of-squares identity (see Exercise 19.3):

    Σ_{j=1}^{k} Σ_{i=1}^{nj} (Yij − Ȳ..)² = Σ_{j=1}^{k} nj (Ȳ.j − Ȳ..)² + Σ_{j=1}^{k} Σ_{i=1}^{nj} (Yij − Ȳ.j)²
                  SSY                =        SST        +        SSE            (19.12)

which, in terms of vector norms and the definition of the error vectors in Eq (19.11), is:

    ||EY||² = ||ET||² + ||EE||²                                     (19.13)

We now observe that the simultaneous validity of Eqs (19.11) and (19.13) implies that the vectors ET and EE constitute an orthogonal decomposition of the vector EY, as illustrated in Fig 19.1, a multidimensional version of Pythagoras' theorem. This, of course, is the ANOVA identity that we encountered briefly in Chapter 16.
The primary implication of this decomposition for analyzing the single-factor experiment is as follows: whether or not H0 is true, it can be shown that:

    E(SSE) = (N − k)σ²                                              (19.14)
    E(SST) = (k − 1)σ² + Σ_{j=1}^{k} nj τj²                          (19.15)

with the following very important consequence: if H0 is true, then, from these two equations, the following mean error sums of squares provide two independent estimates of σ²:

    MSE = SSE/(N − k)                                               (19.16)
    MST = SST/(k − 1)                                               (19.17)

TABLE 19.2: One-Way Classification ANOVA Table

Source of                   Degrees of   Sum of    Mean
Variation                   Freedom      Squares   Square   F
Between Treatments          k − 1        SST       MST      MST/MSE
Within Treatments (Error)   N − k        SSE       MSE
Total                       N − 1        SSY

And now, as a consequence of the normality assumption, the test statistic

    F = MST/MSE                                                     (19.18)

possesses an F(ν1, ν2) distribution, with ν1 = k − 1 and ν2 = N − k, if H0 is true. If H0 is not true, then, as a result of Eq (19.15), the numerator of the F-statistic will be inflated by Σ_{j=1}^{k} nj τj²/(k − 1).
The analysis of single-factor experiments, when carried out using the completely randomized design, is therefore presented in the form of the table shown in Table 19.2, known as a one-way classification ANOVA table. The implied computations are now routinely carried out, and the associated ANOVA tables generated, by computer programs, as illustrated by the next example.
Example 19.1: PERFORMANCE OF RAISIN DISPENSING MACHINES
Consider the problem of standard size Raisin Bran boxes filled on 5 different processing machines, introduced earlier in this chapter. To answer the question posed, Is there any systematic difference in the number of raisins dispensed per box by the machines?, 6 sample boxes were selected at random from each machine, and the number of raisins per box recorded in the table below. Use this data set to answer the question.

Machine 1   Machine 2   Machine 3   Machine 4   Machine 5
   27          17          13           7          15
   21          12           7           4          19
   24          14          11           7          19
   15           7           9           7          24
   33          14          12          12          10
   23          16          18          18          20

Solution:
To use MINITAB, we begin by entering the provided data into a MINITAB worksheet, RAISINS.MTW; the sequence for carrying out the required analysis is Stat > ANOVA > One-Way (Unstacked). The reason for selecting the Unstacked option is that, as presented in this table, the data for each machine is in a different column. This is in contrast to stacking the data for all machines in a single column to which is attached, in another column, the numbers 1, 2, etc., as identifiers for the machine associated with the indicated data.

FIGURE 19.2: Boxplot of the raisins data showing what the ANOVA analysis has confirmed: that there is a significant difference in how the machines dispense raisins.

MINITAB provides several graphical options, including box plots for the data, and probability plots for assessing the normality assumption for the residuals. The MINITAB results are summarized below, beginning with the ANOVA table.
Results for: RAISINS.MTW
One-way ANOVA: Machine 1, Machine 2, Machine 3, Machine 4, Machine 5

Source  DF      SS     MS     F      P
Factor   4   803.0  200.7  9.01  0.000
Error   25   557.2   22.3
Total   29  1360.2

S = 4.721   R-Sq = 59.04%   R-Sq(adj) = 52.48%
The specific value of the F-statistic, 9.01, indicates that the larger of the two independent estimates of the variance, MST, the treatment mean square, is nine times as large as the pure error estimate. It is therefore not surprising that the associated p-value is 0 to three decimal places. We must therefore reject the null hypothesis (at the significance level α = 0.05) and conclude that there is a systematic difference in how each machine dispenses raisins. A boxplot of the data is shown in Fig 19.2 for each machine where, visually, we see that Machines 1 and 5 appear to dispense more raisins than the others.

Let us now recall the postulated model for the single-factor experiment in Eq (19.2); treated like a regression model in which 5 parameters, μj; j = 1, 2, ..., 5 (representing the mean number of raisins dispensed by Machine j), are estimated from the supplied data, MINITAB provides the value of the estimated pure error standard deviation, S = 4.72, as well as the R² and R²adj values. These have precisely the same meaning as in the regular regression problems discussed in Chapter 16. In addition, the validity of the normality assumptions can be assessed by examining the residuals, εij. The normal probability plots for the estimated residuals are shown in Fig 19.3. The top plot is the typical plot obtained directly from the ANOVA dialog in MINITAB; it shows only the residuals and the best normal distribution fit. A visual assessment indicates that the normality assumption seems to be valid. However, for a more rigorous assessment, MINITAB also provides the option of saving the residuals for further analysis. If this is done, and the rigorous probability model goodness-of-fit test is carried out in conjunction with the probability plot, the result is shown in the bottom panel. The p-value associated with the A-D test is quite high (0.81), leading us to conclude that the normality assumption indeed appears to be adequate.
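For readers working outside MINITAB, the same one-way ANOVA can be reproduced with other tools. The following minimal sketch (Python with SciPy assumed; an illustration, not the author's procedure) should reproduce F = 9.01 and the near-zero p-value:

    # Hedged sketch: one-way ANOVA on the raisin data of Example 19.1.
    from scipy import stats

    machines = [
        [27, 21, 24, 15, 33, 23],   # Machine 1
        [17, 12, 14, 7, 14, 16],    # Machine 2
        [13, 7, 11, 9, 12, 18],     # Machine 3
        [7, 4, 7, 7, 12, 18],       # Machine 4
        [15, 19, 19, 24, 10, 20],   # Machine 5
    ]
    f_stat, p_value = stats.f_oneway(*machines)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")   # expect F near 9.01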

Fixed, Random, and Mixed Effects

When the k factor levels (treatments) to be evaluated constitute the complete set of interest, so that the results and conclusions will not be extended to other factor levels that have not been considered explicitly, the problem is considered a fixed effect ANOVA problem. The just-concluded raisins example posed a fixed effect problem because the 5 machines involved in the experimental study constitute the complete set in the production facility.
On the contrary, when the k factor levels (treatments) constitute a random sample from a larger group of factor levels and treatments, and the objective is to extend the results and conclusions to other factor levels that have not been considered explicitly, this is a random effect ANOVA problem. If, for example, the 5 machines involved in the raisin example had been selected for the experimental study from a complete set of, say, 15 machines in the production facility, that would have turned the problem into a random effect ANOVA problem.
With problems involving multiple factors, it is possible for some factors to be fixed while some are random; under these circumstances, the result is a mixed-effect ANOVA problem. A discussion of random and mixed-effect ANOVA lies outside the intended scope of this introductory chapter.

Summary Characteristics

The following are the summary characteristics of the simple, one-way classification, single factor experiment:

FIGURE 19.3: Normal probability plots of the residuals from the one-way classification ANOVA model in Example 19.1. Top panel: plot obtained directly from the ANOVA analysis, which does not provide any test statistic or significance level; bottom panel: subsequent goodness-of-fit test carried out on the saved residuals (N = 30, A-D statistic 0.223); note the high p-value (0.810) associated with the A-D test.

1. Experiment: single factor (k levels); fixed or random effect;
2. Design: completely randomized; balanced (whenever possible);
3. Analysis: one-way ANOVA.

19.3.2 Kruskal-Wallis Nonparametric Test

In the same spirit as the nonparametric tests discussed in Chapter 18, the Kruskal-Wallis (KW) test is a nonparametric alternative to the one-way ANOVA. Used to test the equality of the medians of two or more populations, it is a direct generalization of the strategy underlying the Mann-Whitney-Wilcoxon test, being based on the ranks of the data values rather than the actual values themselves. As with the other nonparametric tests, the KW test does not make any distributional assumptions (Gaussian or otherwise) about the populations. The only assumption is that the samples are independent random samples from continuous distributions with the same shape.
Thus, when the normality assumption for the one-way ANOVA is invalid, the KW test should be used. Computer programs such as MINITAB provide this option. With MINITAB, the sequence Stat > Nonparametrics > Kruskal-Wallis opens a dialog box similar to the one used for the standard one-way ANOVA.
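A minimal sketch of this alternative (Python with SciPy assumed), applied here to the raisin data of Example 19.1 purely for illustration:

    # Hedged sketch: Kruskal-Wallis test on the raisin data; compare its
    # conclusion with the one-way ANOVA of Example 19.1.
    from scipy import stats

    machines = [
        [27, 21, 24, 15, 33, 23],
        [17, 12, 14, 7, 14, 16],
        [13, 7, 11, 9, 12, 18],
        [7, 4, 7, 7, 12, 18],
        [15, 19, 19, 24, 10, 20],
    ]
    h_stat, p_value = stats.kruskal(*machines)
    print(f"H = {h_stat:.2f}, p = {p_value:.4f}")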

19.3.3 Two-Way Classification

By way of motivation, consider the problem of studying the effect of tire brand on the amount of wear experienced by the tire. Specifically, we are interested in answering the question: Do tire Brands A, B, and C wear differently? An experimental investigation of this question might involve outfitting a car with different brands and measuring wear after driving for a pre-specified number of miles. But there are some potential problems: (i) the driver effect: will the obviously different driving habits of drivers affect the observed wear? This problem may be avoided by using a single driver and then assuming that the driver's habits will not change over time. (ii) The wheel effect: the four wheels of a car may not be identical in how they wear down tires.
The challenge is to study the brand effect while avoiding contamination with the extraneous wheel effect. The solution offered by classical design of experiments is to use wheels as a blocking variable to obtain the randomized complete block design.
The Randomized Complete Block Design

The randomized complete block design, the simplest extension of the completely randomized design, is characterized as follows: it consists of one primary factor with k levels (i.e., k treatments), in addition to one blocking

TABLE 19.3: Data table for typical single-factor, two-way classification experiment

           Factor
Blocks      1      2      3     ...   j     ...   k      Means
B1         y11    y12    y13    ...  y1j    ...  y1k     ȳ1.
B2         y21    y22    y23    ...  y2j    ...  y2k     ȳ2.
...        ...    ...    ...         ...         ...     ...
Br         yr1    yr2    yr3    ...  yrj    ...  yrk     ȳr.
Total      T1     T2     T3     ...  Tj     ...  Tk
Means      ȳ.1    ȳ.2    ȳ.3    ...  ȳ.j    ...  ȳ.k     ȳ..

variable with r levels, e.g., Wheels 1, 2, 3, 4 in the motivating illustration. The treatments are allocated completely randomly within each block.
In addition to the usual assumptions underlying the completely randomized design, we now also assume that the variability due to the blocking variable is fixed within each block. (For example, what Wheel 1 does to Brand A it does to all other brands.) But this blocking effect, if it exists, may vary from block to block (e.g., the effect of Wheel 1 may be different from that of Wheels 2 or 3 or 4).
Postulated Model and Hypotheses

For this problem, the postulated model is:

    Yij = μ + τj + βi + εij;   i = 1, 2, ..., r;  j = 1, 2, ..., k  (19.19)

along with the usual distributional assumption, Yij ~ N(μj, σ²); τj, the deviation of the jth treatment mean from the grand mean, is the jth treatment effect; βi is the ith block effect; and εij is the random error. By definition,

    Σ_{j=1}^{k} τj = 0;   Σ_{i=1}^{r} βi = 0                        (19.20)

Because we are primarily concerned with identifying the presence of a treatment effect, the associated hypotheses are as before in Eq (19.7):

    H0: τ1 = τ2 = ... = τk = 0
    Ha: τℓ ≠ 0 for at least one ℓ.
Data Layout and Experimental Design

The layout of the data table from such an experiment, known as a two-way classification, is shown in Table 19.3. The experimental design is the


randomized complete block design, two-way crossed, because each factor level is combined (crossed) with each block to yield a total of r x k experiments; each experimental treatment combination that produces the observation, yij, is run in random order.

FIGURE 19.4: Graphic illustration of the orthogonal error decomposition of Eq (19.21) with the additional block component, EB.
Analysis

The analysis technique for this variation on the basic single-factor experiment is similar to the ANOVA technique discussed earlier. This time, however, the orthogonal decomposition has additional terms. It can be shown that, in this case, the sum-of-squares decomposition is

    ||EY||² = ||ET||² + ||EB||² + ||EE||²
      SSY  =   SST  +   SSB  +   SSE                                (19.21)

where the vector EY of the data deviations from the grand average is decomposed orthogonally into three components: the vectors ET, EB and EE, the components due, respectively, to the treatment effect, the block effect, and random error, as illustrated in Fig 19.4. The sum of squares, SSY, the total variability in the data, is thus decomposed into SST, the component due to treatment-to-treatment variability; SSB, the component due to block-to-block variability; and SSE, the component due to pure error. This is a direct extension of the ANOVA identity shown earlier.
It is important to note that with this strategy, the SSB component has been separated out from the desired SST component. And now, whether or not H0 is true, it can be shown that:

    E(SSE) = (k − 1)(r − 1)σ²                                       (19.22)
    E(SST) = (k − 1)σ² + r Σ_{j=1}^{k} τj²                           (19.23)
    E(SSB) = (r − 1)σ² + k Σ_{i=1}^{r} βi²                           (19.24)

so that, if H0 is true, then from these equations the following mean error sums of squares provide three independent estimates of σ²:

    MSE = SSE/[(k − 1)(r − 1)]                                      (19.25)
    MST = SST/(k − 1)                                               (19.26)
    MSB = SSB/(r − 1)                                               (19.27)

TABLE 19.4: Two-Way Classification ANOVA Table

Source of             Degrees of         Sum of    Mean
Variation             Freedom            Squares   Square   F
Between Treatments    k − 1              SST       MST      MST/MSE
Blocks                r − 1              SSB       MSB      MSB/MSE
Error                 (k − 1)(r − 1)     SSE       MSE
Total                 rk − 1             SSY

Significance is determined using the now-familiar test statistic

    F = MST/MSE                                                     (19.28)

which possesses an F(ν1, ν2) distribution, this time with ν1 = k − 1 and ν2 = (k − 1)(r − 1), if H0 is true. If H0 is not true, then, from Eq (19.23), the numerator of the F-statistic will be inflated by r Σ_{j=1}^{k} τj²/(k − 1), but by this term alone. Without separating out the block effect, SST would have been inflated in addition by SSB, in which case, even when H0 is true, the inflationary influence of SSB on SST could give the impression that there was a significant treatment effect; i.e., the treatment effect would have been contaminated by the block effect.
The result of the two-way classification, single-factor, randomized block design analysis is presented in the ANOVA table shown in Table 19.4, known as a two-way classification ANOVA table. Once again, this ANOVA table and the computations it entails are routinely generated by computer programs.
Example 19.2: TIRE WEAR FOR DIFFERENT TIRE BRANDS
In an experimental study to determine whether different brands (A, B and C) of steel-belted radial tires wear differently, a randomized complete block design is used, with car wheel as the blocking variable. The problem characteristics are as follows: (i) the response is the amount of wear on the tire (in coded units) after a standardized lab test that amounts to an effective 50,000 road miles; (ii) the factor is Tire Brand; (iii) the number of levels is 3 (Brands A, B and C); (iv) the blocking variable is Car Wheel; and (v) the number of blocks (or levels) is 4, specifically, Wheel 1: front left; Wheel 2: front right; Wheel 3: rear left; and Wheel 4: rear right. The data layout is shown in the table below. Determine whether or not the different tire brands wear differently.

            Factor
Blocks      Brand A   Brand B   Brand C   Means
Wheel 1       47        46        52      48.33
Wheel 2       45        43        51      46.33
Wheel 3       42        37        49      42.67
Wheel 4       48        50        55      51.00
Means        45.50     44.00     51.75    47.08

Solution:
To use MINITAB, the provided data must be entered into a MINITAB worksheet in stacked format, as shown in the table below:

Tire Wear   Wheel Number   Tire Brand
   47            1             A
   46            1             B
   52            1             C
   45            2             A
   43            2             B
   51            2             C
   42            3             A
   37            3             B
   49            3             C
   48            4             A
   50            4             B
   55            4             C

The sequence for carrying out the required analysis is Stat > ANOVA > Two-Way, opening a self-explanatory dialog box. The MINITAB results are summarized below.

Two-way ANOVA: Tire Wear versus Wheel Number, Tire Brand

Source        DF   SS       MS       F      P
Wheel Number   3   110.917  36.9722  11.78  0.006
Tire Brand     2   135.167  67.5833  21.53  0.002
Error          6    18.833   3.1389
Total         11   264.917

S = 1.772   R-Sq = 92.89%   R-Sq(adj) = 86.97%


The indicated p-values for both wheel number and tire brand show that both effects are significant (at the α = 0.05 level), but the most important fact here is that the wheel effect has been separated out and distinguished from the primary effect of interest. We therefore reject the null hypothesis and conclude that there is indeed a systematic difference
in how each tire wears; and we have been able to identify this effect without contamination from the wheel number effect, which is itself also significant.
Again, viewed as a regression model, seven parameters, the 3 treatment effects, τj; j = 1, 2, 3, and the 4 block effects, βi; i = 1, 2, 3, 4, can be estimated from the two-way classification model in Eq (19.19). MINITAB provides the value of the estimated pure error standard deviation, S = 1.772; it also provides the values R² = 92.89% and R²adj = 86.97%. These latter values show that a significant amount of the variation in the data has been explained by the two-way classification model. The reduction in the R²adj value arises from estimating 7 parameters from a total of 12 data points, leaving not too many degrees of freedom; still, these values are quite decent.

FIGURE 19.5: Normal probability plot of the residuals from the two-way classification ANOVA model for investigating tire wear, obtained directly from the ANOVA analysis.

The normal probability plot for the estimated residuals, obtained directly from the ANOVA dialog in MINITAB, is shown in Fig 19.5. A visual assessment indicates that the normality assumption appears valid. It is left as an exercise to the reader to carry out the more rigorous assessment by saving the residuals and then carrying out the rigorous probability model goodness-of-fit test separately (Exercise 19.7).

It is important to note that in the last example, both the brand effect and the wheel effect were found to be significant. Using wheels as a blocking variable allowed us to separate out the significant wheel effect from the real object of the investigation; without blocking, the wheel effect would have been compounded with the brand effect. This could have serious repercussions, particularly when the primary factor has no effect while the blocking factor has a significant effect.
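For readers without MINITAB, the randomized complete block analysis of Example 19.2 can be reproduced as a linear model. A minimal sketch (Python with pandas and statsmodels assumed; an illustration, not the author's procedure), which should agree with the MINITAB table above:

    # Hedged sketch: two-way classification (treatment + block) ANOVA for
    # the tire wear data of Example 19.2.
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    df = pd.DataFrame({
        'wear':  [47, 46, 52, 45, 43, 51, 42, 37, 49, 48, 50, 55],
        'wheel': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
        'brand': ['A', 'B', 'C'] * 4,
    })
    model = ols('wear ~ C(wheel) + C(brand)', data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))   # SS, df, F, p per effect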

19.3.4 Other Extensions

In the two-way classification case above, the key issue is that, in addition to the single factor of interest, there was another variable, a so-called nuisance variable, that could potentially contaminate our analysis. In general, with single factor experiments, there is the possibility of more nuisance variables, but the approach remains the same: block on the nuisance variables. When there is only one nuisance variable, the appropriate design, as we have seen, is the randomized complete block design, with analysis provided by the two-way classification ANOVA (one primary factor; one nuisance variable). With two blocking variables, the appropriate design is known as the Latin Square design, leading to a three-way classification ANOVA (one primary factor; two nuisance variables). With three blocking variables, we use the Graeco-Latin Square design, and a four-way classification ANOVA (one primary factor; three nuisance variables), etc. A discussion of these and other such designs lies outside the intended scope of this introductory chapter. (See Box, Hunter and Hunter, 2005^1.)

19.4 Two-Factor Experiments

With two-factor experiments, there are two legitimate factors of interest (not one factor in conjunction with a nuisance variable): a levels of factor A, and b levels of factor B. Both factors are varied together to produce a x b total treatments. For example, consider an experiment in which the effect of temperature and catalyst loading on conversion in a batch reactor is to be investigated at 150 C, 200 C and 250 C, along with 20% and 35% catalyst loading. This is a two-factor experiment, with the conversion as the response, and three levels of one factor, temperature, and two levels of the other factor, catalyst loading, for a total of 6 treatments.
The objective in this case is to determine the effects on the response variable of each individual factor, and of the possible interactions between the two factors, i.e., whether the effect of factor A changes at different levels of factor B. It can be shown that ascertaining the interaction effects requires replication of each treatment. The data layout is therefore conceptually similar to that for the randomized block case, except in two major ways:
1. Now the effect of the second variable is important; and
2. Replication is mandatory.

1 Box, G.E.P., J.S. Hunter, and W.G. Hunter, (2005). Statistics for Experimenters: Design, Innovation and Discovery, 2nd Ed., Wiley-Interscience, NJ.


Postulated Model and Hypotheses

The postulated model for the two-factor case is:

    Yijk = μ + αi + βj + (αβ)ij + εijk;
    i = 1, 2, ..., a;  j = 1, 2, ..., b;  k = 1, 2, ..., r          (19.29)

Here, αi is the main effect of the ith level of factor A; βj is the main effect of the jth level of factor B; (αβ)ij is the effect of the interaction between the ith level of factor A and the jth level of factor B; and εijk is the random error. Again, by definition,

    Σ_{i=1}^{a} αi = 0;  Σ_{j=1}^{b} βj = 0;  Σ_{i=1}^{a} Σ_{j=1}^{b} (αβ)ij = 0   (19.30)

As usual, the distributional assumption is Yijk ~ N(μij, σ²). Because we are concerned with identifying all the main effects and interactions, in this case the null hypotheses are:

    H0:  α1 = α2 = ... = αa = 0
    H0': β1 = β2 = ... = βb = 0
    H0": (αβ)ij = 0, for all i, j                                   (19.31)

and the alternatives are:

    Ha:  αi ≠ 0 for at least one i
    Ha': βj ≠ 0 for at least one j
    Ha": (αβ)ij ≠ 0 for at least one i, j pair                      (19.32)

Data Layout and Experimental Design

The layout of the data table for the two-factor experiment is shown in Table 19.5. The experimental design is two-way crossed: each level of factor A is combined (crossed) with each level of factor B, and each treatment combination experiment, which yields the observation yijk, is run in random order.

Analysis

The analysis for the two-factor problem is based on a similar orthogonal data decomposition and sum-of-squares decomposition; the resulting ANOVA table, consisting of terms for the main effects, interaction effects, and error, is shown in Table 19.6.

TABLE 19.5: Data table for typical two-factor experiment

                     Factor B
Factor A     1       2      ...     b
1          Y111    Y121    ...    Y1b1
           Y112    Y122    ...    Y1b2
           ...     ...            ...
           Y11r    Y12r    ...    Y1br
2          Y211    Y221    ...    Y2b1
           Y212    Y222    ...    Y2b2
           ...     ...            ...
           Y21r    Y22r    ...    Y2br
...        ...     ...            ...
a          Ya11    Ya21    ...    Yab1
           Ya12    Ya22    ...    Yab2
           ...     ...            ...
           Ya1r    Ya2r    ...    Yabr

TABLE 19.6: Two-factor ANOVA Table

Source of              Degrees of        Sum of    Mean
Variation              Freedom           Squares   Square   F
Main Effect A          a − 1             SSA       MSA      MSA/MSE
Main Effect B          b − 1             SSB       MSB      MSB/MSE
2-factor
Interaction AB         (a − 1)(b − 1)    SSAB      MSAB     MSAB/MSE
Error                  ab(r − 1)         SSE       MSE
Total                  abr − 1           SSY
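By way of a computational illustration of Table 19.6, here is a hedged sketch (Python with NumPy, pandas and statsmodels assumed) of a two-factor analysis with interaction for the reactor example; the conversion values are synthetically generated, not real data:

    # Hedged sketch: two-factor ANOVA with interaction, 3 x 2 treatments,
    # r = 2 replicates; responses are synthetic, for illustration only.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    rng = np.random.default_rng(0)
    rows = [(T, L, 50 + 0.05 * T + 0.2 * L + rng.normal(0, 1.0))
            for T in (150, 200, 250)       # temperature levels
            for L in (20, 35)              # catalyst loading levels
            for _ in range(2)]             # r = 2 replicates
    df = pd.DataFrame(rows, columns=['temp', 'load', 'conversion'])

    model = ols('conversion ~ C(temp) * C(load)', data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))  # main effects, interaction, error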

19.5 General Multi-factor Experiments

The treatment of experiments involving more than two factors is a natural and straightforward extension of the just-concluded two-factor discussion. Generally called factorial experiments, their distinguishing characteristic is that, as with the two-factor case, the experimental design involves every factor-level combination. Such designs are more precisely referred to as complete factorial designs, for the obvious reason that no factor-level combination is left unexplored. But it is this very characteristic that raises a potential problem: as the number of factors increases, the total number of experiments to be performed increases quite dramatically. For example, a case involving 4 factors with 3 levels each will result in 3^4 = 81 total treatments, not counting replicates. With 5 replicates (for determining the various interactions) the number of experiments climbs to 405.
Even if resources were available to perform all these experiments (which is doubtful in many practical circumstances), one of the truly attractive components of modern design of experiments is the underlying philosophy of acquiring the most informative data as judiciously as possible. With multi-factor experiments, therefore, the issue is not so much whether the complete factorial set should be run; the issue is how best to acquire the desired information as judiciously as possible. Sometimes this may mean running the complete factorial set; frequently, however, a carefully selected, restricted set of treatments can provide sufficiently informative data.
This is the motivation behind one of the most commonly employed designs for multi-factor experiments. When attention is restricted to only two levels of each factor, the result is a special case of general factorial experiments which, for k factors, gives rise to 2^k observations when all treatments are applied. These are the 2^k factorial designs, which are very popular because they are very efficient in how they generate remarkably informative data.

19.6 2^k Factorial Experiments and Design

19.6.1 Overview

2^k factorial designs are used to study the effects of k factors (and their interactions) simultaneously, rather than one at a time. The signature characteristic is that they involve only two levels of each factor (Low, High). This restriction endows these designs with their key advantage: they are very economical, allowing the extraction of a lot of information with relatively few experiments. There is also the peripheral advantage that the fundamental


nature of the design lends itself to computational shortcuts. However, this latter point is no longer of consequence, having been rendered irrelevant by modern computer software. Finally, as we show a bit later, 2^k factorial designs are easily adapted to accommodate experiments involving only a fraction of the total 2^k experiments, especially when k is so large that even the reduced set of 2^k experiments becomes untenable.

FIGURE 19.6: 2^2 factorial design for factors A and B showing the four experimental points; − represents low values, + represents high values for each factor.

For all their advantages, 2^k factorial experiments also have some important disadvantages. The most obvious is that, by restricting attention to only two levels, we limit our ability to confirm the presence of non-linear responses to changes in factor levels. The underlying assumption is that the relationship between the response, Y, and the factors, xi, is approximately linear (plus some possible interaction terms) over the range of the chosen factor levels. When this is a reasonable assumption, nothing is more efficient than the 2^k factorial design. In many practical applications, the recommendation is to use the 2^k factorial designs (or fractions thereof; see later) to begin experimental investigations, and then to augment with additional experiments if necessary.
Notation and Terminology

Because they involve investigations at precisely two levels of each factor, it is customary to use − or −1 to represent the Low level and + or +1 to represent the High level of each factor. In some publications, including journal articles and textbooks, lower case letters a, b, c, ... are used to represent the factors, and treatment combinations are represented as (1), a, b, ab, c, ac, bc, abc, ..., representing, respectively, the all low combination, A only high (every other factor low), B only high, A and B high, etc.
For example, a 2^2 factorial design involving two factors A and B is shown in Fig 19.6. The design calls for a base collection of four experiments: the first, (−, −), representing the (Low, Low) combination; the second, (+, −), the (High, Low) combination; the third, (−, +), the (Low, High) combination; and finally, the fourth, (+, +), the (High, High) combination. A concrete illustration and application of this design is presented in the upcoming Example 19.4.

Characteristics

The 2^k factorial design enjoys some desirable characteristics that make it particularly computationally attractive:
1. It is balanced, in the sense that there is an equal number of highs and lows for each factor. If the factor terms in the design are represented by ±1, then, for each factor,

    Σ xi = 0                                                        (19.33)

For example, summing down the column for each factor A and B in the table on the right hand side of Fig 19.6 shows this clearly.
2. It is orthogonal, in the sense that the sum of the products of the coded factors (coded as ±1) is zero, i.e.,

    Σ xi xj = 0;  i ≠ j                                             (19.34)

Multiplying column A by column B in Fig 19.6 and summing confirms this for this 2^2 example.
These two characteristics simplify analysis, allowing the separation of eects,
and making it possible to estimate each eect independently. For example,
with the 22 design of Fig 19.6, from Run #1 (, ) to Run #2 (+, ), only
factor A has changed. The dierence between the observations, y1 for Run #
1 and y2 for Run #2, is therefore a reection of the eect of changing A (while
keeping B at its low value). The same is true for Runs # 3 and 4, but at the
high level of B; in which case (y3 y4 ) provides another estimate of the main
eect of factor A. This main eect is therefore estimated from the results of
the 22 design as:
Main Eect of A =

1
[(y2 y1 ) + (y3 y4 )]
2

(19.35)

The other main eect and the two-way interaction can be computed from
similar considerations made possible by the transparency of the design.
When experimental data were analyzed by hand, these characteristics led
to the development of useful computational shortcuts (for example, the popular
Yates algorithm); these are no longer necessary, because of computer software
packages.
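For readers who prefer to see the arithmetic in executable form, the following minimal Python sketch (ours, not part of the original presentation; the response values are hypothetical) computes the main effects and the interaction for a generic 2^2 design directly from the sign table, exploiting the balance and orthogonality properties just discussed:

import numpy as np

# Sign table for the base 2^2 design, runs in standard order; coded -1 (Low), +1 (High).
A = np.array([-1, 1, -1, 1])
B = np.array([-1, -1, 1, 1])

# Hypothetical responses y1, ..., y4, for illustration only.
y = np.array([10.0, 14.0, 11.0, 17.0])

# Because the design is balanced and orthogonal, each effect is simply a
# signed average, e.g., Eq (19.35) for the main effect of A.
N = len(y)
effect_A = A @ y / (N / 2)           # 0.5*[(y2 - y1) + (y4 - y3)] = 5.0
effect_B = B @ y / (N / 2)           # 2.0
effect_AB = (A * B) @ y / (N / 2)    # 1.0
print(effect_A, effect_B, effect_AB)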

19.6.2 Design and Analysis
The general postulated model for 2^k factorial designs is:
$$y = \beta_0 + \sum_{i=1}^{k}\beta_i x_i + \sum_{i=1}^{k}\sum_{j=1}^{k}\beta_{ij} x_i x_j + \sum_{i=1}^{k}\sum_{j=1}^{k}\sum_{\ell=1}^{k}\beta_{ij\ell} x_i x_j x_\ell + \cdots + \epsilon \quad\quad (19.36)$$
a somewhat unwieldy looking equation that is actually quite simple for specific
cases. The parameter β0 represents the grand average; βi is the coefficient
related to the main effect of factor i; the double subscripted parameter βij is
related to the two-way interaction effect of the ith and jth factors; the triple
subscripted parameter βijℓ is related to the three-way interaction effect of the
ith, jth and ℓth factors, etc. Simpler, specific cases are illustrated with the
next example.
Example 19.3: TWO-FACTOR AND THREE-FACTOR 2^k
FACTORIAL MODELS
Write the postulated 2^k factorial models for the following two practical
experimental cases:
(1) Studying the growth of epitaxial layer on silicon wafer by chemical vapor deposition (CVD), where the response of interest, the epitaxial
layer thickness, Y, is believed to depend on two primary factors: deposition time, x1, and arsenic flow rate, x2.
(2) Characterizing the solubility, Y, of a globular protein as a function of three primary factors: pH, x1; salt concentration, x2; and temperature, x3.
Solution:
The models are as follows: for the CVD experiment,
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \epsilon \quad\quad (19.37)$$
β1 is related to the main effect of deposition time on epitaxial layer
thickness; β2 to the main effect of arsenic flow rate; and β12 is the
single two-way interaction coefficient representing how deposition time
interacts with arsenic flow rate.
For the globular protein solubility experiment, the model is:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_{12} x_1 x_2 + \beta_{13} x_1 x_3 + \beta_{23} x_2 x_3 + \beta_{123} x_1 x_2 x_3 + \epsilon \quad\quad (19.38)$$
The objective is to collect data according to the 2^k factorial design and
determine values for the indicated parameters by analyzing the data. Because
of the design, all the main effects and interactions are in fact estimated from
arithmetic means. Of course, not all parameters will be important; and, as
usual, which is important and which is not is determined from tests
of significance that are based on the same ANOVA decomposition presented
above for the two-factor experiment, as well as on individual t-tests for the
parameter estimates. The null hypothesis is that all these parameters are
identically equal to 0; significance is usually determined at the α = 0.05 level.

19.6.3 Procedure
The general procedure for carrying out 2^k factorial experiments is summarized here:
1. Create Design:
Given the number of factors, and the low and high levels for each factor,
use computer software (e.g. MINITAB, SAS, ...) to create the design.
Using these software packages to create the design is straightforward and
intuitive; the result is a design table showing each treatment combination
and the recommended run sequence. It is highly recommended to run the
experiments in random order, just as with the completely randomized
single-factor designs.
Creating the design requires determining how many replicates to run.
We present some recommendations shortly.
2. Perform Experiment:
This involves filling in the results of each experiment in the data sheet
created above.
3. Analyze Data:
Once the data is entered into the created design worksheet, all statistical
packages will carry out the factorial analyses, generating values for the
main effects and interactions, and providing associated significance
levels. It is also recommended to carry out diagnostic checks to validate
the underlying Gaussian distributional assumption.
Sample Size Considerations
The discussion in Section 15.5 of Chapter 15 included practical considerations for determining sample sizes for carrying out hypothesis tests regarding means of normal populations. The same considerations can be extended
to balanced 2-level factorial designs. In the same spirit, let δ represent the
smallest shift from zero we wish to detect in these effects (i.e., the smallest
magnitude of the effect worth detecting). With σ, the standard deviation of
the random error component, ε, usually unknown ahead of time, we can invoke
the definition of the signal-to-noise ratio given in Chapter 15, i.e.,
$$SN = \frac{\delta}{\sigma} \quad\quad (19.39)$$

and use it in conjunction with the following expression to determine a range
of recommended sample sizes (rounded to the nearest integer):
$$\left(\frac{7}{SN}\right)^2 < n < \left(\frac{8}{SN}\right)^2 \quad\quad (19.40)$$
(Compare with Eq (15.95).) The following table, generated from Eq (19.40), is
roughly equivalent to Table 15.8.

SN    0.5       1.0     1.5     2.0
n     196-256   49-64   22-28   12-16


Alternatively, the sequence, Stat > Power and Sample Size > 2-level
Factorial Design in MINITAB will also provide recommended values.
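The rule-of-thumb in Eq (19.40) is also easy to automate directly; the short Python sketch below (ours; the function name is arbitrary) reproduces the table above for any signal-to-noise ratio:

def sample_size_range(sn):
    # Eq (19.40): (7/SN)^2 < n < (8/SN)^2; round the bounds to integers
    # as in the text's table.
    return (7.0 / sn) ** 2, (8.0 / sn) ** 2

for sn in (0.5, 1.0, 1.5, 2.0):
    low, high = sample_size_range(sn)
    print(sn, round(low), round(high))
# 0.5 -> 196-256; 1.0 -> 49-64; 1.5 -> 22-28; 2.0 -> 12-16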
The following example illustrates these principles.
Example 19.4: 2^2 FACTORIAL INVESTIGATION OF CVD PROCESS²
In an experimental study of the growth of epitaxial layer on silicon wafer
by chemical vapor deposition (CVD), the effect of deposition time and
arsenic flow rate on the epitaxial layer thickness was studied in a 2^2
factorial experiment using the following settings:
Factor A: Deposition time (Short, Long)
Factor B: Arsenic Flow Rate (55%; 59%)
Designing for a signal-to-noise ratio of 2 leads to a recommended sample size 12 < n < 16, which calls for a minimum of three full replicates
of the four basic 2^2 experiments (and a maximum of four replicates).
The 2^2 factorial design, in 3 replicates, and the experimental results are
shown in the table below. Analyze the data and comment on the results.

RunOrder   Depo. Time   Arsenic Flow rate   Epitaxial Layer Thick.
 1         -1           -1                  14.04
 2          1           -1                  14.82
 3         -1            1                  13.88
 4          1            1                  14.88
 5         -1           -1                  14.17
 6          1           -1                  14.76
 7         -1            1                  13.86
 8          1            1                  14.92
 9         -1           -1                  13.97
10          1           -1                  14.84
11         -1            1                  14.03
12          1            1                  14.42

Solution:
First, to create the design using MINITAB, the required sequence is
Stat > DOE > Factorial > Create Factorial Design, which opens a
dialog box where all the characteristics of the problem are specified.
The result, which should be stored in a worksheet, is the set of 12
treatment combinations arising from 3 replicates of the base 2^2 design.
It is highly recommended to run the actual experiments in random
order. (MINITAB can provide a randomized order for the experiments
upon selecting the Randomize runs option.)
Once the experiments have been completed and the epitaxial
layer thickness measurements entered into the worksheet (as shown in
the table above), the sequence for analyzing the data is Stat > DOE
> Factorial > Analyze Factorial Design. The dialog box contains
2 Adapted from an article in the AT&T Technical Journal.

many options and the reader is encouraged to explore them all. The
basic MINITAB output is shown here.
Factorial Fit: Epitaxial Layer versus Depo. Time, Arsenic Flow rate

Estimated Effects and Coefficients for Epitaxial Layer Thick. (coded units)

Term                            Effect     Coef  SE Coef       T      P
Constant                                14.3825  0.04515  318.52  0.000
Depo. Time                      0.7817   0.3908  0.04515    8.66  0.000
Arsenic Flow rate              -0.1017  -0.0508  0.04515   -1.13  0.293
Depo. Time*Arsenic Flow rate    0.0350   0.0175  0.04515    0.39  0.708

S = 0.156418  R-Sq = 90.51%  R-Sq(adj) = 86.96%
This MINITAB output lists both the coefficient and the effect associated with each factor, and it should be obvious that one is twice
the other. The coefficients represent the parameters in the factorial
model equation, e.g., as in Eqs (19.37) and (19.38); from these equations,
we see that each coefficient represents the change in y for a unit change
in each factor xi. On the other hand, what is known as the effect is
the change in y when each factor, xi, changes from its low value to its
high value, i.e., from −1 to +1. Since this is a change of two units in
x, the effects are therefore twice the magnitude of the coefficients. The
constant term is unaffected since it is not multiplied by any factor; it is
the grand average of the data.
Next, we observe that for each estimated parameter, there is an
associated t-statistic and corresponding p-value. These are determined
in the same manner as discussed in earlier chapters, using the Gaussian
assumption and the mean square error, MSE, from which S, the estimate of
the standard deviation, is obtained. The null hypothesis is that all coefficients are
identically zero. In this specific case, only the constant term and the
Deposition time effect appear to be significantly different from zero;
the Arsenic flow rate term and the two-factor interaction term appear
not to be significant at the α = 0.05 level. MINITAB also lists the
usual regression characteristics, R² and R²adj, none of which appear to be
particularly indicative of anything unusual.
Finally, MINITAB also produces an ANOVA table, shown here. This
is a consolidation of the detailed information already shown above. It
indicates that the main effects, as a composite (not separating one main
effect from the other), are significant (p-value is 0.000); the two-way interaction is not (p-value is 0.708).

Analysis of Variance for Epitaxial Layer Thick. (coded units)

Source              DF   Seq SS   Adj SS    Adj MS      F      P
Main Effects         2  1.86402  1.86402  0.932008  38.09  0.000
2-Way Interactions   1  0.00368  0.00368  0.003675   0.15  0.708
Residual Error       8  0.19573  0.19573  0.024467
Total               11  2.06343


Thus, the final result of the analysis is that if x1 represents deposition
time, and x2, arsenic flow rate, then the estimated equation relating
these factors to y, the epitaxial layer thickness, is
$$y = 14.3825 + 0.3908 x_1 \quad\quad (19.41)$$
where, it must be kept in mind, x1 is coded as −1 for the short
deposition time, and +1 for the long deposition time.
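As a check on the mechanics, the MINITAB coefficients above are easy to reproduce by ordinary least squares on the coded design; the following Python sketch (ours) uses the data of this example:

import numpy as np

# Coded factors for the 12 runs, in the order listed in the data table.
x1 = np.array([-1, 1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1])    # deposition time
x2 = np.array([-1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1])    # arsenic flow rate
y = np.array([14.04, 14.82, 13.88, 14.88, 14.17, 14.76,
              13.86, 14.92, 13.97, 14.84, 14.03, 14.42])

# Model matrix for Eq (19.37): columns for the constant, x1, x2, and x1*x2.
X = np.column_stack([np.ones(12), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # approximately [14.3825, 0.3908, -0.0508, 0.0175], as in the output above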

19.6.4 Closing Remarks

The best applications of 2^k factorial designs are for problems with a relatively small number of factors, say k < 5, and when the relationship between
y and the factors is reasonably linear in the Low-High range explored experimentally, with non-linearities limited to cross-terms in the factors (no
quadratic or higher order effects). The direct corollary is that practical problems not suited to 2^k factorial designs include those with a large number
of potentially important factors, and those for which nonlinear relationships
are important to the investigation, for example, for applications in process
optimization.
For the latter group of problems, specific extensions of the 2^k factorial
designs are available to deal with each problem:
1. Screening Designs (for a large number of factors);
2. Response Surface Methods (for more complex modeling and optimization).
We deal next with these designs in turn, beginning with screening designs.
Screening designs are experimental designs specifically calibrated for selecting, from among a large group of potential factors, only the few that are
truly important, prior to carrying out more detailed experiments. Such designs therefore involve significantly fewer experimental runs than required by
the 2^k factorial designs. They are created to avoid the expenditure of a lot
of experimental effort, since the objective is quicker decisions on factor importance rather than detailed characterization of effects on responses. The two
most popular categories are Fractional Factorial designs and Plackett-Burman
designs.
While screening designs involve running fewer experiments than called for
by the 2^k factorial designs, response surface designs involve judiciously adding
more experimental runs, to capture more complex relationships better.

19.7 Screening Designs: Fractional Factorial

19.7.1 Rationale
Investigations with large k require an exponential increase in the total
number of experimental runs if the full 2^k factorial design is employed. With
an increase in the number of factors also comes an increase in the total number
of higher-order interactions. For example, for the case with 5 factors, with a
2^5 factorial design and the resulting base 32 experimental results, we are
able to estimate 1 overall average; 5 main effects; 10 2-factor interactions
(5 × 4/2); 10 3-factor interactions (5 × 4 × 3/6); 5 4-factor interactions; and 1 5-factor interaction. This raises an important practical question: How important
or even physically meaningful are fourth- and higher-order interactions? The
answer is that in all likelihood, they will be either unimportant or at best
insignificant. The underlying rationale behind fractional factorial designs is to
give up the ability to estimate (unimportant) higher-order interaction effects in
return for fewer experimental runs when k > 5.

19.7.2 Illustrating the Mechanics

Consider a case involving 4 factors, A, B, C, D, for which a full 2^4 factorial
design will require 16 base experiments (not counting replicates). Let us investigate the possibility of making do with half of the experiments, based on a 2^3
factorial design for the first 3 factors A, B, C. This proposition may be possible
if we are willing to give up something in this reduced design to be used for the
other factor D. First, the base 2^3 factorial design for A, B and C is shown here:
Run #    A    B    C
1       -1   -1   -1
2        1   -1   -1
3       -1    1   -1
4        1    1   -1
5       -1   -1    1
6        1   -1    1
7       -1    1    1
8        1    1    1

Now suppose that we are willing to give up the ability to determine the
three-way interaction ABC, in return for being able to investigate the effect
of D also. In the language of fractional factorial design, this is represented as:
$$D = ABC \quad\quad (19.42)$$
an expression to which we shall return shortly. It can be shown that the code
corresponding to ABC is obtained from a term-by-term multiplication of the
signs in the columns A, B, and C in the design table. Thus, for example,
the first entry for run #1 will be −1 (= −1 × −1 × −1), run #2 will be 1
(= 1 × −1 × −1), etc. The result is the updated table shown here.
Run #    A    B    C    D
1       -1   -1   -1   -1
2        1   -1   -1    1
3       -1    1   -1    1
4        1    1   -1   -1
5       -1   -1    1    1
6        1   -1    1   -1
7       -1    1    1   -1
8        1    1    1    1
Thus, by giving up the ability to estimate the three-way interaction ABC,
we have obtained the 8-run design shown in this table for 4 factors, a 50%
reduction in the number of runs (i.e., a half-fraction of what should have been
a full 2^4 factorial design). This seems like a reasonable price to pay; however,
this is not the whole cost. Observe that the code for the two-way interaction
AD (obtained by multiplying each term in the A and D columns), written
horizontally, is (1, 1, −1, −1, −1, −1, 1, 1), which is precisely the same as the
code for the two-way interaction BC! But even that is still not all. It is left
as an exercise to the reader to show that the code for AB is also the same as
that for CD; similarly, the codes for AC and for BD are also identical.
Observe therefore that for this problem,
1. The primary trade-off, D = ABC, allowed an eight-run experimental
design for a 4-factor system, precisely half of the 16 runs ordinarily
required for a full 2^4 design for 4 factors;
2. But D = ABC is not the only resulting trade-off; other trade-offs include
AD = BC, AB = CD, and AC = BD, plus some others;
3. The implication of these secondary (collateral) trade-offs is that these
two-way interactions, for example, AD and BC, are now confounded,
being indistinguishable from each other; they cannot be estimated independently.
Thus, when we give up some high-order interactions to estimate some other
factors, we also lose the ability to estimate other additional effects independently.
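These observations are easily verified numerically. The short Python sketch below (ours) constructs the 8-run design with D = ABC and confirms that the AD and BC columns (and likewise AB and CD, AC and BD) coincide:

import numpy as np

# Base 2^3 design in A, B, C (standard order), with A varying fastest.
A = np.tile([-1, 1], 4)
B = np.tile([-1, -1, 1, 1], 2)
C = np.repeat([-1, 1], 4)
D = A * B * C                              # the primary trade-off, Eq (19.42)

print(np.array_equal(A * D, B * C))        # True: AD and BC are confounded
print(np.array_equal(A * B, C * D))        # True: AB and CD are confounded
print(np.array_equal(A * C, B * D))        # True: AC and BD are confounded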

19.7.3 General Characteristics

Notation and Alias Structure
The illustrative example 8-run factorial design for a 4-factor system shown
above is a half-fraction of a 2^4 design, called a 2^(4-1) design. In general, a 2^(k-p)
design is a (1/2^p) fraction of the full 2^k design. For example, a 2^(5-2) design is
a quarter fraction of the full 2^5 design, which consists of 8 total runs (1/4 of
the full 32).
As illustrated above, the reduction in the total number of runs in 2^(k-p)
designs is achieved at a cost; this cost of fractionation, the confounding
of two effects so that they cannot be independently estimated, is known as
aliasing. And for every fractional factorial design, there is an accompanying
alias structure, a complete listing of what is confounded with what. Such
alias structures can be determined from what is known as the defining relation,
an expression, such as the one in Eq (19.42) above, indicating the primary
trade-off.
There are simple algebraic rules for determining alias structures. For instance, upon multiplying both sides of Eq (19.42) by D and using the simple
rule that DD = I, the identity column, we obtain,
$$I = ABCD \quad\quad (19.43)$$
This is the defining relation for this particular fractional factorial design. The
additional aliases can be obtained using the same algebraic rule: upon multiplying both sides of Eq (19.43) by A, and then by B, and then C, we obtain:
$$A = BCD; \; B = ACD; \; C = ABD \quad\quad (19.44)$$
showing that, like the main effect D, the other main effects A, B, and C are
also confounded with the indicated three-way interactions. From here, upon
multiplying the expressions in Eq (19.44) by the appropriate letters, we obtain
the following additional aliases:
$$AB = CD; \; AC = BD; \; AD = BC \quad\quad (19.45)$$
Observe that for this design, main effects are confounded with 3-way interactions only; and 2-way interactions are confounded with other 2-way interactions.
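The "algebra of words" used in these manipulations reduces to cancelling repeated letters (since XX = I), i.e., taking the symmetric difference of the letter sets; the following Python sketch (ours) generates the alias of any effect for this design from the defining relation:

def alias(effect, defining="ABCD"):
    # Multiply two effect "words" under the rule XX = I:
    # the product is the symmetric difference of the letter sets.
    letters = set(effect) ^ set(defining)
    return "".join(sorted(letters)) or "I"

for effect in ["A", "B", "C", "D", "AB", "AC", "AD"]:
    print(effect, "=", alias(effect))
# A = BCD, B = ACD, C = ABD, D = ABC, AB = CD, AC = BD, AD = BC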
Design Resolution
The order of the effects that are aliased is captured succinctly in the design resolution. For example, the illustration used above is a Resolution
IV design because 2-way interactions are confounded with other 2-way interactions, and main effects (1-way interactions) are confounded with 3-way
interactions.
The resolution of a design is typically represented by roman numerals, III,
IV, V, etc.; they define the cost of fractionation. The higher the design
resolution, the better we are able to determine main effects, two-way interactions (and even 3-way interactions) independently. The following notation is
typical: 2^(4-1)_IV represents a Resolution IV, half fraction of a 2^4 factorial design,
such as the illustrative example used above. On the other hand, a 2^(5-2)_III design
is a Resolution III quarter fraction of a 2^5 design.

In general, an m-way interaction aliased with an n-way interaction implies
a Resolution (m + n) design. The following are some general remarks about design
resolutions:
1. Resolution III: Main effects (1-way) are aliased with 2-way interactions. Use only sparingly.
2. Resolution IV: Main effects (1-way) aliased with 3-way interactions;
and 2-way interactions aliased with 2-way interactions. This is usually
acceptable for many applications.
3. Resolution V: Main effects (1-way) aliased with 4-way interactions;
2-way interactions aliased with 3-way interactions. This is usually an
excellent choice whenever possible. Because one is usually not so much
concerned about 3- or higher-order interactions, main effects and 2-way
interactions that are of importance can be estimated almost cost-free.
Designs with resolution higher than V are economical, virtually cost-free
means of estimating main effects and 2-way interactions.

19.7.4 Design and Analysis

Basic Principles
The design and analysis of fractional factorial designs are very similar to
those for the basic factorial designs. Here are the key points of interest.
1. Planning: Determine k (total number of factors) and p (extent of fractionation), giving the total number of unreplicated runs; determine how
many replicates are required using the rule-of-thumb specified above or
the MINITAB power and sample size feature.
2. Design: Given the above information, typical computer software packages will display available designs and their respective resolutions; select
an appropriate resolution (we recommend IV or higher where possible). The
computer program will then generate the design and the accompanying
alias structure. It is important to save both into a worksheet.
3. Analysis: This follows the same principles as for the full 2^k designs, but in
interpreting the results, keep aliases in mind.
Before dealing with a comprehensive example, there are two more concepts of
importance in engineering practice we wish to discuss.
Projection and Folding
Consider, for example, that one or more factors in a half fraction of a 2^k
design (i.e., a 2^(k-1) design) turns out not to be significant; we can then project the data
down to a full factorial in the significant k − 1 variables. For instance, imagine that after completing the experiments in the 4-factor illustrative example
presented above, we discover that D was not significant after all; observe that
the data can then be considered as arising from a full 2^3 factorial design in
A, B and C. If, for example, both C and D are not significant, then the data
can be projected down two full dimensions, so that the 8 experiments will be
considered as 2 full replicates of a full 2^2 factorial design in the significant
factors A and B. Every point at which the factor C was investigated therefore becomes
a replicate. Where the opportunity presents itself, projection therefore always
strengthens the experimental analysis in the remaining factors. This is illustrated shortly with a concrete example.
The reverse is the case with folding: combining lower fraction designs
into higher fraction ones, for example, combining two 1/2 fractions into a
full factorial design, or two 1/4 fractions into a 1/2 fractional factorial design.
This is illustrated in Fig 19.7. This strategy is employed if, after the analysis of
the fractional factorial design results, we discover that some of the confounded
interactions and effects are important enough to be determined independently.
Folding increases the design resolution by providing additional information
required to resolve aliases. However, each fold costs an additional degree of
freedom for analysis.

FIGURE 19.7: Graphic illustration of folding, where two half-fractions of a 2^3 factorial design (the original and the alternate half-fraction) are combined to recover the full factorial design; each fold costs an additional degree of freedom for analysis.

While we are unable to accommodate additional detailed discussions of
fractional factorial designs in such an introductory chapter as this, we use the
following example to illustrate their practical application.

19.7.5 A Practical Illustrative Example

Problem Statement
The problem involves a single-wafer plasma etcher process which uses the
reactant gas C2F6. The response of interest is the etch rate for Silicon Nitride
(in Å/min), which is believed to be dependent on the 4 factors listed below.
A: Gap; cm. (The spacing between the Anode and Cathode)
B: Reactor Chamber Pressure; mTorr.
C: Reactant Gas (C2F6) Flow rate; SCCM.
D: Power (applied to Cathode); Watts.
The objective is to determine which factors affect etch rate by investigating
the process response at the values indicated below for the factors, using a 2^(4-1)
design with no replicates.
Variable                  Levels
                          -1      +1
A. Gap (cm)               0.8     1.2
B. Pressure (mTorr)       450     550
C. Gas Flow Rate (sccm)   125     200
D. Power (Watts)          275     325
Design and Data Collection
The required design is created in MINITAB using the sequence Stat >
DOE > Factorial > Create Factorial Design > and entering the problem
characteristics. MINITAB returns both the design (which is saved into a worksheet) and the following characteristics, including the alias structure.

Fractional Factorial Design

Factors:    4    Base Design:    4, 8    Resolution:          IV
Runs:       8    Replicates:        1    Center pts (total):   0
Fraction: 1/2    Blocks:         none

Design Generators: D = ABC

Alias Structure
I + ABCD
A + BCD


B + ACD
C + ABD
D + ABC
AB + CD
AC + BD
AD + BC
Note that this is precisely the same design generator and alias structure as in
the Resolution IV example used above to illustrate the mechanics of fractional
factorial design generation.
The design table, along with the data acquired using the design, are shown
below:
Std     Run     Gap   Pressure   Gas    Power   Etch Rate
Order   Order                    Flow
1       5       0.8   450        125    275      550
2       7       1.2   450        125    325      749
3       1       0.8   550        125    325     1052
4       8       1.2   550        125    275      650
5       6       0.8   450        200    325     1075
6       4       1.2   450        200    275      642
7       2       0.8   550        200    275      601
8       3       1.2   550        200    325      729
Note the difference between the randomized order in which the experiments
were performed and the standard order.
Analysis Part 1
To analyze this data set, the sequence Stat > DOE > Factorial
> Analyze Factorial Design > opens a dialog box with several self-explanatory options. Of these, we draw particular attention to the button
labeled "Terms." Upon selecting this button, a further dialog box is opened
in which the terms to be included in the analysis are shown. It is interesting
to note that the default already selected by MINITAB shows only the four
main effects, A, B, C, D, and three two-way interactions, AB, AC and AD. The
reason, of course, is that everything else is aliased with these terms. Next,
the button labeled "Plots" allows one to select which plots to display. For
reasons that will become clearer later, we select for the Effect Plots the
Normal Plot option. The button labeled "Results" shows what MINITAB
will include in the output: estimated coefficients and ANOVA table, alias
table with default interactions, etc. The results for this particular analysis are
shown here.
Results for: Plasma.MTW
Factorial Fit: Etch Rate versus Gap, Pressure, Gas Flow, Power


Estimated Effects and Coefficients for Etch Rate (coded units)

Term           Effect     Coef
Constant                 756.00
Gap           -127.00   -63.50
Pressure         4.00     2.00
Gas Flow        11.50     5.75
Power          290.50   145.25
Gap*Pressure   -10.00    -5.00
Gap*Gas Flow   -25.50   -12.75
Gap*Power     -197.50   -98.75

S = *  PRESS = *
Alias Structure
I + Gap*Pressure*Gas Flow*Power
Gap + Pressure*Gas Flow*Power
Pressure + Gap*Gas Flow*Power
Gas Flow + Gap*Pressure*Power
Power + Gap*Pressure*Gas Flow
Gap*Pressure + Gas Flow*Power
Gap*Gas Flow + Pressure*Power
Gap*Power + Pressure*Gas Flow
Analysis of Variance for Etch Rate (coded units)

Source              DF  Seq SS  Adj SS  Adj MS  F  P
Main Effects         4  201335  201335   50334  *  *
2-Way Interactions   3   79513   79513   26504  *  *
Residual Error       0       *       *       *
Total                7  280848
The first thing we notice is that the usual p-values associated with the estimates are missing in both the Estimated Effects table and in the ANOVA
table. Because there are no replicates, there is no independent way to estimate
the random error component of the data, from which the standard deviation,
s, is estimated, which is in turn used to determine significance (p-values). This
is why no value is returned for S by MINITAB.
How then does one determine which effects are significant under these
conditions? This is where the normal probability plot for effects becomes extremely useful. When effects are plotted on normal probability paper, the
effects that are not significant will cluster around the value 0 in the middle,
with the significant ones separating out at the extremes. The significant effects
are identified using a methodology developed by Lenth (1989)³ (a discussion
of which lies outside the intended scope of this chapter) and from there, an
appropriate distribution fit can then be provided to the non-significant effects.

3 Lenth, R.V. (1989). Quick and Easy Analysis of Unreplicated Factorials, Technometrics, 31, 469-473.

FIGURE 19.8: Normal probability plot for the effects (response is Etch Rate, α = .10), using Lenth's method to identify A (Gap), D (Power) and AD as significant.

Such a probability plot for this data set (using Lenth's method to determine
the significant effects) is shown in Fig 19.8, where the effects A and D, respectively Gap and Power, are identified as important, along with the two-way
interaction AD (i.e., Gap*Power).
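Although the details of Lenth's procedure lie outside our scope, it is brief enough to sketch; the following Python fragment (ours, following Lenth, 1989) computes the pseudo standard error (PSE) from the seven estimated effects above:

import numpy as np

effects = np.array([-127.0, 4.0, 11.5, 290.5, -10.0, -25.5, -197.5])

# Lenth (1989): s0 = 1.5 * median|effect|; the PSE is 1.5 times the median of
# those |effects| smaller than 2.5*s0 (a robust estimate of the effects' standard error).
s0 = 1.5 * np.median(np.abs(effects))
pse = 1.5 * np.median(np.abs(effects)[np.abs(effects) < 2.5 * s0])
print(s0, pse)   # 38.25, 16.125

# The margin of error is t(0.975, m/3)*PSE with m = 7 effects (roughly 60 here),
# so only A (-127.0), D (290.5) and AD (-197.5) stand out as significant.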
At this point, it is important to pause and consider the repercussions of
aliasing. According to the alias structure for this Resolution IV design, AD =
BC, i.e., it is impossible, from this data, to distinguish between the Gap*Power
interaction effect and the Pressure*Gas Flow interaction effect. Thus, is the
identified significant interaction the former (as indicated by default), or the
latter? This is where domain knowledge becomes crucial in interpreting the
results of fractional factorial data analysis. First, from pure common sense, if
Gap and Power are identified as significant factors, it is far more natural and
more likely that the two-way interaction that is also of significance will be
Gap*Power and not Pressure*Gas Flow. This common sense conjecture is in
fact corroborated by the physics of the process: the spacing between the anode
and cathode (Gap) and the power applied to the cathode are far more likely
to influence etch rate than the interaction between Pressure and Gas Flow
rate, especially when neither of these individual factors appears important by
itself. We therefore conclude, on the basis of the factorial analysis, aided
especially by the normal probability plot of the effects, that Gap and Power
are the important factors that affect etch rate. This finding presents us with a
fortuitous turn of events: we started with 4 potential factors, performed a set of
8 experiments based on a 2^(4-1) fractional factorial design (with no replicates),
and discovered that only 2 factors are significant. This immediately suggests
projection. By projecting down onto the two relevant dimensions represented

by A and D, the 8 experiments will appear as if they were obtained from a
2^2 full factorial design (4 experiments) with 2 full sets of replicates, thereby
allowing us to obtain estimates of the experimental error standard deviation,
which in turn can be used to determine the precision of the effect estimates.
Analysis Part II: Projection
As noted above, with only A and D as the surviving important factors, we
can reanalyze the data with only these terms. Projection is carried out by removing all terms containing B and C, including AB, AC, etc., from the "Terms"
dialog box. Upon selecting the button labeled "Results" afterwards, one finds
that the only surviving terms are the desired ones: A, D and AD. When the
analysis is repeated, we obtain the following results:
Factorial Fit: Etch Rate versus Gap, Power

Estimated Effects and Coefficients for Etch Rate (coded units)

Term         Effect     Coef  SE Coef       T      P
Constant             756.00    7.494  100.88  0.000
Gap         -127.00  -63.50    7.494   -8.47  0.001
Power        290.50  145.25    7.494   19.38  0.000
Gap*Power   -197.50  -98.75    7.494  -13.18  0.000

S = 21.1955  PRESS = 7188
R-Sq = 99.36%  R-Sq(pred) = 97.44%  R-Sq(adj) = 98.88%
Analysis of Variance for Etch Rate (coded units)

Source              DF  Seq SS  Adj SS  Adj MS       F      P
Main Effects         2  201039  201039  100519  223.75  0.000
2-Way Interactions   1   78013   78013   78013  173.65  0.000
Residual Error       4    1797    1797     449
Total                7  280848
It should not be lost on the reader that the estimates have not changed;
only now we have associated measures of their precision. (The SE Coef
term refers to the standard error associated with the coefficients, from which,
as discussed in Chapters 14 and 15, one obtains 95% confidence intervals.)
We also now have associated p-values. These additional quantities have been
made available courtesy of the projection.
The conclusion is not only that the two most important factors are Gap
and Power, but that the estimated relationship between Etch rate, y, and
these factors, with x1 as Gap and x2 as Power, is:
$$y = 756.00 - 63.50 x_1 + 145.25 x_2 - 98.75 x_1 x_2 + \epsilon \quad\quad (19.46)$$
The R² value and other affiliated measures of the variability in the data explained by this simple model are also contained in the MINITAB results
shown above. These values indicate, among other things, that the amount of


the variability in the data explained by this simple model is quite substantial. A normal probability plot of the estimated model residuals is shown in
Fig 19.9, where, visually, we see no reason to question the normality of the
residuals.

FIGURE 19.9: Normal probability plot for the residuals of the Etch rate model in Eq
(19.46), obtained upon projection of the experimental data to retain only the significant
terms A, Gap (x1); D, Power (x2); and the interaction AD, Gap*Power (x1 x2).
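The projected analysis is likewise easy to verify outside MINITAB; the Python sketch below (ours) refits the model of Eq (19.46) to the 8 runs by least squares and recovers the residuals examined in Fig 19.9:

import numpy as np

# Coded Gap (x1) and Power (x2) for the 8 runs in standard order (Power = ABC).
x1 = np.array([-1, 1, -1, 1, -1, 1, -1, 1])
x2 = np.array([-1, 1, 1, -1, 1, -1, -1, 1])
y = np.array([550, 749, 1052, 650, 1075, 642, 601, 729], dtype=float)

X = np.column_stack([np.ones(8), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
print(beta)        # approximately [756.0, -63.5, 145.25, -98.75], i.e., Eq (19.46)
print(residuals)   # the residuals plotted in Fig 19.9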

19.8 Screening Designs: Plackett-Burman

As useful as fractional factorial designs are, the voids in the prescribed
number of experimental runs widen substantially as the number of factors k
increases. This is because the experimental runs are limited to powers of 2 (i.e.,
4, 8, 16, 32, 64, 128, etc.). This leaves the experimenter with limited options
for larger k. For example, for k = 7, if the total number of runs, N = 128, is
too many, the next options are 64 or 32.
If necessity is the mother of invention, then it is no surprise that these
voids were filled in 1946 by two British scientists, Robin L. Plackett and
J. Peter Burman⁴, who were part of a team working on the development
of anti-aircraft shells during the World War II bombing of London. The time
constraint faced by the team necessitated the screening of a very large number
of potential factors very quickly. The result is that voids left by fractional
factorial designs can now be filled by what has become known, appropriately,
as Plackett-Burman (PB) designs: where fractional factorial designs involve runs
that are powers of 2, PB designs involve runs that are multiples of 4. Thus, with PB
designs, experimental runs of sizes N = 12, 20, 24, 28, etc. are now possible.

4 Plackett, R.L. and J. P. Burman (1946). The design of optimum multifactorial experiments. Biometrika, 33, 305-325.

19.8.1 Primary Characteristics

PB designs are also 2-level designs (like fractional factorials), but where
fractional factorial designs involve runs that are powers of 2, PB designs have
experimental runs that are multiples of 4. All PB designs are of Resolution III,
however, so that all main effects are aliased with two-way interactions. They
should not be used, therefore, when 2-way interactions might be as important
as main effects.
PB designs are best used to screen for critical main effects when the number of potential factors is large. The primary advantage is that they involve
remarkably few runs for a large number of factors; they are therefore extremely
efficient and cost-effective. For example, the following is the design table for
a PB design of 12 runs for k = 7 factors!
Run #   A   B   C   D   E   F   G
 1      +   +   -   +   +   +   -
 2      -   +   +   -   +   +   +
 3      +   -   +   +   -   +   +
 4      -   +   -   +   +   -   +
 5      -   -   +   -   +   +   -
 6      -   -   -   +   -   +   +
 7      +   -   -   -   +   -   +
 8      +   +   -   -   -   +   -
 9      +   +   +   -   -   -   +
10      -   +   +   +   -   -   -
11      +   -   +   +   +   -   -
12      -   -   -   -   -   -   -
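This table need not be memorized: PB designs of size N are generated by cyclically shifting a single row of N − 1 signs and appending a final row of minuses. The Python sketch below (ours) builds the 12-run design from Plackett and Burman's original generating row and confirms that the columns are mutually orthogonal:

import numpy as np

GENERATOR_12 = np.array([1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1])

def pb_design(generator, k):
    n = len(generator) + 1
    rows = [np.roll(generator, i) for i in range(n - 1)]  # cyclic shifts of the generator
    rows.append(-np.ones(n - 1, dtype=int))               # final run: all factors Low
    return np.array(rows)[:, :k]                          # keep the first k columns

X = pb_design(GENERATOR_12, 7)              # 12 runs for 7 factors, as in the table
print((X.T @ X == 12 * np.eye(7)).all())    # True: the design columns are orthogonal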

The main disadvantages have to do with their resolution: with PB designs, it is difficult, if not entirely impossible, to determine interaction effects
independently. Furthermore, the alias structures are quite complicated (but
important). The designs have also been known to be prone to poor precision,
but this can be mitigated with the use of replicates. It is recommended that
computer software be used for both design and analysis of experiments using
the PB strategy.

19.8.2 Design and Analysis

Programs such as MINITAB will generate PB designs and the accompanying alias structure. Because they are orthogonal two-level designs, the analysis
is similar to that for factorials. But, we emphasize again, these are best
carried out using computer programs.
Additional discussions are available in Chapter 7 of the Box, Hunter and
Hunter (2005) reference provided earlier. An application of PB designs in
biotechnology discussed in Balusu, et al. (2004)⁵ is highly recommended to the
interested reader.

19.9 Response Surface Designs

Frequently, the objective of the experimental study is to capture the relationship between the response y and the factors xi mathematically, so that
the resulting model can be used to optimize the response with respect to the
factors. Under these circumstances, for such models to be useful, they will
have to include more than the approximate linear ones possible with two-level
designs. Response surface methodology is the approach for obtaining models
of the sort required for optimization studies. It provides the designs for efficiently fitting more complex models to represent the relationship between the
response, Y, and the factors, with the resulting model known as the response
surface. A detailed discussion is impossible in this lone section of an introductory chapter devoted to the topic. The classic reference is the book by Box
and Draper⁶ that the interested reader is encouraged to consult. What follows
is a summary of the salient features of this experimental design strategy that
finds important applications in engineering practice.

19.9.1 Characteristics

With response surface designs, each factor is evaluated at three settings,
generically denoted −, 0 and +, the Low, Medium and High levels.
These designs are most applicable when factors are continuous variables as
opposed to categorical variables, because the response surfaces are assumed
to be smooth. This is not the case with categorical variables; for example,
there is no interval to speak of between Type A and Type B fertilizer, as
opposed to the interval between temperature at 100°C and 250°C, which is
continuous.
The key assumption is that the relationship between response and factors
(i.e., the response surface) can be approximated well by low-order polynomials;
often, no more than second-order polynomials are used to capture any response
surface curvature.

For example, the typical postulated model for two factors is
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \epsilon \quad\quad (19.47)$$

Run   X1   X2   X3
 1    -1   -1   -1
 2    +1   -1   -1
 3    -1   +1   -1
 4    +1   +1   -1
 5    -1   -1   +1
 6    +1   -1   +1
 7    -1   +1   +1
 8    +1   +1   +1
 9    -1    0    0
10    +1    0    0
11     0   -1    0
12     0   +1    0
13     0    0   -1
14     0    0   +1
15     0    0    0
16     0    0    0
17     0    0    0

FIGURE 19.10: The 3-factor face-centered cube (FCC) response surface design and
its constituent parts: 2^3 factorial base, open circles; face center points, lighter shaded
circles; center point, darker solid circle.

5 Balusu R., R. M. R. Paduru, G. Seenayya, and G. Reddy, (2004). Production of ethanol from Clostridium thermocellum SS19 in submerged fermentation: Screening of nutrients using Plackett-Burman design. Applied Biochem. & Biotech., 117 (3), 133-142.
6 G.E.P. Box and N.R. Draper (1987). Empirical Model Building and Response Surfaces, J. Wiley, N.Y.

19.9.2 Response Surface Designs

Of the many available response surface designs, the face-centered cube
(FCC) design (part of the family of Central Composite designs) and the Box-Behnken design are the most commonly employed. The face-centered cube
design, as its name implies, is based upon adding axial and central experimental points to the basic 2^k factorial design. For example, for 3 factors, the
face-centered cube design is shown in Fig 19.10. The three components of the
design are represented with circles: the open circles represent the base 2^3 factorial design; the lighter shaded circles are the face centers, with the darker
solid circle as the dead center of the cube. The standard design shown calls
for replicating the center point three times, for a total of 17 runs.
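The geometry of Fig 19.10 translates directly into a design matrix; the Python sketch below (ours) assembles the 17 runs from their three constituent parts:

import itertools
import numpy as np

corners = list(itertools.product([-1, 1], repeat=3))      # the 2^3 factorial base
faces = [p for p in itertools.product([-1, 0, 1], repeat=3)
         if sum(abs(v) for v in p) == 1]                  # the 6 face centers
center = [(0, 0, 0)] * 3                                  # the center point, replicated 3 times

fcc = np.array(corners + faces + center)                  # 17 runs x 3 factors
print(fcc.shape)                                          # (17, 3)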
When corner points are infeasible (for example, when the high temperature, high pressure and high concentration combination will lead to an explosion, and/or when the low temperature, low pressure, and low catalyst
concentration will lead to reaction extinction), the Box-Behnken design is a
viable option. Because this design is based not on face centers but on edge-centers, it avoids the potential problems caused by infeasible corner points.
An example 3-factor Box-Behnken design is shown in Fig 19.11. The design
is based on three 2^2 factorial designs, each moved to the edge-centers of a third
factor, plus points at the dead center of the cube.

Run   X1   X2   X3
 1    -1   -1    0
 2    +1   -1    0
 3    -1   +1    0
 4    +1   +1    0
 5    -1    0   -1
 6    +1    0   -1
 7    -1    0   +1
 8    +1    0   +1
 9     0   -1   -1
10     0   +1   -1
11     0   -1   +1
12     0   +1   +1
13     0    0    0
14     0    0    0
15     0    0    0

FIGURE 19.11: The 3-factor Box-Behnken response surface design and its constituent
parts: X1, X2: 2^2 factorial points moved to the center of X3 to give the darker shaded
circles at the edge-centers of the X3 axes; X1, X3: 2^2 factorial points moved to the
center of X2 to give the solid circles at the edge-centers of the X2 axes; X2, X3: 2^2
factorial points moved to the center of X1 to give the lighter shaded circles at the
edge-centers of the X1 axes; the center point, open circle.

Thus, the first four runs are
based on the four X1, X2 factorial points moved to the center of X3, to give
the darker shaded circles at the edge-centers of the X3 axes. The next four are
based on the four X1, X3 factorial points moved to the center of X2, to give
the solid circles at the edge-centers of the X2 axes. The next four, likewise, are
based on the four X2, X3 factorial points moved to the center of X1, to give
the lighter shaded circles at the edge-centers of the X1 axes. The center
points are the open circles. The design also calls for three replicates of the
center point. Note that while the three-factor FCC design involves 17 points,
the Box-Behnken design involves 15.
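The Box-Behnken construction just described is equally mechanical; the Python sketch below (ours) builds the 15-run design of Fig 19.11 by placing a 2^2 factorial in each pair of factors while holding the remaining factor at its center:

import itertools
import numpy as np

runs = []
for i, j in itertools.combinations(range(3), 2):    # factor pairs (X1,X2), (X1,X3), (X2,X3)
    for a, b in itertools.product([-1, 1], repeat=2):
        point = [0, 0, 0]                           # the remaining factor stays at its center
        point[i], point[j] = a, b
        runs.append(point)
runs += [[0, 0, 0]] * 3                             # three replicated center points

bb = np.array(runs)                                 # 15 runs x 3 factors
print(bb.shape)                                     # (15, 3)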

19.9.3 Design and Analysis

As with every experimental design discussed in this chapter, programs such
as MINITAB will generate response surface designs and also carry out the
necessary analysis once the data has been collected. It is important to note
that in coded form (using −1, +1 and 0), the design matrix is orthogonal.
The response surface analysis is therefore similar to the polynomial regression
discussed in Chapter 16, but it is especially simplified because the design
matrix is orthogonal.
Additional discussions are available in the Box and Draper reference given
earlier, and also, for example, in Chapter 12 of Box, Hunter, and Hunter
(2005), and Chapter 10 of Ryan (2007)⁷. An application to catalyst design may
be found in Hendershot, et al. (2004)⁸. Another application, in food engineering,
is available in Pericin, et al. (2007)⁹. An application to industrial process
optimization is discussed as a case study in Chapter 20.

19.10 Introduction to Optimal Designs

19.10.1 Background

Whether stated explicitly or not, all statistical experimental design strategies are based on assumed mathematical models. To buttress this point, at
every stage of the discussion in this chapter, we have endeavored to state explicitly the postulated model underlying each design. Most of the models we
have encountered have been relatively simple, involving at most polynomials
of modest order. What happens if an experimenter has available a model for
the system under investigation, and the objective is to estimate the unknown
model parameters using experimental data? Under such circumstances, the
appropriate experimental design should be one that provides, for the given
model, the best set of experimental conditions so that the acquired data
is most informative for the task at hand: estimating the unknown model
parameters. This is the motivation behind optimal designs.
If the postulated model is of the form:
$$y = X\beta + \epsilon \quad\quad (19.48)$$
i.e., the matrix form of the general linear regression model, we may recall
that, given the data vector, y, and the design matrix X, the least-squares
estimate of the parameter vector is:
$$\hat{\beta} = \left(X^T X\right)^{-1} X^T y \quad\quad (19.49)$$
Optimal experimental designs for problems of this kind are concerned with
determining values for the elements of the matrix X that will provide estimates
in Eq (19.49) that are optimal in some specific sense. Not surprisingly, the
optimality criteria are usually related to the matrix
$$F_I = X^T X \quad\quad (19.50)$$
known as the Fisher information matrix.
7 Ryan, T.P. (2007). Modern Experimental Design, John Wiley, New Jersey.
8 R. J. Hendershot, W. B. Rogers, C. M. Snively, B. A. Ogunnaike, and J. Lauterbach, (2004). Development and optimization of NOx storage and reduction catalysts using statistically guided high-throughput experimentation, Catalysis Today, 98, 375-385.
9 Pericin, D. et al., (2007). Evaluation of solubility of pumpkin seed globulins by response surface method. J. Food Engineering, 84, 591-594.


19.10.2 Alphabetic Optimal Designs

The D-Optimal design selects values of the factors x to maximize |X^T X|,
the determinant of the information matrix. This optimization criterion maximizes the information content of the data and hence of the estimate. It is
possible to show that for the factorial models in Eq (19.36), with k factors,
the D-Optimal design is precisely the 2^k factorial design, where −1 and +1
represent, respectively, the left and right extremes of the feasible region for
each factor, giving a k-dimensional hypercube in the experimental space.
Other optimality criteria give rise to variations in the optimal design arsenal. Some of these are listed below (in alphabetical order!), along with the optimality
criteria themselves and what they mean for the parameter estimates (a small
numerical illustration follows the list):

1. A-Optimal designs: minimize the trace of (X^T X)^{-1}. And because
(as we may recall from Chapter 16),
$$Var(\hat{\beta}) = \left(X^T X\right)^{-1}\sigma^2 \quad\quad (19.51)$$
with σ² as the error variance, the A-optimality criterion minimizes the average variance of the estimated coefficients, a desirable
property.
Since the inverse of a matrix is its adjoint divided by its determinant,
it should be clear therefore that the D-Optimality criterion minimizes
the generalized variance of the estimated parameters.

2. E-Optimal designs: maximize the smallest eigenvalue of (X^T X). This
is essentially the same as improving the conditioning of the design matrix X (reducing its condition number), preventing the sort of ill-conditioning that leads to poor
estimation. This is a less-known, and less popular, design criterion.

3. G-Optimal designs: minimize the maximum diagonal element of the
hat matrix, defined in Chapter 16 as H = X(X^T X)^{-1} X^T, which, as
we may recall, is associated with the model's predicted values. The G-optimality criterion minimizes the maximum variance associated with
the predicted values, ŷ.

4. V-Optimal designs: minimize the trace (sum of the diagonal elements)
of the hat matrix. As such, the V-optimality criterion minimizes the
average prediction variance associated with ŷ.
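To make these criteria concrete, the Python sketch below (ours) evaluates the D-, A- and E-criteria for the two-factor factorial model y = β0 + β1x1 + β2x2 + β12x1x2 under two candidate designs: the 2^2 factorial at the extremes of the feasible region, and an arbitrary interior design. The factorial scores better on all three counts:

import numpy as np

def criteria(X):
    FI = X.T @ X                              # Fisher information matrix, Eq (19.50)
    return (np.linalg.det(FI),                # D: to be maximized
            np.trace(np.linalg.inv(FI)),      # A: to be minimized
            np.linalg.eigvalsh(FI).min())     # E: smallest eigenvalue, to be maximized

def model_matrix(x1, x2):
    # Columns for the constant, x1, x2 and the x1*x2 interaction.
    return np.column_stack([np.ones(len(x1)), x1, x2, x1 * x2])

factorial = model_matrix(np.array([-1, 1, -1, 1]), np.array([-1, -1, 1, 1]))
interior = model_matrix(np.array([-0.5, 0.5, -0.5, 0.5]), np.array([-0.5, -0.5, 0.5, 0.5]))

print(criteria(factorial))   # (256.0, 1.0, 4.0)
print(criteria(interior))    # smaller determinant, larger trace, smaller minimum eigenvalue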
Even with such a simple linear model, the computations involved in obtaining these optimal designs are not trivial; and things become even more complicated when the models are non-linear in the parameters. Also, these optimal
designs have come under some criticism for various reasons (see Chapter 13 of
Ryan, 2007). Nevertheless, they have been put to good use in some chemical
engineering applications, for example, Asprey and Macchietto (2000)¹⁰ and
(2002)¹¹. They have also inspired experimental design techniques for such
challenging problems as signal transduction model development in molecular
biology (Birtwistle, et al., 2009¹²). The most current definitive text on the
subject matter is Atkinson et al. (2007)¹³, which is recommended highly to
the interested reader.

19.11 Summary and Conclusions

This chapter has been primarily concerned with providing something of an
extended overview of strategies for designing experiments that will generate
informative data sets most efficiently. The discussions took us rapidly from
designs for simple, one-factor experiments through the popular factorial and
fractional factorial designs. And even though the coverage is by no means comprehensive in terms of breadth, there is sufficient detail in what was presented.
The discussions about sample size issues and about folding and projection may
have been brief, but they are important, with significant implications in practice. Some in-chapter examples and a handful of end-of-chapter exercises and
applications problems illustrate this point. The deliberately abbreviated discussions of Plackett-Burman designs and response surface methods should be
viewed as an attempt to whet the appetite of the interested reader; and if such
a reader then chooses to pursue the topic further (for example, in textbooks
dedicated to these matters), the introduction in this chapter should facilitate
such an endeavor.
With the discussion of experimental designs in this chapter complementing
the earlier discussions of both descriptive statistics (Chapter 12) and inductive (i.e., inferential) statistics (Chapters 13-18), our intended coverage of the
very broad subject matter of statistics is now formally complete. Even though
we have presented principles and illustrated mechanics and applications with
appropriate examples at various stages of our coverage, an appropriate capstone that will serve to consolidate all the material is still desirable. The next
chapter (Chapter 20) provides such a capstone. It contains a series of case
studies carefully chosen to demonstrate the application of the various aspects

10 Asprey, S. and Macchietto, S. (2000). Statistical tools for optimal dynamic model building. Computers and Chemical Engineering, 24, 1261-1267.
11 Asprey, S. and Macchietto, S. (2002). Designing robust optimal dynamic experiments. Journal of Process Control, 12, 545-556.
12 Birtwistle, M. R., B. N. Kholodenko, and B. A. Ogunnaike, (2009). Experimental Design for Parameter Identifiability in Biological Signal Transduction Modeling, Chapter 10, Systems Analysis of Biological Networks, Ed. A. Jayaraman and J. Hahn, Artech House, London.
13 Atkinson, A., A. Donev, and R. Tobias, (2007). Optimum Experimental Designs, with SAS, Oxford University Press, NY.

of statistics (graphical analysis, estimation, hypothesis testing, regression, experimental design) in real-life applications.

REVIEW QUESTIONS
1. In what sense is the term experimental studies used in this chapter?
2. What is an observational study?
3. What is the key distinguishing characteristic of the experimental studies of concern in this chapter?
4. What are the two basic tasks involved in the experimental studies discussed in
this chapter? What complicates the effective execution of these tasks?
5. How does statistical design of experiments enable efficient conduct of experimental studies?
6. List the phases of efficient experimental studies and what each phase entails.
7. In the terminology of statistical design of experiments, what is a response, a factor, a level, and a treatment?
8. When more than two population means are to be compared simultaneously, why
are multiple pairwise t-tests not recommended?
9. What is ANOVA, and on what is it predicated?
10. What are the two central assumptions in ANOVA?
11. What is a one-way classification of a single factor experiment, and how is it
different from a two-way classification?
12. What is the postulated model and what are the hypotheses for a one-way classification experiment?
13. What is the completely randomized design, and why is it appropriate for the
one-way classification experiment?
14. What makes a one-way classification experimental design balanced as opposed
to unbalanced?
15. What is the ANOVA identity, and what is its primary implication in data analysis?


16. What is the difference between a fixed effect and a random effect ANOVA?
17. What is the Kruskal-Wallis test?
18. What is the randomized complete block design?
19. What is the postulated model and what are the hypotheses for the randomized
complete block design?
20. What is a nuisance variable?
21. What is the difference between a 2-way classification of a single factor experiment and a two-factor experiment?
22. What is the postulated model and what are the hypotheses for a two-factor
experiment?
23. What is the potential problem with general multi-level multi-factor experiments?
24. What is a 2^k factorial experiment?
25. What are some of the advantages and disadvantages of 2^k factorial designs?
26. What does it mean that 2^k factorial designs are balanced and orthogonal?
27. What is the general procedure for carrying out 2^k factorial experiments?
28. 2^k factorial designs are best applied to what kinds of problems?
29. What are screening designs and what is the rationale behind them?
30. In fractional factorial designs, what is aliasing, and what is an alias structure?
31. What is a defining relation?
32. What is the resolution of a fractional factorial design?
33. What are the main characteristics of a Resolution III, a Resolution IV, and a
Resolution V design?
34. If one is interested in estimating both main effects and 2-way interactions, why
should one not use a Resolution III design?
35. What is projection, and under what condition is it possible?
36. What is folding?


37. What problem with fractional factorial designs is ameliorated with Plackett-Burman designs?
38. What is the resolution of all Plackett-Burman designs?
39. What are Plackett-Burman designs best used for?
40. What are response surface designs and what distinguishes them from basic 2^k
factorial designs?
41. What is a typical response surface design model for a two-factor experiment?
42. What is the difference between a face-centered cube design and a Box-Behnken
design?
43. When is a Box-Behnken design to be preferred over a face-centered cube design?
44. What is an optimal experimental design?
45. What is the Fisher information matrix for a linear regression model?
46. What optimization criteria lead respectively to D-Optimal, A-Optimal, E-Optimal, G-Optimal and V-Optimal designs?

EXERCISES
19.1 In each of the following, identify the response, the factors, the levels, and the
total number of treatments. Also identify which variables are categorical and which
are quantitative.
(i) A chemical engineering catalysis researcher is interested in the effect of NO concentration (3500 ppm, 8650 ppm); O2 concentration (4%, 8%); CO concentration
(3.5%, 5.5%); Space velocity (30,000, 42,500 mL/hr/g-cat); Temperature (548 K,
648 K); SO2 concentration (0 ppm, 300 ppm); and Catalyst metal type (Pt, Ba,
Fe), on saturation NOx storage (μmol).
(ii) A material scientist studying a reactive extrusion process is interested in the
effect of screw speed (135 rpm, 150 rpm), feed-rate (15 lb/hr, 25 lb/hr), and feed-composition, %A (25, 30, 45), on the residence time distribution, f(t).
(iii) A management consultant is interested in the risk-taking propensity of three
types of managers: entrepreneurs, newly-hired managers and newly promoted managers.
(iv) A child psychologist is interested in the effect of socio-economic status of parents
(Lower class, Middle class, Upper class), Family size (Small, Large) and Mother's
marital status (Single-never married, Married, Divorced), on the IQ of 5-year-olds.


19.2 Consider the postulated model for the single-factor, completely randomized
experiment given in Eq (19.2):
$$Y_{ij} = \mu_j + \epsilon_{ij}; \; i = 1, 2, \ldots, n_j; \; j = 1, 2, \ldots, k$$
where μj is the mean associated with the jth treatment; furthermore, let the grand
mean of the complete data set be μ.
(i) If the jth treatment mean is expressed as:
$$\mu_j = \mu + \tau_j; \; j = 1, 2, \ldots, k$$
so that τj represents the jth treatment effect, show that
$$\sum_{j=1}^{k}\tau_j = 0 \quad\quad (19.52)$$
and hence establish Eq (19.52).
(ii) With the randomized complete block design, the postulated model was given as:
$$Y_{ij} = \mu + \tau_j + \beta_i + \epsilon_{ij}; \; i = 1, 2, \ldots, r; \; j = 1, 2, \ldots, k$$
where τj is the jth treatment effect, βi is the ith block effect, and εij is the random
error. Show that
$$\sum_{j=1}^{k}\tau_j = 0; \quad \sum_{i=1}^{r}\beta_i = 0$$
and hence establish Eq (19.20).
19.3 To demonstrate the mechanics of the ANOVA decomposition, consider the
following table of obviously made-up data for a fictitious completely randomized
design experiment (one-way classification) involving 2 levels of a single factor:

Sample   Level 1   Level 2
1        1         4
2        2         5
3        3         6

(i) Obtain, as illustrated in the text, the following (6-dimensional) vectors:
E_Y, whose elements are (Y_{ij} − Ȳ..);
E_T, whose elements are (Ȳ.j − Ȳ..); and
E_E, whose elements are (Y_{ij} − Ȳ.j).
(ii) Confirm the error decomposition identity:
$$E_Y = E_T + E_E \quad\quad (19.53)$$
(iii) Confirm the orthogonality of E_T and E_E, i.e., that
$$E_T^T E_E = 0 \quad\quad (19.54)$$
(iv) Finally, confirm the sum-of-squares identity:
$$\|E_Y\|^2 = \|E_T\|^2 + \|E_E\|^2 \quad\quad (19.55)$$
where the squared norm of an n-dimensional vector, a, whose elements are a_i; i =
1, 2, \ldots, n, is defined as:
$$\|a\|^2 = \sum_{i=1}^{n} a_i^2 \quad\quad (19.56)$$

19.4 Consider the response Y_{ij} in a single-factor completely randomized experiment.
From the definition given in the text of the treatment average, Ȳ.j, and the grand
average, Ȳ.., the error decomposition expression given in Eq (19.11) is
$$(Y_{ij} - \bar{Y}_{..}) = (\bar{Y}_{.j} - \bar{Y}_{..}) + (Y_{ij} - \bar{Y}_{.j})$$
$$E_Y = E_T + E_E$$
Take sums-of-squares in this equation and show that the result is the following sums
of squares identity:
$$\sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij} - \bar{Y}_{..})^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(\bar{Y}_{.j} - \bar{Y}_{..})^2 + \sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij} - \bar{Y}_{.j})^2$$
$$SS_Y = SS_T + SS_E$$
and hence establish Eq (19.12).


19.5 Consider the data shown in the table below, where $F_i: i = 1, 2, 3, 4$ represents four factors, and $B_i: i = 1, 2, \ldots, 5$ represents five blocks. The data were generated from normal populations with mean 10 and standard deviation 2, except for column 3 (for $F_3$), generated from a normal distribution with mean 12 and standard deviation 2. Also, the first row was adjusted by adding 1.5 across the entire row.

       F1     F2     F3     F4
B1    14.5   11.5   12.7   10.9
B2    10.6   10.2   13.6   10.8
B3    12.0    9.1   14.6    9.7
B4     9.0    8.8   12.2    8.7
B5    11.6   10.6   15.3   10.0

(i) First analyze the data as a one-way classification with five replicates and comment on the results, especially the p-value associated with the ANOVA F-test. Note the $R^2$ and $R^2_{adj}$ values. What do these values indicate about how much of the variation in the data has been explained by the one-way classification model?
(ii) Now analyze the data as a two-way classification, where the blocks are now explicitly recognized as such. Compare the results with those obtained in (i) above. Comment on what this exercise indicates about what can happen when a nuisance effect is not explicitly separated out from an analysis of a one-factor experiment.
19.6 Refer to Exercise 19.5 and the supplied data. Repeat the analysis, this time saving the residuals in each case (one-way first, and then two-way next). Carry out a normality test on both sets of residuals, plot both residuals on the same graph, and compare their standard deviations. Comment on what these residuals imply about which ANOVA model more appropriately fits the data, especially in light of what is known about how this particular data set was generated.


19.7 Refer to Example 19.2 in the text. Repeat the data analysis in the example and save the residuals for analysis. Assess the normality of the residuals. Interpret the results and discuss what they imply about the ANOVA model for the tire wear data.
19.8 Write out the model for a $2^4$ factorial design where the factors are $x_1, x_2, x_3$ and $x_4$, and the response is $y$.
(i) How many parameters are to be estimated in order to specify this model completely?
(ii) Obtain a base design for this experiment.
19.9 For each of the following base factorial designs, specify the number of replicates required.
(i) $2^3$ with signal-to-noise ratio specified as $S_N = 1.5$
(ii) $2^2$ with signal-to-noise ratio specified as $S_N = 1.5$
(iii) $2^2$ with signal-to-noise ratio specified as $S_N = 2.0$
(iv) $2^4$ with signal-to-noise ratio specified as $S_N = 1.5$
19.10 A factorial experiment is to be designed to study the effect on reaction yield of temperature, $x_1$, at 180°C and 240°C, in conjunction with pressure at 1 atm and 2 atm. Obtain a design in terms of the original variables with three full replicates.
19.11 The design matrix for a $2^2$ factorial design for factors $X_1$ and $X_2$ is shown below, along with the measured responses $y_i: i = 1, \ldots, 4$.

Run    X1    X2    Response
 1     -1    -1       y1
 2      1    -1       y2
 3     -1     1       y3
 4      1     1       y4
Given the model equation:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \epsilon$$
it is possible to use the regression approach of Chapter 16 to obtain the estimates of the model parameters as follows:
(i) Write the model equation in vector-matrix form,
$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \mathbf{X}\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_{12} \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \end{pmatrix}$$
i.e., as
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \qquad (19.57)$$
From the design table given above, determine the matrix $\mathbf{X}$ in terms of $-1$ and $+1$.
(ii) From Chapter 16, we know that the least squares estimate of the unknown parameters in Eq (19.57) is:
$$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}$$


Show that, for this particular $2^2$ factorial design model, the least squares solution, $\hat{\boldsymbol{\beta}}$, is given as follows:
$$\hat{\beta}_0 = \frac{1}{4}(y_1 + y_2 + y_3 + y_4)$$
$$\hat{\beta}_1 = \frac{1}{4}(-y_1 + y_2 - y_3 + y_4)$$
$$\hat{\beta}_2 = \frac{1}{4}(-y_1 - y_2 + y_3 + y_4)$$
$$\hat{\beta}_{12} = \frac{1}{4}(y_1 - y_2 - y_3 + y_4)$$
(iii) Examine the design table shown above and explicitly associate the elements of this design table directly with the least squares solution in (ii); identify how such a solution can be obtained directly from the table without necessarily going through the least squares computation.
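The algebra in (ii) is easy to verify numerically. The following is a minimal Python sketch (illustrative only; the responses are borrowed, purely for concreteness, from Replicate 1 of Exercise 19.12):

import numpy as np

# Columns of X: intercept, x1, x2, and the x1*x2 interaction,
# with the four runs in standard order as in the design table above.
X = np.array([[1, -1, -1,  1],
              [1,  1, -1, -1],
              [1, -1,  1, -1],
              [1,  1,  1,  1]], dtype=float)

# Illustrative responses (Replicate 1 of Exercise 19.12).
y = np.array([0.68, 3.81, 1.67, 8.90])

# beta_hat = (X'X)^{-1} X'y; because X'X = 4I for this design, this
# collapses to X'y/4 -- i.e., attach the column signs to the y's and
# divide by 4, exactly the pattern displayed in (ii).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 4))          # [b0, b1, b2, b12]
print(np.round((X.T @ y) / 4.0, 4))   # identical: the shortcut of (iii)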
19.12 The following data were obtained from a $2^2$ factorial experiment for factors A and B, with two replicate runs. Analyze the data and estimate the main effects and the interaction term.

Run    A     B    Replicate 1   Replicate 2
 1    -1    -1       0.68          0.65
 2     1    -1       3.81          4.03
 3    -1     1       1.67          1.71
 4     1     1       8.90          9.66

19.13 Generate a design table for a $2^4$ factorial experiment for factors A, B, C and D, to be used as a base design for a $2^{5-1}$ half-factorial design which now includes a fifth factor, E.
(i) Use ABC = E to generate the $2^{5-1}$ design and show the resulting table. Obtain the alias structure for the remaining 3-factor interactions. What is the resolution of this design? Which main effect can be estimated with no confounding?
(ii) This time, use BCD = E to generate the $2^{5-1}$ design and repeat (i).
19.14 The following table shows the result of a full 32-run, $2^5$ factorial experiment involving 5 factors A, B, C, D and E.

Data Table for Exercises 19.14, 19.15, 19.16

Run    A    B    C    D    E       y
 1    -1   -1   -1   -1   -1     2.98
 2     1   -1   -1   -1   -1     0.05
 3    -1    1   -1   -1   -1     3.30
 4     1    1   -1   -1   -1     7.92
 5    -1   -1    1   -1   -1     0.75
 6     1   -1    1   -1   -1     6.58
 7    -1    1    1   -1   -1     1.08
 8     1    1    1   -1   -1    15.48
 9    -1   -1   -1    1   -1     2.64
10     1   -1   -1    1   -1     0.90
11    -1    1   -1    1   -1     3.37
12     1    1   -1    1   -1     6.72
13    -1   -1    1    1   -1     1.05
14     1   -1    1    1   -1     7.18
15    -1    1    1    1   -1     0.97
16     1    1    1    1   -1    14.59
17    -1   -1   -1   -1    1     3.14
18     1   -1   -1   -1    1     1.09
19    -1    1   -1   -1    1     3.11
20     1    1   -1   -1    1     7.37
21    -1   -1    1   -1    1     1.32
22     1   -1    1   -1    1     6.53
23    -1    1    1   -1    1     0.60
24     1    1    1   -1    1    14.25
25    -1   -1   -1    1    1     2.93
26     1   -1   -1    1    1     0.51
27    -1    1   -1    1    1     3.46
28     1    1   -1    1    1     6.69
29    -1   -1    1    1    1     1.35
30     1   -1    1    1    1     6.59
31    -1    1    1    1    1     0.82
32     1    1    1    1    1    15.53

(i) Estimate all the main effects and interactions. (Of course, you should use a computer program.)
(ii) Since no replicates are provided, use the normal probability plot and Lenth's method (which should be available in your computer program) to confirm that only the main effects, A, B, and C, and the two-way interactions, AB and AC, are significant.
(iii) In light of (ii), project the data down appropriately and reanalyze the data. This time note the uncertainty estimates associated with the estimates of the main effects and interactions; note also the various associated $R^2$ values, and comment on the fit of the reduced model to the data.
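As a hint for part (i), effects can be computed directly as the difference between the average response at the high and low levels of each ±1 column. A minimal Python sketch follows (it assumes the table rows are in standard order, with the sign of A changing fastest):

import numpy as np
from itertools import product

y = np.array([2.98, 0.05, 3.30, 7.92, 0.75, 6.58, 1.08, 15.48,
              2.64, 0.90, 3.37, 6.72, 1.05, 7.18, 0.97, 14.59,
              3.14, 1.09, 3.11, 7.37, 1.32, 6.53, 0.60, 14.25,
              2.93, 0.51, 3.46, 6.69, 1.35, 6.59, 0.82, 15.53])

# Standard-order 2^5 design: column A changes sign fastest, E slowest.
design = np.array(list(product([-1, 1], repeat=5)))[:, ::-1]

for j, name in enumerate("ABCDE"):           # main effects
    col = design[:, j]
    print(name, round(y[col == 1].mean() - y[col == -1].mean(), 3))

for pair in ("AB", "AC"):                    # two of the interactions
    col = (design[:, "ABCDE".index(pair[0])]
           * design[:, "ABCDE".index(pair[1])])
    print(pair, round(y[col == 1].mean() - y[col == -1].mean(), 3))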
19.15 Refer to Exercise 19.14 and the accompanying data table. This time create a $2^{5-1}$ fractional factorial design and pretend that the experimenter had used this fractional factorial design instead of the full one. Extract from the full 32-run data table the results corresponding to the 16 runs indicated in the created $2^{5-1}$ design.
(i) Determine the design resolution and the alias structure.
(ii) Repeat the entire Exercise 19.14 for only the indicated 16 experimental results. Compare the results of the analysis with that in Exercise 19.14. Does the experimenter lose anything significant by running only half of the full $2^5$ factorial experiments? Discuss what this particular example illustrates regarding the use of fractional factorial designs when the experimental study involves many factors.
19.16 Refer to Exercise 19.14 and the accompanying data table. Create an 8-run, $2^{5-2}$ fractional factorial design and, again as in Exercise 19.15, pretend that the experimenter has used this design instead of the full one. Extract from the full 32-run data table the results corresponding to the 8 runs indicated in the $2^{5-2}$ design.
(i) Determine the design resolution and alias structure. Can any two-way interactions be determined independently, without confounding? What does this imply in terms of what can be truly estimated from this much-reduced set of results?
(ii) Repeat the entire Exercise 19.14. How do the estimates of the main effects compare to the ones obtained from the full data set?
(iii) If the experimenter had only been interested in determining which of the 5 factors has a significant effect on the response, comment on the advantages/disadvantages of the quarter-fraction, 8-run design versus the full 32-run experimental design.
19.17 Consider an experiment to determine the effect of two factors, $x_1$ and $x_2$, on the response Y. Further, consider that the postulated model is:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \epsilon$$
Let the minimum allowable value for each factor be represented in coded form as $-1$ and the maximum allowable value as $+1$. In this case, the $2^2$ factorial design recommends the following settings for the factors:

X1    X2
-1    -1
 1    -1
-1     1
 1     1
(i) Show that for this $2^2$ design, if the model is written in the matrix form:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$
the Fisher information matrix (FIM) is given by
$$\mathbf{F}_I = \mathbf{X}^T\mathbf{X} = \begin{pmatrix} 4 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 \\ 0 & 0 & 4 & 0 \\ 0 & 0 & 0 & 4 \end{pmatrix}$$
a diagonal matrix; hence establish that the determinant is given by:
$$\left|\mathbf{X}^T\mathbf{X}\right| = 4^4$$
(ii) For any other selection of orthogonal experimental points for $x_1$ and $x_2$, such as shown in the table below,

X1    X2
-α    -α
 α    -α
-α     α
 α     α

where $0 < \alpha < 1$, show that the determinant of the FIM will be
$$\left|\mathbf{X}^T\mathbf{X}\right| = 4^4\alpha^8$$
Because $0 < \alpha < 1$, this determinant will be significantly less than the determinant of the FIM for the factorial design. (Since it can be shown that any other selection of four non-orthogonal points in the region bounded by the rectangle $-\alpha < x_1 < \alpha$; $-\alpha < x_2 < \alpha$ in the $x_1$-$x_2$ space will lead to even smaller determinants for the resulting FIM, this implies that for the given two-factor experiment, and the postulated model shown above, the $2^2$ factorial design maximizes the FIM and hence is the D-Optimal design for this problem.)
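The D-optimality comparison is easy to check numerically. The following minimal Python sketch (illustrative only) evaluates $|\mathbf{X}^T\mathbf{X}|$ for the ±1 corner points and for the shrunken ±α design of part (ii):

import numpy as np

def fim_det(points):
    # Determinant of X'X for the model columns 1, x1, x2, x1*x2.
    x1, x2 = points[:, 0], points[:, 1]
    X = np.column_stack([np.ones(len(points)), x1, x2, x1 * x2])
    return np.linalg.det(X.T @ X)

corners = np.array([[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]])
print(fim_det(corners))                      # 4^4 = 256: the 2^2 factorial
for alpha in (0.9, 0.5, 0.1):
    print(alpha, fim_det(alpha * corners))   # 4^4 * alpha^8 < 256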

APPLICATION PROBLEMS
19.18 The following table of data from Nelson (1989)$^{14}$ shows the cold cranking power of five different battery types, quantified as the number of seconds that a particular battery generated its rated amperage without falling below 7.2 volts, at 0°F. The experiment was replicated four times for each battery type.

                     Battery Type
Experiment No     1    2    3    4    5
      1          41   42   27   48   28
      2          43   43   26   45   32
      3          42   46   28   51   37
      4          46   38   27   46   25

When presented in Problem 12.22, the objective then was to identify any suggestion of descriptive (as opposed to inductive) evidence in the data set to support the postulate that some battery types are "better" than others.
(i) Now, use a one-way classification ANOVA to determine inductively if there is a difference in the cold cranking power of these battery types. What assumptions are required for this to be a valid test? Are these assumptions reasonable?
(ii) From a box plot of the data set, which battery types appear to be different from the others?
19.19 Refer to Problem 19.18 and the data table. Use a computer program to carry out a nonparametric Kruskal-Wallis test. Interpret your result. What conclusion does this result lead to regarding the equality of the cold cranking power of these battery types? Is this conclusion different from the one reached in Problem 19.18 using the ANOVA method?

$^{14}$Nelson, L.S., (1989). "Comparison of Poisson means," J. of Qual. Tech., 19, 173-179.
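As a sketch of what "use a computer program" might look like here, the Kruskal-Wallis test can be run in Python with SciPy on the five battery-type columns of Problem 19.18 (an illustration only, not the required solution):

from scipy import stats

# Cold cranking power (seconds) for the five battery types (Problem 19.18).
b1 = [41, 43, 42, 46]
b2 = [42, 43, 46, 38]
b3 = [27, 26, 28, 27]
b4 = [48, 45, 51, 46]
b5 = [28, 32, 37, 25]

H, p = stats.kruskal(b1, b2, b3, b4, b5)
print(H, p)   # reject equality of the groups at the 0.05 level if p < 0.05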


19.20 Moore et al. (1972)$^{15}$ present the following data on the weight gain (in pounds) for pigs sampled from five different litters.

        Litter Number
L1    L2    L3    L4    L5
23    29    38    30    31
27    25    31    27    33
26    33    28    28    31
19    36    35    22    28
30    32    33    33    30
      28    36    34    24
      30    34    29
      31    32    30

(i) At the $\alpha = 0.05$ significance level, test the hypothesis that the litter from which a pig is drawn has no effect on the weight gain. What is the conclusion of this test?
(ii) Had the test been conducted at the $\alpha = 0.01$ significance level, what conclusions would you have reached? Examine a box plot of the data and comment on what it suggests about the hypothesis.
19.21 The table below shows the time in months between occurrences of safety violations for three operators, A, B, and C, working in a toll manufacturing facility.

A   1.31  0.15  3.02  3.17  4.84  0.71  0.70  1.41  2.68  0.68
B   1.94  3.21  2.91  1.66  1.51  0.30  0.05  1.62  6.75  1.29
C   0.79  1.22  0.65  3.90  0.18  0.57  7.26  0.43  0.96  3.76

The data are clearly not normally distributed, since the phenomenon in question is typical of an exponential random variable. Carry out an appropriate test to validate the hypothesis that there is a difference between the safety performance of these operators. Justify your test choice and interpret your results adequately.
19.22 The following table, adapted from Gilbert (1973)$^{16}$, shows the sprinting speeds (in feet/second) for three types of fast animals, classified by sex (male or female). Analyze the data appropriately. From your analysis, what can you conclude about the effect of animal type and sex on speed?

$^{15}$Moore, P.G., A.C. Shirley, and D.E. Edwards, (1972). Standard Statistical Calculations, Pitman, Bath.
$^{16}$Gilbert, (1973). Biometrical Interpretation, Clarendon Press, Oxford.


Fast Animal Sprinting Speeds

               Animal Type
Sex    Cheetah   Greyhound   Kangaroo
         56         37          44
M        52         33          48
         55         40          47
         53         38          39
F        50         42          41
         51         41          36
19.23 A study of the effects of 3 components A, B, and C on the etched grid line width in a photolithography process was carried out using a $2^3$ factorial design with the following levels of each factor:

Factor   Low   High
  A      0%     6%
  B      8%    16%
  C      0%     3%

Generate a standard (un-replicated) design and analyze the following results for the response y = grid line width (in coded units), obtained in the series of 8 experiments run in randomized order but which, for the purpose of this problem, have been rearranged in the standard order as follows: 6.6, 7.0, 9.7, 9.4, 7.2, 7.6, 10.0, and 9.8. Which effects are significant?
19.24 Box and Bisgaard (1987)$^{17}$ present an experimental study of a process for manufacturing carbon-steel springs. The study was designed to identify process operating conditions that will minimize, or possibly eliminate, the cracks that have plagued the manufactured springs. From physics, material science, and process knowledge, the following factors and levels were chosen for the investigation, along with a $2^3$ factorial design with no replication, for a grand total of 8 experiments. The response of interest, y, is the proportion (in %) of the manufactured batch of springs that do not crack.

Factor                          Low    High
x1: Steel Temperature (°F)     1450    1600
x2: Carbon Content (%)          0.5     0.7
x3: Quench Oil Temp. (°F)        70     120

The result of the study is shown in the data table below, presented in standard order (the actual experiments were run in random order).
(i) Analyze the experimental results to determine which factors and interactions are significant. State clearly how you were able to determine significance, complete with standard errors of the estimated significant main effects and interactions.
(ii) Assess the model validity.
(iii) Interpret the results of this analysis in terms of what should be done if the objective is to increase the percentage of manufactured springs that do not crack.

$^{17}$Box, G.E.P. and S. Bisgaard, (1987). "The scientific context of quality improvement," Quality Progress, 22 (6), 54-61.

Run    x1    x2    x3     y
 1     -1    -1    -1    67
 2      1    -1    -1    79
 3     -1     1    -1    61
 4      1     1    -1    75
 5     -1    -1     1    59
 6      1    -1     1    90
 7     -1     1     1    52
 8      1     1     1    87

19.25 Palamakula et al. (2004)$^{18}$ carried out a study to determine a model to use in optimizing capsule dosage for a highly lipophilic compound, Coenzyme Q10. The study involved 3 factors, Limonene, Cremophor, and Capmul, in a Box-Behnken design; the primary response was the cumulative percentage of drug released after 5 minutes (even though the paper reported 4 other responses which served merely as constraints). The low and high settings for each variable are shown below (the middle setting is exactly halfway in between).

Factor                 Low    High
Limonene (mg)           18      81
Cremophor EL (mg)      7.2    57.6
Capmul GMO50 (mg)      1.8    12.6

The results of interest (presented in standard order) are shown in the table below. Generate a design, analyze the results, and decide on a model that contains only statistically significant parameters. Justify your decision. After you have completed your analysis, compare your findings with those reported in the paper.
Run (Std Order)    Response y      SD
      1               44.4        29.8
      2                6          9.82
      3                3.75       6.5
      4                1.82       1.07
      5               18.2        8.73
      6               57.8        9.69
      7               68.4        1.65
      8                3.95       3.63
      9               58.4        2.56
     10               24.8        5.80
     11                1.60       1.49
     12               12.1        0.84
     13               81.2        9.90
     14               72.1        7.32
     15               82.06      10.2

19.26 The following data set is from an experimental study reported in Garge (2007)$^{19}$, designed to understand the complex mechanisms involved in the reactive extrusion process. The primary variables of interest (which define the extruder operating conditions) are: melting zone barrel temperature, mixing zone temperature, feed rate, screw speed, base feed composition, and pulse composition. The response variables are the Melting Energy (J) and Reaction Energy (J). To determine which factors have statistically significant effects on the responses, experiments were conducted using a resolution IV, $2^{6-2}$ fractional factorial design, with the high and low settings for the factors chosen to ensure stable extruder operation under each operating condition. The design is shown below.

$^{18}$Palamakula, A., M.T.H. Nutan and M. A. Khan (2004). "Response Surface Methodology for optimization and characterization of limonene-based Coenzyme Q-10 self-nanoemulsified capsule dosage form." AAPS PharmSciTech, 5 (4), Article 66. (Available at http://www.aapspharmscitech.org/articles/pt0504/pt050466/pt050466.pdf.)
      Melting   Mixing    Inlet         Screw   Feed     Pulse
      Zone      Zone      Composition   Speed   Rate     Composition
Run   Temp(°C)  Temp(°C)  (% E)         (RPM)   (lb/h)   (% E)
 1      135       135        5           150      25       100
 2      135       135        5           250      25        10
 3      135       135        0           250      15       100
 4      150       150        5           250      25       100
 5      135       150        0           150      25        10
 6      150       150        0           250      15        10
 7      135       150        0           250      25       100
 8      135       135        0           150      15        10
 9      150       150        5           150      25        10
10      150       135        5           150      15       100
11      150       135        0           250      25       100
12      135       150        5           150      15        10
13      150       135        5           250      15       100
14      135       150        5           250      15       100
15      150       135        0           150      25        10
16      150       150        0           150      15       100

The results of the experiments (run in random order, but presented in standard order) are shown in the table below. For each response, analyze the data and determine which factors are significant. Comment on the model fit.

$^{19}$Garge, S., (2007). Development of an inference-based control scheme for reactive extrusion processes, PhD Dissertation, University of Delaware.
Run   Melting      Reaction
      Energy (J)   Energy (J)
 1      1700         1000
 2      1550          300
 3      1300         3700
 4       800          800
 5       800          100
 6       650            0
 7       650          800
 8       650            0
 9      1100            0
10      1000          600
11      1650          650
12       650          100
13      1100         2100
14       650         1400
15      1300            0
16       950         2150

Chapter 20

Application Case Studies III: Statistics

20.1 Introduction ........................................... 856
20.2 Prussian Army Death-by-Horse kicks ..................... 857
     20.2.1 Background and Data ............................. 857
     20.2.2 Parameter Estimation and Model Validation ....... 858
     20.2.3 Recursive Bayesian Estimation ................... 860
            Motivation, Background and Data ................. 860
            Theory: The Bayesian MAP Estimate ............... 862
            Application: Recursive Bayesian Estimation Formula 864
            Application: Recursive Bayesian Estimation Results 866
            Final Remarks ................................... 867
20.3 WW II Aerial Bombardment of London ..................... 868
20.4 US Population Dynamics: 1790-2000 ...................... 870
     20.4.1 Background and Data ............................. 870
     20.4.2 Truncated Data Modeling and Evaluation .......... 872
     20.4.3 Full Data Set Modeling and Evaluation ........... 873
            Future Prediction ............................... 874
     20.4.4 Hypothesis Testing Concerning Average Population Growth Rate 876
20.5 Process Optimization ................................... 878
     20.5.1 Problem Definition and Background ............... 879
     20.5.2 Experimental Strategy and Results ............... 879
            Planning ........................................ 879
            Design and Implementation ....................... 879
     20.5.3 Analysis ........................................ 880
            Optimization .................................... 883
            Confirmation .................................... 889
20.6 Summary and Conclusions ................................ 889
PROJECT ASSIGNMENTS ......................................... 890
     1. Effect of Bayesian Prior Distributions on Estimation  890
     2. First Principles Population Dynamics Modeling ....... 890
     3. Experimental Design and Analysis .................... 891
     4. Process Development and Optimization ................ 891

A good edge is good for nothing
if it has nothing to cut.

Thomas Fuller (1608-1661)

As most seasoned practitioners know, there is a certain art to the analysis of real data. It is not just that real life never quite seems to fit nicely into the neat little ideal boxes of our theoretical constructs; it is also all the little (and not so little) wrinkles unique to each problem that make data analysis what it is. Simplifying assumptions must be made, and strategies must be formulated for how best to approach the problem; and what works in one case may not
necessarily work in another. But even art has its foundational principles, and each art form its own peculiar set of tools. Whether it is capturing the detail in a charcoal-on-paper portrait of a familiar face, the deep perspectives in an oil-on-canvas painting of a vast landscape, or the rugged three-dimensional contours of a sculpture, the problem at hand is what ultimately recommends the tools. Such is the case with the three categories of problems selected for discussion in this final chapter in the trilogy of case studies. They have been selected to demonstrate the broad range of applications of the theory and principles of the past few chapters.

The first problem is actually a pair of distinct but similarly structured problems, staples of introductory probability and statistics courses that are frequently used to demonstrate the powers of the Poisson distribution in capturing the elusive character of rare events. The first, the famous von Bortkiewicz data set on death-by-horse kicks in the 19th century Prussian army, involves an after-the-fact analysis of unusual death; the second, an analysis of the aerial bombardment of London in the middle of World War II, is marked by a cannot-waste-a-minute urgency of the need to protect the living. One is pure analysis (to which we have added a bit of a wrinkle); the other is a brilliant use of analysis and hypothesis testing for practical life-and-death decision making.

The second problem involves the complete 21 decades of US census data from 1790 to 2000. By the standards of sheer volume, this is a comparatively modest data set (especially when compared to the data sets in the first problem, which are at least an order of magnitude larger). For a data set that could be analyzed in an almost limitless number of ways, it is interesting, as we show here, how a simple regression analysis, an examination of the residuals, and other such investigations can provide glimpses (smudges?) of the fingerprints left by history on a humble data set consisting of only 22 entries.

The third problem is more prosaic, coming from the no-nonsense world of industrial manufacturing. It involves process optimization, using strategic experimental designs and data analysis to meet difficult business objectives. But what this problem lacks in the macabre drama of the first, or in the rich history of the second, it tries to make up for in hard-headed practicality.

Which of these three sets of problems is the charcoal-on-paper portrait, the oil-on-canvas landscape, or the rugged three-dimensional sculpture is left to the reader's imagination.

20.1 Introduction

Having completed our intended course of discussion on descriptive, inferential, and experimental design aspects of statistics, the primary objective of this chapter is to present a few real-life problems to demonstrate how the concepts and ideas discussed in the preceding chapters have been (and continue to be) used to find appropriate solutions to important problems. The following is a brief catalog of the problems selected for discussion in this chapter, along with what aspects of statistics they illustrate:
1. The Prussian army death-by-horse kicks problem involves the analysis of the truly rare events implied in the title. It illustrates probability modeling, the characterization of a population using the concepts of sampling and estimation, and illustrates probability model validation. Because the data set is a 20-year record of events happening year-by-year, we add a wrinkle to this famous problem by investigating what could have happened had recursive Bayesian estimation been used to analyze the data, not all at once at the end of the 20-year period, but year-to-year. The question "what is such a model useful for?" is answered by a complementary problem, similar in structure but different in detail.

2. The Aerial Bombardment of London in World War II problem provides an answer to the question of practicality raised by the Poisson modeling and analysis of the Prussian army data. This latter problem picks up where the former one left off, by demonstrating how a Poisson model and an appropriately framed hypothesis test provided the basis for the strategic deployment of scarce anti-aircraft resources. In the historical context of the time, solving this latter problem was anything but an act of mere intellectual curiosity.

3. The US Population dynamics problem illustrates the power of simple regression modeling, and how it can be used judiciously to make predictions. It also illustrates how data analysis can be used almost forensically to find hidden clues in data sets.

4. The Process Optimization problem illustrates the use of design of experiments (especially response surface designs) to find optimum conditions for achieving manufacturing objectives in an industrial process.

20.2 Prussian Army Death-by-Horse kicks

20.2.1 Background and Data

In 1898, Ladislaus von Bortkiewicz, a Russian economist and statistician of Polish origin, published an important paper in the history of probability modeling of random phenomena and statistical data analysis$^1$.

$^1$L. von Bortkiewicz, (1898). Das Gesetz der Kleinen Zahlen, Leipzig, Teubner.

TABLE 20.1: Frequency distribution of Prussian army deaths by horse kicks

No. of Deaths    Number of occurrences of x deaths
      x          per unit-year (Total Frequency)
      0                    109
      1                     65
      2                     22
      3                      3
      4                      1
    Total                  200

The original paper contained data from a 20-year study, from 1875-1894, of 14 Prussian cavalry units, recording how many members of each cavalry unit died from a horse kick. Table 20.1 shows the popularized version of the data set, the frequency distribution of 200 observations (data from 10 units in 20 years). It is based on 10 of the 14 corps, after R.A. Fisher (of the F-distribution and ANOVA) in 1925 removed data from 4 corps because they were organized differently from the others.
In the original publication, von Bortkiewicz used the data to demonstrate that rare events (with low probability of occurrence), when considered in large populations, tend to follow the Poisson distribution. As such, if the random variable, X, represents the total number of deaths 0, 1, 2, 3, ... recorded by the Prussian army over this period, it should be very well-modeled as a Poisson random variable with the pdf:
$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!} \qquad (20.1)$$
where the parameter $\lambda$ is the mean number of deaths recorded per unit-year.

Observe that the phenomena underlying this problem fit those stated for the Poisson random variable in Chapter 8: the variable of interest is the total number of occurrences of a rare event; the events are occurring in a fixed interval of time (and location); and they are assumed to occur at a uniform average rate. The primary purpose here is twofold: to characterize this random variable, X, by determining the defining population parameter (in this case, the Poisson mean number of deaths per unit-year), and to confirm that the model is appropriate.
The data shown above is clearly from an observational study. No one
designed the experiment, per se (how could one?); the deaths were simply
recorded each year for each cavalry unit as they occurred. Nevertheless, there
is no evidence to suggest that the horses and their victims came in contact in
any other way than randomly. It is therefore reasonable to consider the data
as a random sample from this population of 19th century cavalry units.


TABLE 20.2: Actual vs Predicted Frequency distribution of Prussian army deaths

No. of Deaths    Number of occurrences of x deaths    Predicted
      x          per unit-year (Total Frequency)      Total Frequency
      0                    109                           108.7
      1                     65                            66.3
      2                     22                            20.2
      3                      3                             4.1
      4                      1                             0.6
      5                      0                             0.1
      6                      0                             0.0
    Total                  200                           200

20.2.2 Parameter Estimation and Model Validation

If the data set is considered a random sample, then the maximum likelihood estimate of the parameter, $\lambda$, is the sample average, which is obtained from the data presented in Table 20.1 as:
$$\bar{x} = \hat{\lambda} = \frac{(0 \times 109) + (1 \times 65) + (2 \times 22) + (3 \times 3) + (4 \times 1)}{200} = \frac{122}{200} = 0.61 \qquad (20.2)$$
Thus, the point estimate is
$$\hat{\lambda} = 0.61 \qquad (20.3)$$
and with sample variance $s^2 = 0.611$, the standard error of the estimate is
$$SE(\hat{\lambda}) = 0.041 \qquad (20.4)$$
The 95% confidence interval estimate is therefore
$$\hat{\lambda} = 0.61 \pm 1.96 \times 0.041 = 0.61 \pm 0.081 \qquad (20.5)$$
From the point estimate, the predicted relative frequency distribution, $\hat{f}(x)$, is therefore obtained according to:
$$\hat{f}(x) = f(x|\lambda = 0.61) = \frac{0.61^x e^{-0.61}}{x!} \qquad (20.6)$$
The corresponding predicted total frequencies (for n = 200) may now be computed as $n\hat{f}(x)$; the result is shown alongside the original data in Table 20.2, from where the agreement between data and model is seen to be quite good.
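For readers who wish to reproduce these numbers outside MINITAB, a minimal Python sketch (illustrative only) of the Table 20.2 computation is:

import numpy as np
from scipy import stats

x = np.arange(5)                       # observed death counts 0..4
freq = np.array([109, 65, 22, 3, 1])   # Table 20.1 frequencies
n = freq.sum()                         # 200 unit-years

lam_hat = (x * freq).sum() / n         # MLE: 122/200 = 0.61
predicted = n * stats.poisson.pmf(np.arange(7), lam_hat)
print(lam_hat)                         # 0.61
print(np.round(predicted, 1))          # 108.7, 66.3, 20.2, 4.1, 0.6, 0.1, 0.0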
To quantify how well the Poisson model fits the data, we may now carry out the Chi-squared goodness-of-fit test using MINITAB (recall the discussion in Chapter 17). The result is shown below.


Goodness-of-Fit Test for Poisson Distribution
Data column: X
Frequency column: Actual
Poisson mean for X = 0.61
Poisson
Contribution
X
Observed Probability Expected
to Chi-Sq
0
109
0.543351
108.670
0.001001
1
65
0.331444
66.289
0.025057
2
22
0.101090
20.218
0.157048
>=3
4
0.024115
4.823
0.140417
N
DF Chi-Sq P-Value
200 2 0.323524
0.851
1 cell(s) (25.00%) with expected value(s) less than 5.
With the computed chi-squared statistic, $C^2 = 0.323$, being so small, and an associated p-value of 0.851, i.e.,
$$P(\chi^2(2) > 0.323) = 0.851 \qquad (20.7)$$
we have no evidence to support rejecting the null hypothesis (at the significance level of 0.05), and therefore conclude that the model provides an adequate fit to the data. Note that the last two frequency groups, corresponding to X = 3 and X = 4, had to be combined for the chi-squared test; and even then, the expected frequency, $n\hat{f} = 4.823$, fell just short of the required 5. MINITAB identifies this and prints out a warning. However, this is not enough to invalidate the test.

A graphical representation of the observed versus predicted frequency, along with a bar graph of the individual contributions to the chi-squared statistic from each frequency group, is shown in Fig 20.1.
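The chi-squared computation behind this MINITAB output can be reproduced with a short Python/SciPy sketch (an illustration, not the text's own code). Note the pooling of the last bins into "x >= 3", and the ddof=1 adjustment for the single estimated parameter, which leaves 4 - 1 - 1 = 2 degrees of freedom:

from scipy import stats

observed = [109, 65, 22, 4]                 # x = 0, 1, 2, >=3 (pooled)
probs = [stats.poisson.pmf(k, 0.61) for k in range(3)]
probs.append(1.0 - sum(probs))              # P(X >= 3)
expected = [200 * p for p in probs]         # 108.670, 66.289, 20.218, 4.823

# ddof=1: one parameter (lambda) was estimated, so df = 4 - 1 - 1 = 2.
chisq, pval = stats.chisquare(observed, expected, ddof=1)
print(round(chisq, 6), round(pval, 3))      # approx. 0.323524, 0.851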

20.2.3 Recursive Bayesian Estimation

Motivation, Background and Data

As good a fit as the Poisson model provided to the Prussian army data, it should not be lost on the reader that it took 20 years to accumulate the data. That it took so long to acquire this now famously useful data set is understandable: by definition, stable analysis of such rare-events phenomena requires the accumulation of sufficient data to allow enough occurrences of these rare events. This brings up an interesting idea: is it possible to obtain parameter estimates sequentially from this data set as it is being built up, from year-to-year, so that analysis need not wait until all the data is available in its entirety? The obvious risk is that such year-to-year estimates will be unreliable

FIGURE 20.1: Chi-Squared test results for Prussian army death-by-horse kicks data and a postulated Poisson model. Top panel: Bar chart of Expected and Observed frequencies; Bottom panel: Bar chart of contributions to the Chi-squared statistic.


TABLE 20.3: Year-by-Year, Unit-by-Unit breakdown of Prussian army deaths data

                          Unit
Year    1   2   3   4   5   6   7   8   9  10   Total
  1     1   0   2   0   0   0   1   0   0   0     4
  2     0   1   0   0   1   0   0   1   0   0     3
  3     0   1   0   0   2   0   1   2   0   0     6
  4     0   1   3   1   0   0   1   0   0   1     7
  5     0   0   2   0   1   4   0   2   1   0    10
  6     0   0   2   0   1   0   2   1   0   1     7
  7     1   1   1   1   0   1   1   0   0   1     7
  8     0   1   0   2   0   0   1   0   0   1     5
  9     1   0   1   0   1   0   3   0   1   1     8
 10     0   0   2   0   1   0   0   1   0   0     4
 11     1   1   0   3   0   2   1   0   0   0     8
 12     1   0   1   1   0   0   0   0   2   0     5
 13     1   0   0   0   2   0   1   1   2   2     9
 14     0   0   0   0   0   0   0   1   1   0     2
 15     1   1   0   1   0   0   0   2   0   0     5
 16     1   0   2   1   0   1   1   1   0   1     8
 17     0   1   0   2   1   0   2   0   1   1     8
 18     0   1   2   2   0   1   0   0   0   0     6
 19     2   0   1   0   0   2   1   0   0   1     7
 20     0   0   0   1   0   0   0   1   1   0     3

because they will be based on sparse and information-deficient data, at least until sufficient, reliably stable information has accumulated in the growing data set. Still, is it even possible to obtain such sequential estimates? And if so, what "trajectory" will these interim estimates take in their approach to the final full-data-set estimate?

The answer to the first question is yes; and Bayesian estimation is one technique for implementing such a sequential estimation strategy. And to provide a specific answer to the second question, we now present a Bayesian procedure for sequentially estimating the Poisson parameter, simulating how the analysis might have progressed in "real-time," year-by-year, if one imagined that the data became available every year as shown in Table 20.3 for the 10 units.
Theory: The Bayesian MAP Estimate

Bayesian estimation, as we may recall, requires that we first specify a prior distribution to be combined with the sampling distribution to obtain the posterior distribution, from which the parameter estimates are then obtained. In Chapter 14, we recommended the use of conjugate priors wherever possible. For this particular problem involving a Poisson random variable, the Gamma distribution is the conjugate prior for estimating the Poisson parameter. Thus, if we consider as a prior distribution for this unknown parameter, $\lambda$, the gamma (a, b) distribution, i.e.,
$$f(\lambda) = \frac{1}{b^a \Gamma(a)}\lambda^{a-1}e^{-\lambda/b} \qquad (20.8)$$

we can then obtain the posterior distribution, $f(\lambda|x_1, x_2, \ldots, x_n)$, after combining this prior with the sampling distribution for a random sample drawn from the Poisson distribution. From this posterior distribution, we will use the maximum a posteriori (MAP) estimate as our choice for the parameter estimate.

Now, the sampling distribution for a random sample, $X_1, X_2, \ldots, X_n$, from the Poisson distribution is given by:
$$f(x_1, x_2, \ldots, x_n|\lambda) = \frac{\lambda^{\left(\sum_{i=1}^{n} x_i\right)} e^{-n\lambda}}{x_1! x_2! \cdots x_n!} \qquad (20.9)$$
By Bayes' theorem, using the given prior, the posterior distribution is:
$$f(\lambda|x_1, x_2, \ldots, x_n) = C\,\frac{1}{b^a \Gamma(a)\, x_1! x_2! \cdots x_n!}\,\lambda^{\left(a-1+\sum_{i=1}^{n} x_i\right)} e^{-\lambda(n+1/b)} \qquad (20.10)$$
where C is the usual normalizing constant. The MAP estimate is obtained from here by taking derivatives with respect to $\lambda$, equating to zero, and solving; the same results are obtained by maximizing $\ln f$; i.e.,
$$\frac{\partial \ln f}{\partial \lambda} = \left(a - 1 + \sum_{i=1}^{n} x_i\right)\frac{1}{\lambda} - \left(n + \frac{1}{b}\right) \qquad (20.11)$$
which, upon equating to zero and solving, gives the result:
$$\hat{\lambda}_{MAP} = \frac{\left(\sum_{i=1}^{n} x_i\right) + (a-1)}{n + \frac{1}{b}} \qquad (20.12)$$
Of course, depending on the value specified for a and b in the prior distribution, Eq (20.12) will return different estimates for the unknown parameter. In particular, observe that for $a = 1$ and $b = \infty$, the result is precisely the same as the sample average. This is one of the main criticisms of Bayesian estimation: that the subjectivity inherent in the choice of the prior distribution introduces bias into the estimate.

However, it can be shown that this bias is traded off for a smaller parameter estimate variance. Furthermore, when used recursively, whereby the posterior distribution at the current time is used as the prior distribution in the next round, the bias introduced at the beginning by the original prior distribution progressively "washes out" with each iteration. Thus, in return for the initial bias, this recursive strategy allows one to carry out estimation sequentially without having to wait for the full data set to be completely accumulated, obtaining more reliable estimates along the way.


FIGURE 20.2: Initial prior distribution, a Gamma(2, 0.5), used to obtain a Bayesian estimate for the Poisson "mean number of deaths per unit-year" parameter.

It is possible to show that, if the data sequence is stable, the Bayesian estimates, $\hat{\lambda}(j)$, obtained at the $j$th iteration will converge to the true value of $\lambda$ in the limit as $j \to \infty$. We now proceed to show, using the data in Table 20.3, just how quickly the Bayesian estimate converges to the "true" estimate of the mean number of deaths per unit-year.
Application: Recursive Bayesian Estimation Formula

Let us consider that at the beginning of the data gathering, knowing nothing more than what may have transpired in the past, it seems reasonable to suppose that the mean number of deaths per unit-year will be a number that lies somewhere between 0 and 4 (few units report deaths exceeding 4 in a year and, since these manner of deaths are rare events, many units will report no deaths). As a result, we start by selecting a prior distribution, Gamma(2, 0.5), whose pdf is shown in Fig 20.2. From the plot, we observe that this prior indeed expresses the belief that the unknown mean number of deaths per unit-year lies somewhere between 0 and 4. The distribution is fairly broad, indicating our fairly substantial level of uncertainty about this unknown parameter. The probability assigned to the higher numbers may be small, in keeping with a common sense assessment of the phenomenon, but these probabilities are not zero.

Next, given this prior and the 10 data points for the first year in the first row of the data table,
$$x_1 = 1, x_2 = 0, \ldots, x_{10} = 0$$
we can now use Eq (20.12) to obtain $\hat{\lambda}(1)$, the MAP estimate at the end of the first year. With $a = 2$, $b = 0.5$, and $\sum_{i=1}^{10} x_i = 4$, we obtain
$$\hat{\lambda}(1) = \frac{5}{12} = 0.4167 \qquad (20.13)$$
as the MAP estimate at the end of the first year.


In general, if (k) is the MAP estimate obtained after the k th year, (i.e.,
after a total of k-years worth of data), it is possible to establish that:
'

(
10
x
(k+1) (k)
(20.14)
(k+1) = (k) +

10k + 10 + b
where b = 1/b, and x
(k+1) is the arithmetic average of the 10 numbers constituting the year (k + 1) data, dened by:
x
(k+1) =

1
10

10k+10


xi

(20.15)

i=10k+1

This result is established as follows. From the general expression in Eq (20.12), we obtain that $\hat{\lambda}(k+1)$ is given by:
$$\hat{\lambda}(k+1) = \frac{\sum_{i=1}^{10k+10} x_i + a - 1}{10k + 10 + b^*} \qquad (20.16)$$
since, in this case, the total number of data points used by the $(k+1)$th year is $10k + 10$, with $b^* = 1/b$. It is convenient to rearrange this equation as:
$$(10k + 10 + b^*)\hat{\lambda}(k+1) = \sum_{i=1}^{10k+10} x_i + a - 1 \qquad (20.17)$$
Similarly, for the $k$th year, we obtain
$$(10k + b^*)\hat{\lambda}(k) = \sum_{i=1}^{10k} x_i + a - 1 \qquad (20.18)$$
Subtracting Eq (20.18) from Eq (20.17) gives:
$$(10k + 10 + b^*)\hat{\lambda}(k+1) = (10k + b^*)\hat{\lambda}(k) + \sum_{i=10k+1}^{10k+10} x_i \qquad (20.19)$$
which may be rearranged to give:
$$(10k + 10 + b^*)\hat{\lambda}(k+1) = (10k + 10 + b^*)\hat{\lambda}(k) - 10\hat{\lambda}(k) + 10\bar{x}(k+1) \qquad (20.20)$$
with
$$\bar{x}(k+1) = \frac{1}{10}\sum_{i=10k+1}^{10k+10} x_i$$


TABLE 20.4: Recursive (yearly) Bayesian estimates of the mean number of deaths per unit-year

After year j   $\hat{\lambda}(j)$     After year j   $\hat{\lambda}(j)$
     1           0.4167                   11           0.6250
     2           0.3636                   12           0.6148
     3           0.4375                   13           0.6364
     4           0.5000                   14           0.6056
     5           0.5962                   15           0.5987
     6           0.6129                   16           0.6111
     7           0.6250                   17           0.6221
     8           0.6098                   18           0.6209
     9           0.6304                   19           0.6250
    10           0.6078                   20           0.6089

from where the required result is easily established:
$$\hat{\lambda}(k+1) = \hat{\lambda}(k) + \left[\frac{10}{10k + 10 + b^*}\right]\left(\bar{x}(k+1) - \hat{\lambda}(k)\right) \qquad (20.21)$$
(This expression is reminiscent of the expression for the recursive least squares estimate given in Eq (16.190) in Chapter 16.)
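For concreteness, a minimal Python sketch of this recursion (illustrative only), using the yearly totals from Table 20.3 with the Gamma(2, 0.5) prior, reproduces the estimates of Table 20.4:

# Yearly totals of deaths across the 10 units (Table 20.3, "Total" column).
totals = [4, 3, 6, 7, 10, 7, 7, 5, 8, 4, 8, 5, 9, 2, 5, 8, 8, 6, 7, 3]

a, b = 2.0, 0.5                              # Gamma(2, 0.5) prior
b_star = 1.0 / b                             # b* = 1/b = 2

lam = (totals[0] + a - 1) / (10 + b_star)    # lambda_hat(1) = 5/12 = 0.4167
print(1, round(lam, 4))
for k in range(1, 20):                       # Eq (20.21), year by year
    xbar = totals[k] / 10.0
    lam += 10.0 * (xbar - lam) / (10 * k + 10 + b_star)
    print(k + 1, round(lam, 4))              # matches Table 20.4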
Application: Recursive Bayesian Estimation Results

Beginning with $\hat{\lambda}(1) = 0.4167$ obtained above, the recursive expression in Eq (20.14) may now be used along with the supplied data; the result is the sequence of MAP estimates, $\hat{\lambda}(j)$, after each $j$th year, shown in Table 20.4 for $j = 2, 3, \ldots, 20$.

A plot of these recursive estimates is shown in Fig 20.3 in the solid line, with the dashed line representing the single estimate (0.61) obtained earlier using the entire 20-year data all at once. Observe now that an argument could have been made for stopping the study sometime after the 5th or 6th year since, by then, the recursive estimate has essentially and observably settled down to within a small tolerance of the final value of 0.61. To be sure that enough time would have elapsed to ascertain that the value has truly settled down, year 7 or 8, or even up to year 10, would all be good recommendations for the stopping year. The point is that this recursive method would have provided a stable estimate close to the final one long before the 20 years had elapsed. While the observed convergence to the maximum likelihood estimate demonstrates the "washing out" of the initial prior distribution, it will still be interesting to determine the final posterior distribution and compare it to the original prior distribution.
It can be shown (and this is left as an exercise to the reader) that the final

FIGURE 20.3: Recursive Bayesian estimates using yearly data sequentially, compared with the standard maximum likelihood estimate, 0.61 (dashed line).
posterior distribution is:
$$f(\lambda|x_1, x_2, \ldots, x_n) = C_1 \lambda^{123} e^{-202\lambda} \qquad (20.22)$$
where $C_1$ is the normalizing constant for a gamma (124, 1/202) distribution, or,
$$C_1 = \frac{202^{124}}{\Gamma(124)} \qquad (20.23)$$
A plot of this distribution, along with the prior gamma (2, 0.5) distribution, is shown in Fig 20.4. The key characteristic of this posterior distribution is that it is very sharply concentrated around its mean value of 0.614 and mode of 0.609, both of which are practically the same as the maximum likelihood value of 0.61 obtained earlier. This posterior distribution stands in very sharp contrast to the very broad prior distribution, demonstrating how the information obtained recursively from the accumulating data has reduced the uncertainty expressed in the initial prior distribution.
Final Remarks

The reader will be forgiven for asking: what is the point of this entire exercise? After all, the soldiers whose somewhat undignified deaths (undignified, that is, for professional soldiers) have contributed to the data will not be coming back as a result of this exercise. Furthermore, it is not as if (at least as far as we know) the original analysis led to a change in Prussian army policy that yielded improved protection for the soldiers against such deaths. A Poisson model fit the data well; the model parameter has been

FIGURE 20.4: Final posterior distribution, Gamma(124, 0.005) (dashed line), along with the initial prior distribution, Gamma(2, 0.5) (solid line).

estimated as roughly 0.6, implying that, on average, there will be 6 deaths per 10 unit-years. But what else can such a model be used for? Admittedly, the original analysis was more "just for the exercise," to demonstrate, at the time, that the Poisson pdf is the model of choice for such phenomena. Without the historical perspective, this may appear like a pointless exercise to modern-day practitioners. But with the next problem, we show how a similar Poisson model provided the crucial tool for deciding where to deploy scarce anti-aircraft artillery.

20.3 WW II Aerial Bombardment of London

At a critical point in World War II, after London had suffered a devastating period of sustained aerial bombardment, the British government had to make some strategic decisions about where to deploy limited anti-aircraft artillery for maximum effectiveness. How British researchers approached and solved this problem provides yet another example of how random phenomena are subject to rigorous analysis, and illustrates how probability and statistics can be used effectively for solving important real-life problems.

In his 1946 paper$^2$ about this problem, R. D. Clarke noted:

$^2$Clarke, R. D. (1946). "An application of the Poisson Distribution," J. Inst. Actuaries, 72, 48-52.

TABLE 20.5: Frequency distribution of bomb hits in greater London during WW II and Poisson model prediction

No. of bomb hits    Number of regions         Poisson Model
       x            sustaining x bomb hits    Prediction
       0                    229                   227
       1                    211                   211
       2                     93                    99
       3                     35                    31
       4                      7                     7
       5                      0                     1
       6                      0                     0
       7                      1                     0
     Total                  576                   576
"During the flying-bomb attack on London, frequent assertions were made that the points of impact of the bombs tended to be grouped in clusters. It was accordingly decided to apply a statistical test to discover whether any support could be found for this allegation."

To determine whether or not the aerial bombardment was completely haphazard, with no pattern whatsoever, the greater London terrain was sub-divided into a 24 × 24 grid, or 576 small square regions, each 0.25 square kilometers in size, and the number of bomb hits in each small region tallied. If the hits are random, a Poisson distribution should fit the data. Indeed, as the data and the Poisson model prediction in Table 20.5 show, this appears to be the case.
The Poisson parameter estimate, the sample average, obtained as
$$\hat{\lambda} = 0.932 \qquad (20.24)$$
is used in the model
$$\hat{f}(x) = \frac{\hat{\lambda}^x e^{-\hat{\lambda}}}{x!} \qquad (20.25)$$
to obtain the model prediction shown. A Chi-squared goodness-of-fit test (not shown; left as an exercise to the reader) indeed confirms that the model is a good fit. But our task is not yet complete.
The real issue to be resolved, we must keep in mind, is whether or not certain regions were targeted for extra bombing. Identifying such regions will be crucial to the strategic deployment of anti-aircraft artillery. With this in mind, we draw attention to the entry for x = 7, indicating that one region received seven bomb hits. Observe that no region received 6 hits, or even 5 hits; and 7 regions received four hits each (predicted perfectly by the Poisson model). This all makes one wonder: what are the chances that one region out of 576 will receive seven or more hits purely at random?


This is where the validated model becomes useful, since we can use it to compute this particular probability as
$$P(X \geq 7|\lambda = 0.93) = 0.000054 \qquad (20.26)$$
We have retained all these decimal places to make a point. Cast in the form of a hypothesis test, this statement declares that, at the 0.05 significance level, for any region to receive 7 or more hits is either the absolute rarest of rare events (about one in 20,000 tries), or the region in fact must have been deliberately targeted. The anti-aircraft artillery were therefore deployed in this location, and the rest is history.
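The tail probability in Eq (20.26) is a one-line computation; a minimal Python/SciPy sketch (illustrative only) follows:

from scipy import stats

p = stats.poisson.sf(6, 0.93)   # P(X >= 7) = P(X > 6) for lambda = 0.93
print(p)                        # approx. 5.4e-05
print(576 * p)                  # expected number of such regions by chance:
                                # approx. 0.03, i.e., essentially none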

20.4 US Population Dynamics: 1790-2000

20.4.1 Background and Data

The US population, to the nearest million, as determined from census records every decade from 1790 to 2000, is shown in Table 20.6. The primary purpose of this case study is to use this data set to develop a model of how the US population has changed over the last two centuries, analyze the model for any insight, and then use it to predict the next observation in the series, the 2010 census result.
the 2010 census result.
We begin by noting that this data set belongs to the category known as "time-series data," where each observation occurs in a sequential series in time. The distinguishing characteristic of such data is the most obvious: the observations are correlated in time. If for no other reason, at least we know from natural population growth fundamentals in living organisms that the size of a population at the current time is a function of the size in the previous time period. This intrinsic correlation endows such data sets with properties that call for special "time-series analysis" tools; but these are beyond the scope of this book. Such data sets can also be analyzed using population growth models based on biology; but this requires making some fairly strong assumptions about the net population growth rate. Such assumptions may be difficult to justify, especially for a country with such complex immigration patterns as the US, and for a period spanning 210 years.

It is also possible, however, to use the tools we have discussed thus far, and this is what we plan to do in this section. For example, let us recall that a scatter plot of this very data set was shown in Chapter 12 (Fig 12.18). While the plot has the typical exponential growth characteristics of healthy populations, it also suggests that a quadratic regression model might be a good starting point. Fitting a quadratic model of the type
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon \qquad (20.27)$$

TABLE 20.6: US Population (to the nearest million) from 1790-2000

Census Year    Population (millions)
   1790               4
   1800               5
   1810               7
   1820              10
   1830              13
   1840              17
   1850              23
   1860              31
   1870              40
   1880              50
   1890              63
   1900              76
   1910              92
   1920             106
   1930             123
   1940             132
   1950             151
   1960             179
   1970             203
   1980             227
   1990             249
   2000             281

is a fairly straightforward task that we will get to in a moment. For now, because one of the objectives is to predict the next observation (the 2010 census result), we must be conscious of the fact that using a regression model to predict outside the data range is usually not a good idea.

Therefore, we propose to evaluate, first, how useful the regression model can be as a one-step and/or multiple-step ahead predictor. We plan to do this in an objective fashion as follows: we will truncate the data at 1970 and use this deliberately abbreviated data set to obtain a regression model that will then be used to "predict" the missing 1980, 1990, and 2000 population results. How well the model predictions match the actual census data will provide an objective assessment of how reasonable the regression approach can be in this regard.

20.4.2 Truncated Data Modeling and Evaluation

Let us define a normalized year variable as:
$$x = \frac{\text{Census Year} - 1790}{10} + 1 \qquad (20.28)$$
which essentially assigns the natural numbers 1, 2, etc., to the census years, so that the first year, 1790, is 1, the second, 1800, is 2, etc. A regression analysis of the truncated data (only up to x = 19, or 1970) in MINITAB produces the following results. The regression model itself is:
$$\hat{y}_t = 6.14 - 1.86x + 0.633x^2 \qquad (20.29)$$
where $\hat{y}_t$ is the truncated-data model prediction, with the detailed MINITAB output shown below.
Regression Analysis: Population-t versus Xt, Xt2

The regression equation is
Population-t = 6.14 - 1.86 Xt + 0.633 Xt2

Predictor      Coef  SE Coef      T      P
Constant      6.137    2.139   2.87  0.011
Xt          -1.8630   0.4924  -3.78  0.002
Xt2         0.63254  0.02392  26.44  0.000

S = 2.78609   R-Sq = 99.8%   R-Sq(adj) = 99.8%

These results indicate that all three parameters are significantly different from zero at the $\alpha = 0.05$ level, and that the regression model explains a significant amount of the variability in the data (to the tune of 99.8%) without using an excessive number of parameters.

Using MINITAB's "New Observation" prediction feature produces the following results for the model prediction of the 1980 (Obs 1), 1990 (Obs 2), and

2000 (Obs 3) census results, respectively.

New Obs      Fit  SE Fit        95% CI                95% PI
  1      221.892   2.139  (217.358, 226.425)  (214.446, 229.337)X
  2      245.963   2.607  (240.437, 251.488)  (237.874, 254.051)XX
  3      271.299   3.131  (264.660, 277.937)  (262.413, 280.184)XX

X denotes a point that is an outlier in the predictors.
XX denotes a point that is an extreme outlier in the predictors.

Note how MINITAB flags all three predictions as "outliers," since they truly lie outside of the data range. Nevertheless, we now observe that the true values, $y_{1980} = 227$ and $y_{1990} = 249$, fall nicely within the respective 95% prediction intervals, and the true value, $y_{2000} = 281$, falls just outside the high limit of the prediction interval. The implication is that, all things being equal, a regression model based on the full data set should be able to provide an acceptable one-step ahead prediction.
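The truncated-data fit and its extrapolations are easily reproduced outside MINITAB. The following is a minimal Python sketch (illustrative only), using ordinary least squares on the 1790-1970 data:

import numpy as np

# Populations (millions), 1790-1970 (x = 1, ..., 19 per Eq (20.28)).
pop = np.array([4, 5, 7, 10, 13, 17, 23, 31, 40, 50,
                63, 76, 92, 106, 123, 132, 151, 179, 203], dtype=float)
x = np.arange(1, 20, dtype=float)
X = np.column_stack([np.ones_like(x), x, x**2])

beta, *_ = np.linalg.lstsq(X, pop, rcond=None)
print(np.round(beta, 4))            # approx. [6.137, -1.863, 0.6325]

for xnew in (20, 21, 22):           # 1980, 1990, 2000
    yhat = beta @ [1.0, xnew, xnew**2]
    print(xnew, round(yhat, 1))     # approx. 221.9, 246.0, 271.3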

20.4.3 Full Data Set Modeling and Evaluation

Repeating the entire exercise, this time using the full census data set, produces the following result. This time, the model is
$$\hat{y} = 7.92 - 2.47x + 0.668x^2 \qquad (20.30)$$
where we note that the model coefficients have not changed too drastically. The rest of the MINITAB output is shown below.
Regression Analysis: Population versus Xn, Xn2

The regression equation is
Population = 7.92 - 2.47 Xn + 0.668 Xn2

Predictor      Coef  SE Coef      T      P
Constant      7.916    2.101   3.77  0.001
Xn          -2.4735   0.4209  -5.88  0.000
Xn2         0.66763  0.01777  37.57  0.000

S = 2.99130   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Once again, the parameter estimates are seen to be significant; and the $R^2$ and $R^2_{adj}$ values have even improved slightly. The ANOVA table (not shown) does not show anything out of the ordinary. A plot of the data and the regression model fit, along with both the 95% confidence interval and the 95% prediction interval, is shown in Fig 20.5.

This figure seems to show that the fit is particularly good, with very little uncertainty around the model prediction, as implied by the very tight confidence and prediction intervals. However, MINITAB flagged two residuals that

FIGURE 20.5: Quadratic regression model fit to US Population data, along with both the 95% confidence interval and the 95% prediction interval. (Fitted line: Population = 7.916 - 2.474 XNormYear + 0.6676 XNormYear**2.)

appear unusually large:

Unusual Observations
Obs    Xn  Population      Fit  SE Fit  Residual  St Resid
 16  16.0     132.000  139.253   0.859    -7.253    -2.53R
 17  17.0     151.000  158.811   0.863    -7.811    -2.73R

R denotes an observation with a large standardized residual.

The usual (standardized) residual plots shown in Fig 20.6 do in fact indicate that the residuals do not look normally distributed at all; if anything, they look serially correlated (not surprising). In particular, the two observations flagged by MINITAB are marked in the top panel of this figure. These observations belong to the census years 1940 and 1950, with the residuals indicating that the model significantly over-estimated the populations during these years (in other words, according to this model, the true population counts in 1940 and in 1950 were significantly lower than expected). It is left as an exercise to the reader to suggest possible explanations for what could be responsible for a lower-than-expected population count in 1940 and 1950.

Future Prediction

Despite the unusual residuals, such a simple model provides a surprisingly reasonable representation of the census data. Using the model to predict the

FIGURE 20.6: Standardized residuals from the regression model fit to US Population data. Top panel: Residuals versus observation order; Bottom panel: Normal probability plot. Note the left-over pattern indicative of serial correlation, and the unusual observations identified for the 1940 and 1950 census years in the top panel; note also the general deviation of the residuals from the theoretical normal probability distribution line in the bottom panel.

876

Random Phenomena

2010 census result produces the following MINITAB result.


New Obs      Fit  SE Fit               95% CI               95% PI
      1  304.201   2.101   (299.803, 308.600)   (296.550, 311.853)X
The implication is that, to the nearest million,

y2010 = 304; and 297 < y2010 < 312        (20.31)

a point estimate of 304 million along with the indicated 95% prediction interval. We personally believe that this probably underestimates what the true 2010 census result will be. The potential unaccounted-for phenomena that are likely to affect this prediction include, but are not limited to:
• The increasingly complex immigration patterns over the past 20 years;
• The changing mean rate of reproduction among recent immigrants and among long-term citizens;
• The influence of medical advances on life expectancy.
The reader is encouraged to think of any other potential factors that may
contribute to rendering the prediction inaccurate.
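Continuing the hypothetical Python sketch above, the 2010 point prediction and 95% prediction interval of Eq (20.31) would be computed as follows (Xn = 23 for 2010 under the assumed normalization):

import numpy as np
X_new = np.array([[1.0, 23.0, 23.0**2]])      # [1, Xn, Xn^2] for 2010
pred = fit.get_prediction(X_new)
print(pred.predicted_mean)                    # point prediction
print(pred.conf_int(obs=True, alpha=0.05))    # 95% prediction interval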

20.4.4  Hypothesis Testing Concerning Average Population Growth Rate

It is difficult to imagine that the average population growth rate would have remained constant from decade to decade from 1790 until modern times. To investigate this phenomenon, we generate from the census data a table of percent relative population growth rate from 1800-2000, defined as follows:

η(Y) = {[P(Y) - P(Y - 10)] / P(Y - 10)} × 100%        (20.32)

where P(Y) is the population recorded in year Y. The resulting 21 values may be divided into 3 even periods of 70 years each, as follows: the period from 1800-1860 is designated as Period 1, 1870-1930 as Period 2, and 1940-2000 as Period 3. The result is shown in Table 20.7, indicating, for example, that from 1790 to 1800, the US population experienced an average relative growth rate of 25%, from 4 million to 5 million, etc.
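As a small illustration of Eq (20.32) in Python: with the rounded early census counts of 4, 5, 7, and 10 million, the computation below reproduces the first three entries of Table 20.7 (25.0000, 40.0000, 42.8571).

import numpy as np
pop = np.array([4.0, 5.0, 7.0, 10.0])             # P(Y) every 10 years, millions
growth = 100.0 * (pop[1:] - pop[:-1]) / pop[:-1]  # Eq (20.32)
print(growth)                                     # [25.  40.  42.857...]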
The postulate to be tested is that the average population growth rate has remained essentially constant over each of the three periods. Fig 20.7 is a plot of this relative growth rate against census year; it shows many distinctive features, but the most obvious is that the high growth rate of the early Period 1 gave way to a decidedly sharp decline beginning in 1860, perhaps triggered by the Civil War. The decline appears to end in 1940, at the beginning of Period 3, a period marked by a steady increase in relative growth rate through 1960, perhaps driven by the baby boom. The intermediate year 1945 (during WW II) and the census year, 1960, are marked on the graph for reference.

TABLE 20.7: Percent average relative population growth rate for each census year from 1800-2000, divided into three 70-year periods

Census   Average Rel.
Year     Growth Rate (%)   Period
1800        25.0000          1
1810        40.0000          1
1820        42.8571          1
1830        30.0000          1
1840        30.7692          1
1850        35.2941          1
1860        34.7826          1
1870        29.0323          2
1880        25.0000          2
1890        26.0000          2
1900        20.6349          2
1910        21.0526          2
1920        15.2174          2
1930        16.0377          2
1940         7.3171          3
1950        14.3939          3
1960        18.5430          3
1970        13.4078          3
1980        11.8227          3
1990         9.6916          3
2000        12.8514          3

FIGURE 20.7: Percent average relative population growth rate in the US for each census year from 1800-2000, divided into three equal 70-year periods. Period 1: 1800-1860; Period 2: 1870-1930; Period 3: 1940-2000.

A formal one-way ANOVA test of equality of the average relative growth rates for these 3 periods shows the following not-too-surprising result:

Results for: USPOPULATION.MTW
One-way ANOVA: GrowthRate versus Period

Source   DF      SS     MS      F      P
Period    2  1631.9  816.0  32.00  0.000
Error    18   458.9   25.5
Total    20  2090.9

S = 5.049   R-Sq = 78.05%   R-Sq(adj) = 75.61%


The indication is, of course, that there is a significant difference in the average relative growth rates in each period. A normal probability plot of the residuals from the one-way ANOVA model (after the "Period" effects, the average relative growth rates for each period, have been estimated) is shown in Fig 20.8. Visually, the normality assumption appears to be reasonable.

FIGURE 20.8: Normal probability plot for the residuals from the ANOVA model for percent average relative population growth rate versus Period, with Period 1: 1800-1860; Period 2: 1870-1930; Period 3: 1940-2000.
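The same one-way ANOVA is easy to sketch in Python with the growth rates of Table 20.7 grouped by period; scipy is used here simply in place of MINITAB:

from scipy import stats
period1 = [25.0000, 40.0000, 42.8571, 30.0000, 30.7692, 35.2941, 34.7826]
period2 = [29.0323, 25.0000, 26.0000, 20.6349, 21.0526, 15.2174, 16.0377]
period3 = [7.3171, 14.3939, 18.5430, 13.4078, 11.8227, 9.6916, 12.8514]
F, p = stats.f_oneway(period1, period2, period3)
print(F, p)    # F close to 32.00, p close to 0.000, as in the table above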
As an exercise, the reader should take a closer look at the complete average percent population growth rate data from 1800-2000 in Table 20.7 (and
the plot in Fig 20.7), and interpret any observable features from the perspective of US History and other contemporary trends.

20.5  Process Optimization

20.5.1  Problem Definition and Background

This final problem involves improving the overall performance of a two-stage commercial coating process.³ The primary material used in the coating process is produced in the first stage, where manufacturing yield (measured in %) is the key response variable of interest. In the second stage, other additives are compounded with the primary material and the coating process completed; the primary response variable is adhesion, measured in grams.
To meet customer specifications and remain profitable requires yields of 91% or greater and adhesion greater than 45 grams. However, the process consistently failed to meet these objectives, and an experimental program was launched with the aim of finding process variable settings at which the specified objectives would be met.

20.5.2  Experimental Strategy and Results

Planning
With the response variables identified as y1, Yield (%), and y2, Adhesion (grams), a thorough consideration of all aspects of the process led to the following list of seven potential factors, i.e., independent process variables that could potentially affect yield and adhesion:
1. Amount of additive
2. Raw material supplier
3. Reactor configuration
4. Reactor level
5. Reactor pressure
6. Reactor temperature
7. Reaction time
As a result, the following overall strategy was devised: first, a set of screening experiments will be performed to determine which of these seven variables are important factors; next, a set of optimization studies will be carried out to determine the best settings for the important factors; and finally, the optimum setting will be verified in a set of confirmation experiments.
These considerations led to the choice of a 2^{7-3}_{IV} fractional factorial design for the screening, followed by a response surface design for optimization.

³ Adapted from an example used in an E.I. duPont de Nemours and Company's Central Research & Development training course; the original source is unknown.


TABLE 20.8: Response surface design and experimental results for coating process

RunOrder   Additive   Time   Temperature   Yield   Adhesion
    1          0       20        100         68        3
    2         70       20        100         51       40
    3         35       40        100         75       31
    4          0       60        100         81       10
    5         70       60        100         65       48
    6         35       20        140         80       38
    7          0       40        140         68       24
    8         35       40        140         82       37
    9         35       40        140         87       41
   10         35       40        140         87       40
   11         35       40        140         82       40
   12         35       40        140         85       42
   13         35       40        140         85       42
   14         70       40        140         75       44
   15         35       60        140         92       41
   16          0       20        180         40       37
   17         70       20        180         75       31
   18         35       40        180         77       44
   19          0       60        180         50       40
   20         70       60        180         90       39

Design and Implementation
The results of the fractional factorial experiments (not shown) identified the following three significant factors, along with appropriate low and high settings:
1. x1: Amount of additive; (0, 70);
2. x2: Reaction time; (20, 60) mins;
3. x3: Reactor temperature; (100, 180) °C.
A three-factor face-centered cube response surface design was therefore used for the optimization experiments. The design (the standard design of 17 runs plus an additional set of center point replicates) and the experimental results are shown in Table 20.8.

20.5.3  Analysis

With the design and data in a MINITAB worksheet, the data analysis is carried out with the sequence: Stat > DOE > Response Surface > Analyze Response Surface Design, which opens a self-explanatory dialog box. Upon selecting the appropriate options, the following results are obtained, first for Yield:
Response Surface Regression: Yield versus Additive, Time, Temperature

The analysis was done using coded units.

Estimated Regression Coefficients for Yield
Term                         Coef  SE Coef        T      P
Constant                  84.5455   0.6964  121.403  0.000
Additive                   4.9000   0.6406    7.649  0.000
Time                       6.4000   0.6406    9.991  0.000
Temperature               -0.8000   0.6406   -1.249  0.240
Additive*Additive        -12.8636   1.2216  -10.530  0.000
Time*Time                  1.6364   1.2216    1.340  0.210
Temperature*Temperature   -8.3636   1.2216   -6.847  0.000
Additive*Time              0.7500   0.7162    1.047  0.320
Additive*Temperature      13.5000   0.7162   18.849  0.000
Time*Temperature          -0.2500   0.7162   -0.349  0.734

S = 2.02574   PRESS = 184.619
R-Sq = 98.92%   R-Sq(pred) = 95.13%   R-Sq(adj) = 97.94%
Estimated Regression Coefficients for Yield using data in uncoded units
Term                            Coef
Constant                     7.87273
Additive                   -0.517792
Time                     -0.00102273
Temperature                  1.11864
Additive*Additive         -0.0105009
Time*Time                 0.00409091
Temperature*Temperature  -0.00522727
Additive*Time             0.00107143
Additive*Temperature      0.00964286
Time*Temperature        -3.12500E-04
(The ANOVA table, not shown, displays the typical breakdown of the sources of variability and establishes that the composite linear, square, and interaction terms are all significant.)
The corresponding results for Adhesion are as follows:
Response Surface Regression: Adhesion versus Additive, Time, Temperature

The analysis was done using coded units.

Estimated Regression Coefficients for Adhesion
Term                         Coef  SE Coef        T      P
Constant                  40.2364   0.5847   68.816  0.000
Additive                   8.8000   0.5378   16.362  0.000
Time                       2.9000   0.5378    5.392  0.000
Temperature                5.9000   0.5378   10.970  0.000
Additive*Additive         -6.0909   1.0256   -5.939  0.000
Time*Time                 -0.5909   1.0256   -0.576  0.577
Temperature*Temperature   -2.5909   1.0256   -2.526  0.030
Additive*Time              0.7500   0.6013    1.247  0.241
Additive*Temperature     -10.2500   0.6013  -17.046  0.000
Time*Temperature          -0.5000   0.6013   -0.831  0.425

S = 1.70080   PRESS = 142.826
R-Sq = 98.81%   R-Sq(pred) = 94.12%   R-Sq(adj) = 97.74%
Estimated Regression Coefficients for Adhesion using data in uncoded units
Term                            Coef
Constant                    -73.0818
Additive                     1.58162
Time                        0.313182
Temperature                 0.882159
Additive*Additive        -0.00497217
Time*Time                -0.00147727
Temperature*Temperature  -0.00161932
Additive*Time             0.00107143
Additive*Temperature     -0.00732143
Time*Temperature        -6.25000E-04
(The ANOVA table is not shown; neither is the diagnostic warning about an unusual observation, Observation 8, with a standardized residual of -2.03. This value is not so high as to cause serious concern.)
Upon eliminating the coefficients with p-values greater than 0.05, we obtain the following response surface models, for Yield, y1,

y1 = 84.55 + 4.90x1 + 6.40x2 - 12.86x1^2 - 8.36x3^2 + 13.5x1x3        (20.33)

and for Adhesion, y2:

y2 = 40.23 + 8.80x1 + 2.90x2 + 5.9x3 - 6.09x1^2 - 2.59x3^2 - 10.25x1x3        (20.34)
in terms of coded units,

x1 = (A - 35)/35        (20.35)
x2 = (τ - 40)/20        (20.36)
x3 = (T - 140)/40       (20.37)

where A = Additive, τ = Time, and T = Temperature, in the original units. In terms of these original uncoded units, the response surface model equations are:

y1 = 7.873 - 0.518A - 0.001τ - 0.011A^2 - 0.005T^2 + 0.010AT        (20.38)

for Yield, and

y2 = -73.082 + 1.582A + 0.313τ + 0.884T - 0.005A^2 - 0.002T^2 - 0.007AT        (20.39)
for Adhesion. A plot of the standardized residuals versus fit, and a normal probability plot of the standardized residuals, are shown for the Yield and Adhesion responses in Figs 20.9 and 20.10, respectively; neither set of plots shows anything that will invalidate the normality assumptions. This model may now be used to optimize the process.
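As a cross-check on the coded-unit analysis, the following hedged Python sketch refits the full quadratic Yield model from the Table 20.8 data; the printed coefficients should be close to the "coded units" MINITAB output above (a similar refit applies to Adhesion).

import numpy as np
import statsmodels.api as sm

additive = np.array([0,70,35,0,70,35,0,35,35,35,35,35,35,70,35,0,70,35,0,70], float)
time_ = np.array([20,20,40,60,60,20,40,40,40,40,40,40,40,40,60,20,20,40,60,60], float)
temp = np.array([100,100,100,100,100,140,140,140,140,140,140,140,140,140,140,
                 180,180,180,180,180], float)
y1 = np.array([68,51,75,81,65,80,68,82,87,87,82,85,85,75,92,40,75,77,50,90], float)

# coded units, Eqs (20.35)-(20.37)
x1, x2, x3 = (additive - 35)/35, (time_ - 40)/20, (temp - 140)/40
X = np.column_stack([np.ones_like(x1), x1, x2, x3,
                     x1**2, x2**2, x3**2, x1*x2, x1*x3, x2*x3])
print(np.round(sm.OLS(y1, X).fit().params, 4))   # cf. 84.5455, 4.9, 6.4, ...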
Optimization
Before engaging in any rigorous analysis, a close examination of these model equations reveals several noteworthy features:
1. The scaled time variable, x2, appears in each model as a single, linear term, with a positive coefficient. This implies that each response is linearly dependent on x2 in a monotonically increasing fashion. As a result, both y1 and y2 are maximized with respect to x2 at the maximum allowable setting.
2. The dependence of y1 on x1 is through the pure linear and quadratic terms, and the bilinear x1x3 interaction term. The dependence of y1 on x3, on the other hand, is through the quadratic term and the bilinear x1x3 term; there is no separate linear dependence on x3.
3. The dependence of y2 on x1 and x3 is through linear and quadratic terms in each variable as well as through the bilinear x1x3 term.
These model characteristics may be visualized graphically in a number of different ways, including the popular surface and contour plots. The surface plot is a 3-dimensional plot of the predicted model response, y, as a function of two of the independent variables. The contour plot, on the other hand, consists of a two-dimensional collection of lines of equal values of y, as a function of two independent variables. (The contour plot is essentially a projection of the surface plot onto the floor of the two-dimensional independent variable plane.)
Such a combination of surface and contour plots for Yield and Adhesion as functions of Additive and Temperature (holding Time at the high value of 60) is shown in Figs 20.11 and 20.12. These figures show visually how each response behaves as a function of the factors; the figures also show where the optimum might lie. Observe that while the yield response shows the existence of a maximum, the adhesion response shows a saddle point.

FIGURE 20.9: Standardized residual plots for the Yield response surface model: versus fitted value, and normal probability plot.

FIGURE 20.10: Standardized residual plots for the Adhesion response surface model: versus fitted value, and normal probability plot.

FIGURE 20.11: Response surface and contour plots for Yield as a function of Additive and Temperature (with Time held at 60.00).

FIGURE 20.12: Response surface and contour plots for Adhesion as a function of Additive and Temperature (with Time held at 60.00).

FIGURE 20.13: Overlaid contours for Yield and Adhesion showing the feasible region for the desired optimum. The planted flag (Additive = 48.3926, Temperature = 150.963, Yield = 93.3601, Adhesion = 45.5183) indicates the optimum values of the responses along with the corresponding settings of the factors Additive and Temperature (with Time held at 60.00) that achieve this optimum.
At this point, several options are available for determining the optimum settings for these factors: the calculus method, by taking derivatives in Eqs (20.33) and (20.34) (after setting x2 to its maximum value of 1) and solving simultaneously for x1 and x3, subject to the constraints in the objectives; or by using various graphically based options available in MINITAB. Since two different responses are involved, to take advantage of the MINITAB options, it is necessary to overlay the two contours for Yield and Adhesion to see the region where the objectives are met simultaneously. The MINITAB contour overlay option, when given the desired values of yield, 91 < y1 < 100, and desired values for adhesion, 45 < y2 < 80 (the value of 80 is simply a high enough upper limit), produces the overlaid contour plots shown in Fig 20.13; it indicates the feasible region as the intersection of the two contours. MINITAB has a built-in response optimizer that can be used to find the optimum; it also has a "plant-the-flag" option that allows the user to roam over the feasible region in the overlaid contour plot with the computer mouse, while the values of the responses at the particular location visited literally scroll by on the screen. This is another option for finding the optimum.
While the reader is encouraged to explore all the other options, what we show here in Fig 20.13 is the MINITAB flag planted at the optimum values found by this exploratory plant-the-flag option. The optimum responses are:

y1 = 93.36%; y2 = 45.52        (20.40)

found for the factor settings:

Additive = 48.39; Time = 60; Temperature = 151.00        (20.41)
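The "calculus method" mentioned above can also be sketched numerically. The following hedged Python fragment maximizes the reduced Yield model of Eq (20.33), with x2 fixed at its maximum of +1, subject to the Adhesion requirement y2 >= 45 from Eq (20.34). This is an alternative to MINITAB's graphical options, not the book's actual procedure; because the reduced models omit a few small terms, the responses it reports differ slightly from the flagged values of Eq (20.40), although the factor settings land near those of Eq (20.41).

import numpy as np
from scipy.optimize import minimize

def y1(v):   # Eq (20.33) with x2 = +1
    x1, x3 = v
    return 84.55 + 4.90*x1 + 6.40 - 12.86*x1**2 - 8.36*x3**2 + 13.5*x1*x3

def y2(v):   # Eq (20.34) with x2 = +1
    x1, x3 = v
    return 40.23 + 8.80*x1 + 2.90 + 5.9*x3 - 6.09*x1**2 - 2.59*x3**2 - 10.25*x1*x3

res = minimize(lambda v: -y1(v), x0=[0.0, 0.0], bounds=[(-1, 1), (-1, 1)],
               constraints=[{'type': 'ineq', 'fun': lambda v: y2(v) - 45.0}])
x1_opt, x3_opt = res.x
print(35 + 35*x1_opt, 140 + 40*x3_opt)   # Additive and Temperature, uncoded units
print(y1(res.x), y2(res.x))              # predicted responses at this optimum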

Confirmation
A final confirmation set of 5 experimental runs, four 2^2 factorial experiments run in a small region around these optimum settings, 46 < Additive < 50; 140 < Temperature < 160 (with Time set at 60 mins), plus one at the prescribed optimum itself, resulted in products all with acceptable yield and adhesion. (Results not shown.)

20.6  Summary and Conclusions

The basic premise of Chapters 12-19 is that whenever variability and uncertainty are intrinsic to a problem, statistics, building on probability, provides a consistent set of tools for handling such problems systematically. However, using simple, tailor-made, textbook examples to demonstrate how various statistical concepts (sampling, estimation, hypothesis testing, regression, experimental design and analysis) are applied is one thing; solving real-life problems with these statistical techniques is another. Real-life problems are never ideal; also, solving them typically requires choosing the appropriate combination of these tools and using them appropriately. This chapter therefore has been devoted to using three classes of real-life problems as a capstone demonstration of statistics in practice. The first category of problems required estimating population parameters from samples, carrying out hypothesis tests about these populations, and using the thus-validated population models to solve non-trivial problems. With the second problem we demonstrated the power of simple regression modeling, and a "forensic" application of hypothesis testing to detect hidden structure in the data; but the problem also served to illustrate that there is significant latitude in carrying out data analysis, since the problem could have been handled several different ways (see Project Assignment 2).
Ironically, the final problem, on the application of design of experiments to industrial process optimization, undoubtedly the most practical of the collection, is the one whose structure is closest to textbook form: a fractional factorial design to identify significant factors, followed by a response surface design to develop a quadratic response model used for optimization, capped off by factorial confirmation experiments. But the sense of how long it would have taken to solve this problem (if at all) without the systematic experimental design strategy discussed, and how much money (and effort) was saved as a consequence, though impossible to convey adequately in the presentation, are well-appreciated by practitioners of the art.
Taken together then, these case studies illustrate the many faces of real-life problems to which statistics has been successfully applied. The project assignments below are offered as a way to broaden the reader's perspective of the themes illustrated in this chapter.
This chapter joins Chapters 7 and 11 to complete this book's trilogy of chapter-length case studies; it also concludes Part IV. The remainder of the book, Part V, is devoted to a hand-selected trio of special topics, each with roots in probability and statistics, but all of which have since evolved into legitimate subject matters in their own rights. These topics are therefore applications of probability and statistics, but in a much grander sense.

PROJECT ASSIGNMENTS

1. Effect of Bayesian Prior Distributions on Estimation
Just how much of an effect does the prior distribution used for recursive Bayesian estimation have on the results? Choose two different gamma pdfs as prior distributions for the unknown Poisson parameter, λ, and repeat the recursive Bayesian estimation portion of the Prussian army data analysis case study, using the data in Table 20.3. For each prior distribution,
• Obtain year-by-year estimates; compare them to the results presented in Table 20.4; plot them as in Fig 20.3 with the maximum likelihood estimate.
• Obtain an explicit expression for the final posterior distribution; plot the prior and the final posterior distributions as in Fig 20.4.
Write a report on your analysis, discussing the effects of the prior distributions you chose on the recursive parameter estimation process.

2. First Principles Population Dynamics Modeling
Consult references on the mathematical modeling of biological populations (e.g., Brauer and Castillo-Chavez (2001)⁴), and use these concepts to develop an alternative model to represent the US Population data in Table 20.6. The following two things are required:
• Use only data up to and including 1970 to develop the model; validate the model by using it to predict the 1980, 1990 and 2000 census results.
• Use your validated model to predict the 2010 census.
Write a report on your model development, validation and model prediction. You have considerable latitude in how to complete this assignment. Be creative.

⁴ Brauer, F. and C. Castillo-Chavez (2001). Mathematical Models in Population Biology and Epidemiology, Springer-Verlag, NY.

3. Experimental Design and Analysis
The length differential between the ring and index fingers on the human hand has been postulated to be a subtle but not yet well-understood distinguishing factor between men and women. You are required to find nM male subjects and nF female subjects, acquire and analyze data on the ring-index fingers' length differential, ΔRI, and confirm or refute this postulate.
Write a report in which you state the project objectives, justify your choice of sample sizes nM and nF, describe your measurement procedure clearly, show your data tables (include the first names of your subjects), and present your analysis of the data and your conclusions.
4. Process Development and Optimization
The objective of this assignment is to develop a paper helicopter that has the maximum "hang time" from a fixed height, and to develop a model that will allow you to predict the hang time based on whatever design features you find to be important. Apply a comprehensive design of experiments strategy.
A schematic diagram of the folded helicopter is shown in Fig 20.14; a beginning "template" that can be photocopied onto blank sheets of paper and then cut and folded to make the various prototypes is shown in Fig 20.15.

FIGURE 20.14: Schematic diagram of folded helicopter prototype. (Design variables include the rotor dimensions Rr and Rw, the body dimensions Bw and Bl, and the tail dimensions Tl and Tw; tape strips and a paper clip complete the assembly.)

• Consider beginning with screening experiments to identify the factors that affect the helicopter's flight time.
• Conduct a series of experiments that will ultimately lead to a mathematical model and an optimal design.
• Predict the maximum flight time and perform experiments to confirm this prediction.
Write a report summarizing your design at the prototype stage, and the analysis leading to the optimum design. Discuss your analysis methods and show the final design, the results of the model predictions, and the confirmation of your predictions.

FIGURE 20.15: Paper helicopter prototype

Part V

Applications

Dealing with Random Variability in Practice

What is laid down, ordered, factual, is never enough to embrace the whole truth; life always spills over the rim of every cup.
Boris Pasternak (1890-1960)

Chapter 21: Reliability and Life Testing
Chapter 22: Quality Assurance and Control
Chapter 23: Introduction to Multivariate Analysis

Chapter 21
Reliability and Life Testing

21.1 Introduction
21.2 System Reliability
    21.2.1 Simple Systems
        Series Systems
        Parallel Systems
        Components and Modules
    21.2.2 Complex Systems
        Series-Parallel Configurations
        k-of-n Parallel System
        Systems with Cross-links
21.3 System Lifetime and Failure-Time Distributions
    21.3.1 Characterizing Time-to-Failure
        The Survival Function, S(t)
        The Hazard Function, h(t)
    21.3.2 Probability Models for Distribution of Failure Times
21.4 The Exponential Reliability Model
    21.4.1 Component Characteristics
    21.4.2 Series Configuration
    21.4.3 Parallel Configuration
    21.4.4 m-of-n Parallel Systems
21.5 The Weibull Reliability Model
21.6 Life Testing
    21.6.1 The Exponential Model
        Estimation
        Precision of Estimates
        Hypothesis Testing
    21.6.2 The Weibull Model
21.7 Summary and Conclusions
REVIEW QUESTIONS
EXERCISES AND APPLICATION PROBLEMS

The shifts of Fortune test the reliability of friends.
Cicero (106-43 BC)

The aphorism "Nothing lasts forever" (or any of its sundry variations) has long served philosophers, ancient and modern, as a concise way to convey the transient and ephemeral nature of the world around us. Specifically for material things, however, the issue is never so much about not lasting forever, which is certain; it is more about how long they will last, which is uncertain. These manufactured and engineered products consist of components with finite functioning lifetimes, and the failure of any of the individual constituent components has repercussions on the ability of the overall system to function. But the failure of a component, or of the overall system, is subject to random variability, so that items manufactured at the same site by the same crew of operators and used essentially in the same manner will fail at different times.
Reliability, the attribute of an individual component or a system which indicates how long it is expected to function as prescribed, will be studied in this chapter. With roots deeply anchored in probability theory and statistics, reliability theory, and life testing, its experimental counterpart, have jointly evolved into an important and extensive field of study. The discussion here is therefore designed to be illustrative rather than exhaustive. Still, enough of the essential material will be presented to provide the reader with an appreciation of the basic principles and practical applications.

21.1  Introduction

Engineered and natural systems (chemical processes, mechanical equipment, electrical devices, even the human body) consist of individual units connected in a logical fashion for achieving specific overall system goals. A basic principle underlying such systems is that how the overall system performs depends on the individual components' performance and how the components are connected. Plant equipment does fail, as do mechanical and electrical systems; automobiles break down; and human beings fall sick and eventually die. While the issue of safety and the consequences of system failure typically dominate most discussions about system performance, of equal importance is the issue of how reliable these various systems and their constituent components are. For how long will the car start every morning? How long can the entire refinery operate before we need to shut it down for maintenance? How long will the new dishwasher last? These are all issues of reliability, a concept deeply influenced by the variability and uncertainty surrounding such questions. The subject matter of reliability therefore relies heavily on probability theory and statistics. The following is one version of a formal definition:

The reliability of a component, or a system (of components), is the probability that it will function properly within prescribed limits for at least a specified period of time, under specified operating conditions.

All the qualifiers italicized in this definition are important. For example, what is satisfactory for laboratory use may be inadequate for the harsh commercial


plant environment; and what is expected of an inexpensive "disposable" pen is different from what is expected of the more expensive brand. Giving all due respect to such qualifiers, we may now give the following mathematical definition:

The reliability of a component or a system, R(t), is the probability that it survives for a specified period of time, t; i.e.,

R(t) = P(T > t)        (21.1)

Component reliability, Ri(t), is the reliability of an individual component, i; while system reliability, Rs(t), is the reliability of the overall system. Clearly, Rs is a function of Ri, and given Ri and how the components are connected to constitute the overall system, one can compute Rs using techniques discussed in the next section.

21.2  System Reliability

As one would expect, the overall reliability of a system is determined by:
1. Ri, the reliability of the constituent components; and,
2. How the components are connected to form the overall system, i.e., the system configuration.
The defining problem of system reliability is therefore easily stated: given Ri and the system configuration, find Rs. The presentation in this section is therefore organized around system configurations, beginning with simple systems and building up to more complex ones. The companion issue of how the individual reliabilities are determined belongs under the topic of statistical life testing, which is discussed in Section 21.6.

21.2.1  Simple Systems

The n components Ci: i = 1, 2, ..., n, of a system can be configured in series, or in parallel, as shown in Fig 21.1, or as a combination series-parallel arrangement, as in Fig 21.2 (shown here for n = 6), with or without cross-linking between the components. A system is said to be simple if it consists of either a straightforward series arrangement, or a straightforward parallel arrangement; otherwise it is said to be complex.

FIGURE 21.1: Simple systems: series and parallel configurations

FIGURE 21.2: A series-parallel arrangement of a 6-component system


Series Systems
Consider the system depicted in the top panel of Fig 21.1, where the components are connected in series and the reliability of component Ci is Ri. If each component operates mutually independently of the others, by which we mean that the performance of one component has no effect on the performance of any other component, then:
1. If any component fails, the entire system fails; consequently,
2. The system reliability is obtained as

Rs = ∏_{i=1}^{n} Ri        (21.2)

Eq (21.2) arises because, by definition, Rs is the probability that the entire system functions, i.e., P(C1 and C2 and ... Cn all function). By independence, therefore,

Rs = P(C1 functions) × P(C2 functions) × ... × P(Cn functions)
   = R1 R2 ... Rn        (21.3)

as required. Eq (21.2) is known as the product law of reliabilities. Now, because reliability is a probability, it follows that 0 < Ri < 1; as a result, an important implication of this product law of reliabilities is that as n increases, Rs decreases. Thus, a system's reliability decreases as the number of components increases. Intuitively, this makes sense: as we increase the number of components that must function simultaneously for the entire system to function, we provide more opportunities for something to go wrong, thereby decreasing the probability that the entire system will function.
Example 21.1: RELIABILITY OF 4-COMPONENT AND 6-COMPONENT SERIES SYSTEMS
If four identical components each with reliability Ri = 0.98 are connected in series, the system reliability is

Rs4 = 0.98^4 = 0.922        (21.4)

If two more identical components are added in series, the system reliability becomes

Rs6 = 0.922 × 0.98^2 = 0.886        (21.5)

The system with the higher number of components in series is therefore seen to be much less reliable.

Note that, as a consequence of the basic laws of the arithmetic operation of multiplication, system reliability for a series configuration is independent of the order in which the components are arranged.
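A small Python sketch of the product law of reliabilities, Eq (21.2), reproducing Example 21.1:

import math

def series_reliability(R):
    """System reliability for independent components in series, Eq (21.2)."""
    return math.prod(R)

print(round(series_reliability([0.98]*4), 3))   # 0.922, Eq (21.4)
print(round(series_reliability([0.98]*6), 3))   # 0.886, Eq (21.5)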


Parallel Systems
Now consider the system depicted in the bottom panel of Fig 21.1 where the components are arranged in parallel, and again, the reliability of component Ci is Ri. Observe that in this case, if one component fails, the entire system does not necessarily fail. In the simplest case, the system fails when all n components fail. In the special "k-of-n" case, the system will function if at least k of the n components function. Let us consider the simpler case first.
In the case when the system fails only if all n components fail, then Rs is the probability that at least one component functions, which is equivalent to 1 - P(no component functions). Now, let Fi be the "unreliability" of component i, the probability that the component does not function; then by definition,

Fi = 1 - Ri        (21.6)

If Fs is the system "unreliability," i.e., the probability that no component in the system functions, then by independence,

Fs = ∏_{i=1}^{n} Fi        (21.7)

and since Rs = 1 - Fs, we obtain

Rs = 1 - ∏_{i=1}^{n} (1 - Ri)        (21.8)

For parallel systems, therefore, we have the product law of "unreliabilities" expressed in Eq (21.7), from which Eq (21.8) follows. Specific cases of Eq (21.8) can be informative, as the next example illustrates.
Example 21.2: RELIABILITY OF 2-COMPONENT PARALLEL SYSTEM
Obtain an explicit expression for the reliability of a system consisting of a parallel arrangement of two components, C1 and C2, with respective reliabilities, R1 and R2. Explain in words what the expression for the system reliability means in terms of the status of each component.
Solution:
From Eq (21.8), we have, for the two-component system,

Rs = 1 - (1 - R1)(1 - R2) = R1 + R2 - R1R2        (21.9)

This expression can be rearranged in one of two equivalent ways:

Rs = R1 + R2(1 - R1) = R1 + R2F1        (21.10)
Rs = R2 + R1(1 - R2) = R2 + R1F2        (21.11)

In words, Eq (21.10) indicates that the entire system functions if (a) C1 functions regardless of the status of component C2 (with probability R1), or (b) C2 functions when C1 fails, with probability R2F1. Eq (21.11) expresses the mirror-image circumstance. Thus, these two equivalent expressions show how, in this parallel arrangement, one component serves as a backup for the other.

Since 0 < Ri < 1, it is also true that 0 < (1 - Ri) < 1. As a result of Eq (21.8), as n increases in a parallel configuration, Rs also increases. Again, this makes sense intuitively: each additional component in the parallel configuration provides an additional level of redundancy, thereby increasing the probability that at least one of the components will continue to function.
Example 21.3: RELIABILITY OF 2-COMPONENT AND 4-COMPONENT PARALLEL SYSTEMS
If two of the identical components of Example 21.1 are now arranged in parallel (instead of in series), first obtain the system reliability for this 2-component parallel configuration. If two more identical components are added in parallel, obtain the system reliability for the resulting 4-component parallel configuration.
Solution:
In this case, the system reliability for the 2-component system is

Rs = 1 - (1 - 0.98)^2 = 1 - (0.02)^2 = 0.9996        (21.12)

When two more components are added in parallel, the system reliability becomes

Rs = 1 - (0.02)^4 = 0.99999984        (21.13)

where it is necessary to retain so many decimal places to see that the system does not quite possess absolutely perfect reliability, but it is very close.
Thus, we see that by adding more components in parallel, the system reliability is improved substantially, a reverse of the case with the series arrangement.

Again, from Eq (21.8) and the laws of multiplication, the order in which the components are arranged in the parallel configuration is immaterial to the value of Rs.
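A companion sketch of the product law of "unreliabilities," Eq (21.8), reproducing Example 21.3:

import math

def parallel_reliability(R):
    """System reliability for independent components in parallel, Eq (21.8)."""
    return 1.0 - math.prod(1.0 - r for r in R)

print(round(parallel_reliability([0.98]*2), 8))   # 0.9996,     Eq (21.12)
print(round(parallel_reliability([0.98]*4), 8))   # 0.99999984, Eq (21.13)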
Components and Modules
If a box drawn around a set of component blocks in the system's representation has a single line going into the box, and a single line coming out of it, the collection of components is called a module. For example, the entire collection of components in the series arrangement in Fig 21.1 constitutes a module. Of course, several smaller modules can be created from this larger one by drawing the module box to contain fewer components. A single component is the smallest (simplest) module. Observe that the entire collection of components in the parallel arrangement in Fig 21.1 also constitutes a module, but smaller modules can also be created from this largest one.
The reliability of a system consisting entirely of modules is obtained by first finding the module reliabilities and then combining these according to how the modules are configured to constitute the complete system. For simple systems, the individual components are all modules, which is why reliability analysis for such systems can proceed in the direct manner discussed above. This is not so for complex systems.

21.2.2  Complex Systems

Complex systems arise from a combination of simple systems. This fact makes it possible to invoke the results and concepts developed earlier in obtaining the reliabilities of complex systems. How these results are employed depends on the nature of the complexity, i.e., the configuration and the requirements, as we now discuss.

Series-Parallel Configurations
Series-parallel systems are complex because they consist of a mixture of simple series and simple parallel systems. The analysis technique therefore consists of first consolidating each parallel ensemble in the system into a single module; this has the net effect of reducing the system to one containing only series modules (since the original components in the series arrangement are all modules) so that the results obtained earlier can then be applied.
Let us use the system in Fig 21.2 to illustrate. We begin by consolidating the parallel subsystem consisting of components C3, C4, and C5 into a single composite module, say C345, with reliability R345. Then, from Eq (21.8),

R345 = 1 - (1 - R3)(1 - R4)(1 - R5)        (21.14)

As a result of this consolidation, the entire system now consists strictly of a series arrangement of three single components, C1, C2 and C6, along with the composite module C345, with the result that the system reliability is:

Rs = R1 R2 R345 R6        (21.15)
   = R1 R2 R6 [1 - (1 - R3)(1 - R4)(1 - R5)]        (21.16)

The following example illustrates the application of these principles.


Example 21.4: RELIABILITY OF SAMPLING-ANALYZER SYSTEM
A system for analyzing the composition of a laboratory-scale distillation column's products consists of a side-stream sampling pump, a solenoid valve and a densitometer. (1) Obtain the overall reliability of the sampling-analyzer system if the components are configured as in Fig 21.3, with the reliability of each component as shown. (2) Because the solenoid valve is the least reliable of the components, it has been suggested to add another identical solenoid valve in parallel for redundancy, resulting in the configuration in Fig 21.4. Obtain the system reliability for this new configuration and compare it to that for the basic configuration.

FIGURE 21.3: Sampling-analyzer system: basic configuration (a series arrangement of sampling pump, R1 = 0.9997; solenoid valve, R2 = 0.9989; and densitometer, R3 = 0.9996)

FIGURE 21.4: Sampling-analyzer system: configuration with redundant solenoid valve (a second solenoid valve, R2 = 0.9989, in parallel with the first)

Solution:
(1) The system reliability for this simple series configuration is obtained as a product of the indicated reliabilities, or

Rs = 0.9997 × 0.9989 × 0.9996 = 0.9982        (21.17)

indicating a 99.82% system reliability, or a 0.18% chance of system failure.
(2) By first consolidating the parallel solenoid valve components, the reliability of the complex system is obtained as:

Rs = 0.9997 × [1 - (1 - 0.9989)^2] × 0.9996 = 0.9993        (21.18)

so that the system reliability is improved from 99.82% to 99.93% by the addition of one more solenoid valve in parallel.
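Combining the two helper functions sketched earlier reproduces Example 21.4's series-parallel analysis:

basic = series_reliability([0.9997, 0.9989, 0.9996])
redundant = series_reliability([0.9997,
                                parallel_reliability([0.9989, 0.9989]),
                                0.9996])
print(round(basic, 4), round(redundant, 4))   # 0.9982, 0.9993 (Eqs 21.17, 21.18)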

k-of-n Parallel System
Instead of the simpler case where the system fails if and only if all of the components fail, we now consider the case when at least k of the n components are required to function for the entire system to function.
If each component Ci has identical reliability Ri = R, then according to the stated requirement, Rs is the probability that at least k out of n components

function. The event that at least k components out of n function is composed of several mutually exclusive events: E_{k|n}, the event that exactly k components out of n function and n - k do not; or E_{k+1|n}, where k + 1 components function and (n - k - 1) do not; ...; or E_{n|n}, where all n function.
Now, because the probability that each Ci functions is constant and equal to R (as a result of component independence), observe that the probability that k components function and n - k do not function is akin to the binomial problem in which one obtains k "successes" in n trials, i.e.,

P(E_{k|n}) = [n!/(k!(n - k)!)] R^k (1 - R)^{n-k}        (21.19)

Thus, the required system reliability is obtained as:

Rs = P(E_{k|n}) + P(E_{k+1|n}) + ... + P(E_{n|n})
   = ∑_{i=k}^{n} [n!/(i!(n - i)!)] R^i (1 - R)^{n-i}        (21.20)

Again, note that because all components Ci are identical, with identical reliabilities, which of the n belongs to the functioning group of k is immaterial.
In the case where
1. the reliabilities might be different, or
2. specific components Ci: i = 1, 2, ..., (k - 1), with corresponding reliabilities, Ri, are required to function, along with at least one of the remaining (n - k + 1) components,
then, for such a k-out-of-n system, the system reliability is:

Rs = [∏_{i=1}^{k-1} Ri] × [1 - ∏_{i=k}^{n} (1 - Ri)]        (21.21)

exactly like (k - 1) components in series connected to (n - k + 1) components in parallel. This is because the stipulation that components Ci; i = 1, 2, ..., (k - 1), function is akin to a series connection, with the remaining (n - k + 1) as redundant components.
Example 21.5: RELIABILITY OF PARALLEL COMPUTING SYSTEM
A high-performance computing system used for classified cryptography studies consists of a bank of 10 computers, 7 of which are high-end workstations with equal reliability Ri = 0.999, configured in parallel; the remaining 3 are backup low-end workstations with equal but lower reliability Ri = 0.9. The performance requirement is that 6 of the high-end computers must function along with at least one of the remaining 4; i.e., the workload can be handled only by 7 high-end computers, or by 6 high-end computers plus no more than one low-end backup. What is the reliability of this 7-of-10 parallel computing system?
Solution:
In this case, from Eq (21.21),

Rs = (0.999)^6 × [1 - (1 - 0.999)(1 - 0.9)^3]
   = 0.9940149        (21.22)

just a hair over 99.4%, the reliability of the module of 6 required high-end workstations. Thus, the combination of the extra high-end workstation plus the 3 back-up low-end ones has the net effect of essentially preserving the reliability of the mandatory module of 6.
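A short sketch of the k-of-n formulas: Eq (21.20) for identical components, and the Example 21.5 computation via Eq (21.21):

from math import comb

def k_of_n_reliability(k, n, R):
    """Eq (21.20): probability that at least k of n identical components function."""
    return sum(comb(n, i) * R**i * (1 - R)**(n - i) for i in range(k, n + 1))

print(k_of_n_reliability(2, 3, 0.9))                # e.g., a 2-of-3 system: 0.972
print(0.999**6 * (1 - (1 - 0.999)*(1 - 0.9)**3))    # Eq (21.22): 0.9940149...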

Systems with Cross-links

FIGURE 21.5: Fluid flow system with a cross-link (Pump 1 = C1 feeding Valve 1 = C2; Pump 2 = C3 feeding Valve 2 = C4; both valves air-to-open)

Consider the system shown in Fig 21.5, a fluid flow system consisting of two pumps and two air-to-open valves with the following failure-mode characteristics: a failed pump is unable to move the fluid through the valve; and the air-to-open valves, which fail shut, are unable to let the fluid through. This system consists of a parallel configuration of two Pump-Valve modules with a cross-link from the lower pump to the upper valve (from component C3 to C2). Were it not for the presence of this cross-link, the series-parallel arrangement would be easily dealt with using results obtained earlier. But as a result of the cross-link, components C3 and C2 are not modules, while all the other components are. This complicates matters a bit, so that analyzing systems with cross-links requires special considerations.
First, we need to introduce some useful notation. Let P(Ci) be the probability that component Ci functions, and let P(C̄i) be the probability that Ci does not function, i.e., the probability that component Ci has failed. Note that, by definition of reliability,

P(Ci) = Ri        (21.23)
P(C̄i) = 1 - Ri        (21.24)

Similarly, P(S), the probability of the entire system functioning, is Rs.


To obtain the system reliability, Rs , for systems with cross-links, we must
choose one of the components as the keystone component, Ck , on which the
analysis is based. We then compute the following conditional probabilities:
P (S|Ck ), the probability that the system functions given that Ck is functioning, and P (S|Ck ), the probability that the system functions given that
Ck has failed. From these partial probabilities, we are able to invoke a result
from Chapter 3, the Theorem of total probability (Eqs (3.47) and (3.53)),
to obtain,
P (S) =
=

P (S|Ck )P (Ck ) + P (S|Ck )P (Ck )


P (S|Ck )P (Ck ) + P (S|Ck )[1 P (Ck )]

(21.25)

Returning to the example in Fig 21.5, let us choose component C3 as the


keystone; i.e.,
Ck = C3
(21.26)
We may now observe the following:
1. Because C1 and C3 are in parallel, if C3 functions, then the system functions if either C2 or C4 functions; the status of C1 is therefore immaterial
in this case. Thus,
P (S|C3 ) = 1 (1 R2 )(1 R4 )

(21.27)

2. If C3 does not function, the system will function only if C1 and C2


function; the status of C4 is immaterial. And since C1 and C2 are in
series,
P (S|C3 ) = R1 R2
(21.28)
And now, from Eq (21.25), we obtain:
P (S) = [1 (1 R2 )(1 R4 )]R3 + R1 R2 (1 R3 )

(21.29)

which simplies to give:


Rs = R1 R2 + R2 R3 + R3 R4 R1 R2 R3 R2 R3 R4

(21.30)

In principle, it does not matter which component is chosen as the keystone; the same result is obtained. In actual fact, however, the resulting analysis is more straightforward if one of the components associated with the cross-link is chosen as the keystone. As an exercise, the reader should select C1 as the keystone and use it to obtain the expression for Rs.
For more complicated systems, one needs as many keystones as there are cross-links.
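Eq (21.30) can be verified by brute-force enumeration of the 2^4 component states of the Fig 21.5 system; in this hedged sketch, the system works if a functioning pump can feed a functioning valve (C1 feeds C2; C3 feeds C4 and, via the cross-link, C2):

from itertools import product

def works(c1, c2, c3, c4):
    return (c1 and c2) or (c3 and c4) or (c3 and c2)

def crosslink_reliability(R1, R2, R3, R4):
    total = 0.0
    for states in product((0, 1), repeat=4):
        p = 1.0
        for s, R in zip(states, (R1, R2, R3, R4)):
            p *= R if s else 1 - R
        if works(*states):
            total += p
    return total

R1, R2, R3, R4 = 0.95, 0.90, 0.92, 0.88    # arbitrary illustrative values
lhs = crosslink_reliability(R1, R2, R3, R4)
rhs = R1*R2 + R2*R3 + R3*R4 - R1*R2*R3 - R2*R3*R4
print(abs(lhs - rhs) < 1e-12)              # True: consistent with Eq (21.30)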

21.3  System Lifetime and Failure-Time Distributions

21.3.1  Characterizing Time-to-Failure

From the definition in Eq (21.1), we know that system or component reliability has to do with the probability of the entity in question remaining in service beyond a given time, t. The so-called "system lifetime" (or "component lifetime") is therefore a random variable, T, having a pdf, f(t), sometimes known as the time-to-failure (or failure-time) distribution.
We now recall the discussion in Section 4.5 of Chapter 4 and note that even though, as a pdf, f(t) can be studied in its own right, there are other even more relevant ways of characterizing the random variable, T, in addition to what f(t) provides. These functions were introduced in Chapter 4, but are re-visited here in their more natural setting.

The Survival Function, S(t)
The survival function was defined in Chapter 4 as

S(t) = P(T > t) = 1 - F(t)        (21.31)

but, as we know, this is actually identical to the reliability function defined earlier in this chapter, in Eq (21.1). The survival function is therefore also known as the reliability function.
The F(t) noted above is the cumulative distribution function, which, by definition, is

F(t) = ∫_0^t f(τ)dτ        (21.32)

But in the specific case where f(t) is a failure-time distribution, this translates to the probability that a component or system fails before T = t, making F(t) the complement of the reliability function (as already implied, of course, by Eq (21.31)). Thus, in lifetime studies, the cdf, F(t), is also the same as the system unreliability, something we had alluded to earlier in dealing with the parallel system configuration (see Eq (21.6)), but which was presented simply as a definition then.
The Hazard Function, h(t)
This function, defined as:

h(t) = f(t)/R(t) = f(t)/[1 - F(t)]        (21.33)

is the instantaneous failure rate, or simply the "failure rate." Recall from Chapter 4 that h(t)dt is the probability of failure in the interval (t, t + dt), given survival until time t, in precisely the same way that f(x)dx is the probability of a continuous random variable, X, taking on values in the interval (x, x + dx).
The relationship between the hazard function and several other functions is of some importance in the study of component and system lifetimes. First, it is related to the reliability function as follows:
From the definition of R(t) as 1 - F(t), taking first derivatives yields

R'(t) = dR(t)/dt = -f(t)        (21.34)

so that, from Eq (21.33),

h(t) = -R'(t)/R(t) = -(d/dt)[ln R(t)]        (21.35)

The solution to this ordinary differential equation is easily obtained as:

R(t) = e^{-∫_0^t h(τ)dτ}        (21.36)

Finally, since, from Eq (21.33), f(t) = h(t)R(t), the relationship between the hazard function and the standard pdf f(t) is

f(t) = h(t) e^{-∫_0^t h(τ)dτ}        (21.37)

The typical hazard function (or equivalently, failure rate) curve is shown in Fig 21.6. This is the so-called "bathtub curve" for representing failure characteristics of many realistic systems, including human mortality. Before discussing the characteristics of this curve, it is important, first, to clear up a popular misconception. The "rate" in failure rate is not with respect to time; rather, it is the proportion (or percentage) of the components surviving until time t that are expected to fail in the infinitesimal interval (t, t + Δt). This rate is comparable to the "rate" in interest rate in finance, which refers not to time, but to the proportion of principal borrowed.
The failure rate curve is characterized by 3 distinct parts:
1. Initial Period: t ≤ t0, characterized by a relatively high failure rate that decreases as a function of time. This is the early failure period where inferior items in the population fail quickly. This is also known as the "infant mortality" period, the analogous characteristic in human populations.
2. Normal Period: t0 ≤ t ≤ t1, characterized by constant failure rate. This is the period of useful life of many products where failure is due to purely random chance, not systematic problems.
3. Final Period: t ≥ t1, characterized by increasing failure rate primarily attributable to wear-out (the human population analog is old-age mortality).

FIGURE 21.6: Typical failure rate (hazard function) curve showing the classic three distinct characteristic periods in the lifetime distributions of a population of items: early ("infant mortality") failure before t0, random chance failure between t0 and t1, and wear-out ("old-age mortality") failure after t1.

In light of such characteristics, manufacturers often improve product reliability by (i) putting their batch of manufactured products through an initial "burn-in" period of pre-release operation until time t0 to weed out the inferior items, and (ii) where possible, by replacing (or at least recommending replacement) at t1 to avoid failure due to wear-out. For example, this is the rationale behind the 90,000-mile timing belt replacement recommendation for some automobiles.

21.3.2  Probability Models for Distribution of Failure Times

One of the primary utilities of the expression in Eq (21.37) is that given any hazard function (failure rate), the corresponding distribution of failure times can be obtained directly. For example, for random chance failures, with constant failure rate, λ, this equation immediately yields:

f(t) = λe^{-λt}        (21.38)

recognizable as the exponential pdf. We may now recall the discussion in Chapter 9 where the waiting time to the first occurrence of a Poisson event was identified as an exponential random variable. If component failure is a Poisson event, then Eq (21.38) is consistent with that earlier discussion.
Thus, for "constant failure rate" lifetime models, the failure-time distribution, f(t), is exponential, with parameter λ as the failure rate. The mean-time-to-failure is then 1/λ. If the component is replaced upon failure with an identical one (with the same constant failure rate, λ), then 1/λ is known as the mean-time-between-failures (MTBF).
For components or systems with the exponential failure-time distribution in Eq (21.38), the reliability function is:

R(t) = 1 - F(t) = e^{-λt}        (21.39)

This reliability function is valid in the "normal period" of the product lifetime, the middle section of the failure rate curve.
During the "initial" and "final" periods of product lifetimes, the failure rates are not constant, decreasing in one and increasing in the other. A more appropriate failure-rate function is

h(t) = ηβ(ηt)^{β−1};  t > 0        (21.40)

a very general failure rate function: when β < 1 it represents a failure rate that decreases with time, the so-called decreasing failure rate (DFR) model; β > 1 represents an increasing failure rate (IFR) model; and when β = 1, the failure rate is constant at η. This expression therefore covers all the three periods of Fig 21.6.

The corresponding pdf, f(t), for this general hazard function is obtained from Eq (21.37) as

f(t) = ηβ(ηt)^{β−1} e^{−(ηt)^β}        (21.41)

recognizable as the Weibull pdf. The reliability function is obtained from Eq (21.33) as:

R(t) = e^{−(ηt)^β}        (21.42)

so that the cdf is therefore:

F(t) = 1 − e^{−(ηt)^β}        (21.43)

21.4 The Exponential Reliability Model

21.4.1 Component Characteristics

If the failure rates of component Cᵢ of a system can be considered constant, with value λᵢ, then:

1. fᵢ(t), the failure-time distribution, is exponential, from which,

2. Rᵢ(t), the component reliability function (as a function of time-in-service, t), is obtained as,

Rᵢ(t) = e^{−λᵢt}        (21.44)


From here, we can compute Rₛ(t), the reliability of a system consisting of n such components, once we are given the system configuration; and from Rₛ(t), we can then compute the system's MTBF. Let us illustrate first what Eq (21.44) means with the following example.
Example 21.6: TIME-DEPENDENCY OF COMPONENT RELIABILITY
Consider a component with failure rate 0.02 per thousand hours; i.e., λᵢ = 0.02/1000. Obtain Rᵢ(1000) and Rᵢ(5000), the probabilities that the component will be in service respectively for at least 1000 hours, and for at least 5000 hours. Also obtain the MTBF.

Solution:
The probability that this component will be in service for at least 1000 hours is obtained as

Rᵢ(1000) = e^{−[(0.02/1000)×1000]} = e^{−0.02} = 0.98        (21.45)

The probability that the same item will remain in service for at least 5000 hours is

Rᵢ(5000) = e^{−[(0.02/1000)×5000]} = e^{−0.1} = 0.905        (21.46)

Thus, if the time-in-service is changed, the reliability will also change; and for components with exponential failure-time distributions, the time-dependency is represented by Eq (21.44). The longer the required time-in-service for these components, the lower the reliability. In the limit as t → ∞, the probability that such components remain in service goes to zero (nothing lasts forever!).

For this component, the mean-time-between-failures is

MTBFᵢ = 1/λᵢ = 1000/0.02 = 5 × 10⁴ hrs        (21.47)

which is constant.
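These numbers are easily reproduced; here is a minimal Python sketch (illustrative only) evaluating Eq (21.44) and the MTBF for the component of Example 21.6:

```python
import numpy as np

lam = 0.02 / 1000.0        # failure rate per hour, from Example 21.6

def R(t):
    # Component reliability for the exponential model, Eq (21.44)
    return np.exp(-lam * t)

print(R(1000))             # about 0.980
print(R(5000))             # about 0.905
print(1.0 / lam)           # MTBF = 50,000 hours, Eq (21.47)
```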

21.4.2 Series Configuration

For a system consisting of n components in series, each with reliability Rᵢ(t) as given above in Eq (21.44), the resulting system reliability is

Rₛ(t) = Π_{i=1}^{n} e^{−λᵢt} = e^{−(Σ_{i=1}^{n} λᵢ)t} = e^{−λₛt}        (21.48)

where

λₛ = Σ_{i=1}^{n} λᵢ        (21.49)


Thus, the failure-time distribution of a series configuration of n components, each having an exponential failure-time distribution, is itself an exponential distribution.

The system's MTBF is obtained as follows: for a component with reliability given in Eq (21.44), the component MTBF, say θᵢ, is:

θᵢ = 1/λᵢ        (21.50)

As such, for the system, with Rₛ as given in Eq (21.48), the MTBF, θₛ, is given by:

θₛ = 1/λₛ        (21.51)

with λₛ as defined in Eq (21.49). Thus,

θₛ = 1/(λ₁ + λ₂ + ⋯ + λₙ) = 1/(1/θ₁ + 1/θ₂ + ⋯ + 1/θₙ)        (21.52)

so that:

1/θₛ = 1/θ₁ + 1/θ₂ + ⋯ + 1/θₙ        (21.53)

i.e., the MTBF of a series configuration of n components, each with individual MTBF of θᵢ, is the harmonic mean of the component MTBFs.

In the special case where all the components are identical (in which case, λᵢ = λ, and therefore θᵢ = θ), then

λₛ = nλ        (21.54)
θₛ = θ/n        (21.55)

i.e., the system failure rate is n times the component failure rate, and the MTBF is 1/n times that of the component.
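As a quick illustration (a minimal sketch with made-up component failure rates, not values from the text), the series-configuration results in Eqs (21.48)-(21.53) can be computed as follows:

```python
import numpy as np

lams = np.array([1e-4, 2e-4, 5e-5])   # hypothetical component failure rates (per hour)

lam_s = lams.sum()                    # system failure rate, Eq (21.49)
theta_s = 1.0 / lam_s                 # system MTBF, Eq (21.51)
Rs_1000 = np.exp(-lam_s * 1000.0)     # system reliability at t = 1000 h, Eq (21.48)

# Eq (21.53): 1/theta_s is the sum of reciprocal component MTBFs
print(np.isclose(1.0 / theta_s, np.sum(lams)))
print(lam_s, theta_s, Rs_1000)
```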

21.4.3 Parallel Configuration

The reliability Rₛ of a system consisting of n components in parallel, each with reliability Rᵢ(t) as given above in Eq (21.44), is given by:

Rₛ(t) = 1 − Π_{i=1}^{n} (1 − e^{−λᵢt})        (21.56)

which is not the reliability function for an exponential failure-time distribution. In the special case where the component failure rates are identical, this expression simplifies to,

Rₛ(t) = 1 − (1 − e^{−λt})ⁿ        (21.57)


In general, the expression for the MTBF is difficult to derive, but it can be shown that it is given by:

θₛ = (1/λ)(1 + 1/2 + ⋯ + 1/n)        (21.58)

and the system failure rate by

1/θₛ = λ / (1 + 1/2 + ⋯ + 1/n)        (21.59)

Some important implications of these results for the parallel system configuration are as follows (illustrated in the sketch following the list):

1. The MTBF for a system of n identical components in parallel is the indicated series sum to n terms multiplied by the individual component MTBF. Keep in mind that this assumes that each defective component is replaced when it fails (otherwise n cannot remain constant).

2. In going from a single component to two in parallel, the MTBF increases by a factor of 50% (from θ to 1.5θ), not 100%.

3. The "law of diminishing returns" is evident in Eq (21.59): as far as MTBF for a parallel system configuration is concerned, the incremental benefit accruing from adding one more component to the system goes to zero as n → ∞.
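The following sketch (θ normalized to 1; purely illustrative) tabulates Eq (21.58) and makes the diminishing returns explicit:

```python
import numpy as np

theta = 1.0                           # individual component MTBF (normalized)
n = np.arange(1, 11)
theta_s = theta * np.cumsum(1.0 / n)  # (1 + 1/2 + ... + 1/n) * theta, Eq (21.58)

for k, ts in zip(n, theta_s):
    print(k, round(float(ts), 3))     # 1.0, 1.5, 1.833, 2.083, ...; gains shrink with n
```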

21.4.4 m-of-n Parallel Systems

When the system status depends on m components all functioning, then it will take multiple sequential failures (n − m = k of such) for the entire system to fail. If each component is independent, and each has an exponential failure-time distribution with common failure rate λ, then it can be shown quite straightforwardly (recall the discussions in Chapter 9) that the system failure-time distribution is the gamma distribution:

fₛ(t) = [λᵏ / Γ(k)] tᵏ⁻¹ e^{−λt}        (21.60)

with k = n − m. This is the waiting time to the occurrence of the k-th Poisson event when these events are occurring at a mean rate λ. (This can also be written explicitly in terms of m and n simply by replacing k with (n − m).) Since this is the pdf of a gamma (k, 1/λ) random variable, whose mean value is therefore k/λ, we immediately obtain the MTBF for this system as:

MTBFₛ = θₛ = k/λ = (n − m)/λ        (21.61)
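Since Eq (21.60) is just a gamma pdf, the m-of-n system is easy to evaluate with scipy (a minimal sketch; the n, m and λ values are made up):

```python
from scipy.stats import gamma

lam = 1e-4                 # hypothetical common component failure rate (per hour)
n, m = 5, 3
k = n - m                  # number of sequential failures that bring the system down

system_life = gamma(a=k, scale=1.0 / lam)   # gamma(k, 1/lam) failure-time model, Eq (21.60)
print(system_life.mean())                   # MTBF = k/lam = (n - m)/lam, Eq (21.61)
print(system_life.sf(10_000.0))             # probability the system survives 10,000 hours
```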

21.5 The Weibull Reliability Model

As noted earlier, the exponential reliability model is only valid for constant failure rates. When failure rates are time dependent, the Weibull model is more appropriate. Unfortunately, the Weibull model, being quite a bit more complicated than the exponential model, does not lend itself as easily to the sort of closed form analysis presented above for the exponential counterpart.

The component reliability is given from Eq (21.42) by:

Rᵢ(t) = e^{−(ηᵢt)^β}        (21.62)

and when the failure rate exponent β is assumed to be the same for n components connected in series, the resulting system reliability is:

Rₛ(t) = Π_{i=1}^{n} e^{−(ηᵢt)^β} = e^{−(Σ_{i=1}^{n} ηᵢ^β)t^β} = e^{−(ηₛt)^β}        (21.63)

where:

ηₛ = (Σ_{i=1}^{n} ηᵢ^β)^{1/β}        (21.64)

Thus, as with the exponential case, the failure-time distribution of a series configuration of n components, each having a Weibull failure-time distribution, W(ηᵢ, β) (i.e., with identical β), is itself another Weibull distribution, W(ηₛ, β). If β is different for each component, the system reliability function is quite a bit more complicated.
From the characteristics of the Weibull random variable, the component MTBF is obtained as:

θᵢ = (1/ηᵢ) Γ(1 + 1/β)        (21.65)

As such, the system MTBF with Rₛ as in Eq (21.63) is:

θₛ = (1/ηₛ) Γ(1 + 1/β) = Γ(1 + 1/β) / (Σ_{i=1}^{n} ηᵢ^β)^{1/β}        (21.66)

again, provided that β is common to all components, because only then is Eq (21.64) valid.
The system reliability function, Rₛ(t), for the parallel configuration is

Rₛ(t) = 1 − Π_{i=1}^{n} [1 − e^{−(ηᵢt)^β}]        (21.67)

and even with ηᵢ = η and a uniform β, the expression is quite complicated, and must be computed numerically.
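Such numerical evaluation is straightforward; here is a minimal sketch (with made-up ηᵢ and β values, using the (η, β) parameterization of Eqs (21.62)-(21.67)):

```python
import numpy as np

def R_parallel_weibull(t, etas, beta):
    # System reliability for n Weibull components in parallel, Eq (21.67)
    Ri = np.exp(-(np.asarray(etas) * t) ** beta)   # component reliabilities, Eq (21.62)
    return 1.0 - np.prod(1.0 - Ri)

etas = [1e-4, 1.5e-4, 2e-4]   # hypothetical scale parameters (per hour)
beta = 1.8                    # hypothetical common shape parameter (IFR, since beta > 1)
print(R_parallel_weibull(5000.0, etas, beta))
```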

21.6 Life Testing

The experimental procedure for determining component and system reliability and lifetimes parallels the procedure for statistical inference discussed in Part IV: it involves selecting an appropriate random sample of components and testing them under prescribed conditions. The relevant data are the times-to-failure observed for the individual components of the system. Such experiments are usually called life tests and the general procedure is known as life testing. There are several different types of life tests, a few of the most common of which are listed below:

1. Replacement tests: where each failing component is replaced by a new one immediately upon failure;

2. Non-replacement tests: where a failing component is not replaced;

3. Truncated tests: where, because the mean lifetime is so long that testing to failure is impractical, uneconomical, or both, the test is stopped (truncated) after (i) a fixed pre-specified time, or (ii) the first r < n failures;

4. Accelerated tests: where high-reliability components are tested under conditions far more severe than normal, in order to accelerate component failure and thereby reduce test time and the total number of components to be tested. The true "natural" reliability is extracted from such accelerated tests via standard analysis tools calibrated for the implied time-compression.

Once more, we caution the reader that the ensuing abbreviated discussion is meant to be merely illustrative, nowhere near the fuller, more comprehensive discussion of the fundamental principles and results that are available in such book-length treatments as that in Nelson, 2003¹.

21.6.1 The Exponential Model

As shown earlier in Section 21.4, the exponential model is the most appropriate lifetime model during the "useful life" period. The main feature of the life tests for this model is that n components are life-tested independently and testing is discontinued after r ≤ n have failed. The experimental result is the set of observed failure times: t₁ ≤ t₂ ≤ t₃ ≤ ⋯ ≤ tᵣ, where tᵢ is the failure time of the i-th component to fail.

Statistical inference in this case involves the usual problems: estimation of the key population parameter, θ = 1/λ, of the exponential failure-time distribution, the mean component lifetime; and hypothesis testing about the parameter estimate, but with a twist.

¹Nelson, W. B. (2003). Applied Life Data Analysis, Wiley, NY.
Estimation

It can be shown that an unbiased estimate for θ is given by:

θ̂ = τᵣ/r        (21.68)

where τᵣ is the accumulated life until the r-th failure, given by:

τᵣ = Σ_{i=1}^{r} tᵢ + (n − r)tᵣ        (21.69)

for non-replacement tests. The first term is the total lifetime of the r failed components; the second term is the lower bound on the remaining lifetime of the surviving (n − r) components. Note that for these non-replacement tests, if r = n, then θ̂ is exactly equal to the mean of the observed failure times. For replacement tests,

τᵣ = ntᵣ        (21.70)

From here, the failure rate is estimated as

λ̂ = 1/θ̂        (21.71)

and the reliability as

R̂(t) = e^{−t/θ̂}        (21.72)

It can be shown that these estimates are biased, but the bias diminishes as
the sample size n increases.
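A minimal sketch of these estimators follows (the failure times and test sizes below are hypothetical, for a non-replacement test stopped at the r-th failure):

```python
import numpy as np

n, r = 10, 6
t = np.array([210.0, 480.0, 690.0, 850.0, 1100.0, 1320.0])  # first r ordered failure times (h)

tau_r = t.sum() + (n - r) * t[-1]    # accumulated life, Eq (21.69)
theta_hat = tau_r / r                # estimate of mean lifetime theta, Eq (21.68)
lam_hat = 1.0 / theta_hat            # estimated failure rate, Eq (21.71)
R_hat = np.exp(-1000.0 / theta_hat)  # estimated reliability at t = 1000 h, Eq (21.72)

print(theta_hat, lam_hat, R_hat)
```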
Precision of Estimates

The statistic

Wᵣ = 2Tᵣ/θ        (21.73)

where Tᵣ is the random variable whose specific value, τᵣ, is given in Eq (21.69) or (21.70), possesses a χ²(2r) distribution. As a result, the usual two-sided (1 − α) × 100% confidence interval for θ may be obtained from the fact that

P[χ²_{1−α/2}(2r) < Wᵣ < χ²_{α/2}(2r)] = 1 − α        (21.74)

following precisely the same arguments as in Chapter 14. The result is that:

2Tᵣ/χ²_{α/2}(2r) < θ < 2Tᵣ/χ²_{1−α/2}(2r)        (21.75)

represents the (1 − α) × 100% confidence interval for θ.


TABLE 21.1: Summary of H₀ rejection conditions for the test of hypothesis based on an exponential model of component failure-time

For general α:

  Testing Against    Reject H₀ if:
  Ha: θ < θ₀         wᵣ < χ²_{1−α}(2r)
  Ha: θ > θ₀         wᵣ > χ²_{α}(2r)
  Ha: θ ≠ θ₀         wᵣ < χ²_{1−α/2}(2r) or wᵣ > χ²_{α/2}(2r)

(Here χ²_p(2r) denotes the value of a χ²(2r) random variable with tail area p to its right.)

Hypothesis Testing

To test the null hypothesis

H₀: θ = θ₀

against the usual triplet of alternatives:

Ha: θ > θ₀
Ha: θ < θ₀
Ha: θ ≠ θ₀

again, we follow the principles discussed in Chapter 15. We use the test statistic Wᵣ defined in Eq (21.73) and its sampling distribution, the χ²(2r) distribution, to obtain the usual rejection criteria, shown for this specific case in Table 21.1, where wᵣ is the specific value of the statistic obtained from experimental data, i.e.,

wᵣ = 2τᵣ/θ₀        (21.76)
Even though all these closed-form results are available, none of these statistical inference exercises is conducted by hand any longer. As with the examples discussed in Chapters 14 and 15, computer programs are routinely used for such data analysis. Nevertheless, we use the next example to illustrate some of the mechanics behind the computations.
Example 21.7: LIFE TESTS FOR ENERGY-SAVING LIGHT BULBS
To characterize the lifetime of a new brand of energy-saving light bulbs, a sample of 10 was tested in a specially designed facility where the bulbs could be left on continuously and monitored electronically to record the precise number of hours until burn out. The experimental design calls for halting the test immediately after 8 of the 10 light bulbs have burned out. The result, in thousands of hours, arranged in increasing order, is as follows:

(1.599, 3.380, 5.068, 8.478, 8.759, 9.256, 11.475, 14.382)

i.e., the first light bulb to burn out did so after 1,599 hours, the next after 3,380 hours, and the 8th and final one after 14,382 hours. Obtain an estimate of the mean lifetime and test the hypothesis that it is 12,000 hours against the alternative that it is not.

Solution:
For this problem,

τᵣ = 62.397 + (10 − 8) × 14.382 = 91.161        (21.77)

so that the estimate for θ is:

θ̂ = 91.161/8 = 11.40        (21.78)

The test statistic, wᵣ, is:

wᵣ = 182.322/12 = 15.194        (21.79)

And from the chi-square distribution with 16 degrees of freedom, the two critical values for this two-sided test with α = 0.05 are 6.91 and 28.8. And now, since 15.194 lies between these two values and hence does not lie in the rejection region, we find no evidence to reject the null hypothesis. We therefore conclude that it seems reasonable to assume that the true mean lifetime of this new brand of light bulbs is 12,000 hours as specified.
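The mechanics of Example 21.7, including the chi-square critical values, are easily checked with scipy (a sketch reproducing the numbers above):

```python
import numpy as np
from scipy.stats import chi2

t = np.array([1.599, 3.380, 5.068, 8.478, 8.759, 9.256, 11.475, 14.382])  # thousands of hours
n, r, theta0 = 10, 8, 12.0

tau_r = t.sum() + (n - r) * t[-1]          # Eq (21.77): 91.161
theta_hat = tau_r / r                      # Eq (21.78): about 11.40
w_r = 2.0 * tau_r / theta0                 # Eq (21.79): about 15.194

lo, hi = chi2.ppf([0.025, 0.975], 2 * r)   # 6.91 and 28.8
print("reject H0" if (w_r < lo or w_r > hi) else "fail to reject H0")
```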

21.6.2 The Weibull Model

When the failure rate function is either decreasing or increasing, as is the case in the initial and final periods of component lifetimes, the Weibull model is more appropriate. The failure-time distribution, reliability function, and failure rate (or hazard function) for this model were given earlier. From the discussions in Chapter 8, we know that the mean of the Weibull pdf, which in this case will correspond to the mean failure time, is given by:

θ = E(T) = (1/η) Γ(1 + 1/β)        (21.80)

Life testing is aimed at acquiring data from which the population parameters, η and β, will be estimated. Unfortunately, unlike with the exponential model, estimating these Weibull parameters can be tedious and difficult, requiring either numerical methods or old-fashioned graphical techniques that are based on many simplifying approximations. Even more so than with the relatively simpler exponential model case, computer software must be employed for carrying out parameter estimation and hypothesis tests for the Weibull model. Additional details lie outside the intended scope of this chapter but are available in the book by Nelson (2003), which is highly recommended to the interested reader.
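As an indication of what such software does, maximum likelihood fitting of the two Weibull parameters is readily available in scipy (a sketch on simulated data, not data from the text; note that scipy's weibull_min uses a shape parameter c, which plays the role of β here, and a scale parameter equal to 1/η in this chapter's notation):

```python
import numpy as np
from math import gamma
from scipy.stats import weibull_min

rng = np.random.default_rng(1)
data = weibull_min.rvs(1.8, scale=5000.0, size=50, random_state=rng)  # simulated lifetimes

beta_hat, loc, scale_hat = weibull_min.fit(data, floc=0)  # MLE with location fixed at 0
eta_hat = 1.0 / scale_hat
mtbf_hat = scale_hat * gamma(1.0 + 1.0 / beta_hat)        # mean failure time, Eq (21.80)
print(beta_hat, eta_hat, mtbf_hat)
```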

21.7 Summary and Conclusions

The exposure to the topic of reliability and life testing provided in this chapter was designed to serve two purposes. First is the general purpose of Part V: to showcase, no matter how briefly, some substantial subject matters that are based entirely on applications of probability and statistics. Second is the specific purpose of illustrating how the reliability and the lifetimes of components and systems are characterized and analyzed. The scope of coverage was deliberately limited, but still with the objective of providing enough material such that the reader can develop a sense of what these studies entail. We presented reliability, for a component or a system, as a probability: the probability that the component or system functions as desired, for at least a specified period of time. The techniques discussed for determining system reliability given component reliabilities and system configuration produced some interesting results, two of which are summarized below:

Product Law of Reliabilities: For systems consisting of n components connected in series, each with reliability, Rᵢ, the system reliability, Rₛ, is a product of the component reliabilities; i.e., Rₛ = Π_{i=1}^{n} Rᵢ. Since 0 < Rᵢ < 1, the reliability of a system of series components therefore diminishes as the number of components increases.

Product Law of Unreliabilities: When the n components of a system are arranged in parallel, the system unreliability, (1 − Rₛ), is a product of the component unreliabilities; i.e., 1 − Rₛ = Π_{i=1}^{n} (1 − Rᵢ). Thus, the reliability of a system of parallel components improves as the number of components increases; the additional components simply act as redundant backups.
Computing the reliabilities of more complex systems requires reducing such systems to a collection of simple modules and, in the case of systems with cross-links, using a "keystone" and invoking Bayes' theorem of total probability. As far as specific models of failure times are concerned, we focused only on the exponential and Weibull models, the two most widely used in practice. In discussing failure time distributions and their characteristics, we were able to revisit some of the special lifetime distributions presented earlier in Chapter 4 (especially the survival function and the hazard function) here in their more "natural habitats."

While reliability analysis depends entirely on probability, not surprisingly, life testing, the experimental determination of component and system reliability characteristics, relies on statistical inference: estimation and hypothesis testing. How these ideas are used in practice is illustrated further with the end-of-chapter exercises and problems.


REVIEW QUESTIONS
1. What is the definition of the reliability of a component or a system?

2. What are the two factors that determine the overall reliability of a system consisting of several components?

3. What is the defining problem of system reliability?

4. In terms of system configuration, what is a "simple" system as opposed to a "complex" system?

5. What is the product law of reliabilities, and to which system configuration does it apply?

6. Why is system reliability for a series configuration independent of the order in which the components are arranged?

7. What is the product law of unreliabilities, and to which system configuration does it apply?

8. As n, the number of components in a series configuration, increases, what happens to Rₛ, system reliability? Does it increase or decrease?

9. As n, the number of components in a parallel configuration, increases, what happens to Rₛ, system reliability? Does it increase or decrease?

10. What is a module?

11. What is the analysis technique for determining the reliability of series-parallel systems?

12. What is a k-of-n parallel system?

13. Why is it more complicated than usual to determine the reliability of systems with cross-links?

14. What special component designation is needed in analyzing the reliability of systems with cross-links?

15. What is the survival function, S(t), and how is it related to the cumulative distribution function, F(t)?

16. In lifetime studies, the cumulative distribution function, F(t), is the same as what system characteristic?

17. What is the hazard function, h(t), and how is it related to the standard pdf, f(t)?


18. Why is the failure rate (hazard function) curve known as the "bathtub" curve?

19. The "rate" in the failure rate is not with respect to time; it is with respect to what?

20. What are the three distinct parts of the hazard function (failure rate) curve?

21. What is the distribution of failure times for random chance failure, with constant failure rate, λ?

22. What is the reliability function for components or systems with exponential failure-time distributions?

23. What is the definition of mean-time-between-failures (MTBF)?

24. What is a decreasing failure rate (DFR) as opposed to an increasing failure rate (IFR) model?

25. What is the reliability function for components or systems with the Weibull failure-time distribution?

26. What is the failure-time distribution for a series configuration of n components each with an exponential failure-time distribution?

27. What is the relationship between the MTBF of a series configuration of n components each with exponential failure-time distributions, and the MTBFs of the components?

28. In what way is the "law of diminishing returns" manifested in the MTBF for a parallel configuration of n identical systems with exponential reliability?

29. What is the MTBF for an m-of-n parallel system with exponential component reliabilities?

30. What is the failure-time distribution for a series configuration of n components each with a Weibull failure-time distribution?

31. What is life testing?

32. What is a replacement test, a non-replacement test, a truncated test, or an accelerated test?

33. In life testing, what is the accumulated life until the r-th failure?

34. What test statistic is used in life testing with the exponential model? What is its sampling distribution?


EXERCISES AND APPLICATION PROBLEMS


21.1 The facility for storing blood in a blood bank located in a remote hospital consists of a primary refrigerator RF₁, a backup, RF₂, and a set of gasoline engine generators: G₁ the primary generator, G₂ the first backup, and G₃ yet another backup. The primary system is RF₁ connected to G₁ with G₂ as backup; the entire primary system is backed up by RF₂ connected to G₃, as shown in Fig 21.7.

[FIGURE 21.7: Blood storage system: RF₁ with generators G₁ and G₂, in parallel with RF₂ with generator G₃]

(i) If the refrigerators are identical, with reliabilities, R = 0.95, and the generators have the following reliabilities: R_G1 = 0.85 = R_G2, and R_G3 = 0.75, compute the overall system reliability.
(ii) It has been argued that the backup generator G₃ also needs a backup (just like G₂ was a backup for G₁), so that the full backup system will replicate the full primary system exactly. If a generator G₄ with the same reliability as that of G₃ is added as prescribed, what is the resulting percentage increase in overall system reliability?
21.2 A single distributed control system (DCS) used to carry out all the control functions in a manufacturing facility has a reliability of 0.99. DCS manufacturers usually recommend purchasing not just a single system, but a complete system with built-in back-up modules that function in standby mode in parallel with the primary module.
(i) If it is desired to have a complete system with a reliability of 0.999, at least how many modules in parallel will be required to achieve this objective?
(ii) If the reliability is to be increased to 0.9999, how many modules will be required?
21.3 The following reliability block diagram (Fig 21.8) is for a heat exchange system employed in a nuclear power generating plant. Three heat exchangers HX₁, HX₂ and HX₃ are each equipped with control valves and ancillary control systems for maintaining temperature control. The reliability, R = 0.9, associated with each heat exchanger is due to the possible failure of the control valves/control system assembly. For simplicity, these components are not shown explicitly in the diagram. The three heat exchangers are supplied with cooling water by three pumps, P₁, P₂ and P₃, each with reliability 0.95. The pump characteristics are such that any one of them is sufficient to service all three heat exchangers; however, the power plant requires 2 of the three heat exchangers to function. The water supply may be assumed to be always reliable.
(i) Determine the system reliability under these conditions.

[FIGURE 21.8: Nuclear power plant heat exchanger system: water supply W feeding pumps P₁, P₂, P₃ in parallel, which feed heat exchangers HX₁, HX₂, HX₃ in a 2-of-3 arrangement]

(ii) If the power plant were redesigned such that only one of the heat exchangers is
required, by how much will the system reliability increase?
21.4 Rₛ, the reliability of the system shown below (Fig 21.9), was obtained in the text using component C₃ as the keystone.

[FIGURE 21.9: Fluid flow system with a cross-link (from Fig 21.5), with components C₁-C₄ comprising Pump 1, Valve 1 (Air-to-Open), Pump 2, and Valve 2 (Air-to-Open)]

(i) Choose C₂ as the keystone and obtain Rₛ again. Compare your result with Eq (21.30) in the text.
(ii) Now choose C₁ as the keystone and repeat (i). Compared with the derivation required in (i), which keystone choice led to a more straightforward analysis?
(iii) Given specific component reliabilities for the system as: R₁ = 0.93; R₂ = 0.99; R₃ = 0.93; R₄ = 0.99, where Rᵢ represents the reliability of component Cᵢ, compare the reliability of the system with and without the cross-link and comment on how the presence of the cross-link affects this specific system's reliability.
21.5 An old-fashioned fire alarm system consists of a detector, D, and an electrically operated bell, B. The system works as follows: if a fire is detected, a circuit is completed and the electrical signal reaching the bell will cause it to ring. The reliability of the detector is 0.9 and that of the bell is 0.995.
(i) What is the reliability of the complete fire alarm system?
(ii) If another identical detector and bell combination is installed in standby, by how much will the reliability of the new augmented fire alarm system improve?
(iii) It has been recommended, as a cost-saving measure, to purchase for the back-up system proposed in (ii) a detector D₂ that is identical to the primary detector, but a less reliable bell B₂ (R = 0.85), and to wire the overall system such that D₂ is connected to the primary bell B via a cross-link, as shown in Fig 21.10. Determine the reliability of the new system with this configuration.

[FIGURE 21.10: Fire alarm system with back up: primary detector D and bell B; backup detector D₂ and backup bell B₂, with D₂ cross-linked to the primary bell B]
21.6 Refer to Problem 21.5, part (iii). Determine the system reliability if the system wiring is adjusted so that, in addition, a second cross-link connects the primary detector D to the less reliable back-up bell, B₂. Does the system reliability increase or decrease?

21.7 The condenser system used to condense volatile organic compounds (VOCs) out of a vapor stream in a petrochemical facility consists of a temperature sensor, S, an electronic controller, C, and a heat exchanger, HX. So long as the temperature of the vapor stream through the condenser is maintained by the control system below the boiling point of the lowest-boiling VOC, there will be no release of VOCs to the atmosphere. The reliability of the sensor is R_S, that of the controller is R_C, and the entire heat exchanger ensemble has a reliability of R_HX. The following configuration is used for the particular system in question: a full-scale back up with cross-links (Fig 21.11).

[FIGURE 21.11: Condenser system for VOCs: two parallel trains, S₁-C₁-HX₁ and S₂-C₂-HX₂, with cross-links]

(i) Determine the system reliability, assuming that similar components have identical reliabilities.
(ii) Given the following component reliabilities, R_S = 0.85; R_C = 0.9; R_HX = 0.95, determine the overall system reliability.


21.8 Pottmann et al. (1996)² presented the following simplified block diagram (Fig 21.12) for the mammalian blood pressure control system. The baroreceptors are themselves systems of pressure sensors, and the sympathetic and parasympathetic systems are separate control systems that are part of the nervous system. These subsystems are not entirely all mutually independent; for the purposes of this problem, however, they can be considered as such.

[FIGURE 21.12: Simplified representation of the control structure in the baroreceptor reflex: Type I and Type II baroreceptors sense the carotid sinus pressure; the sympathetic and parasympathetic systems act through the heart and heart rate on the cardiovascular system]

Consider an experimental rat for which the indicated baroreceptor reflex components have the following reliabilities:
Sensors: Baroreceptor Type I, R_B1 = 0.91; Type II, R_B2 = 0.85;
Controllers: Sympathetic system, R_S = 0.95; Parasympathetic system, R_P = 0.92;
Actuators: Heart, R_H = 0.99; Cardiovascular system, R_C = 0.95.
Rearrange the control system block diagram into a reliability block diagram by starting with the two systems representing the sensors at the left end (and the carotid sinus pressure signal as the input), then connect the sensors in parallel to the appropriate controllers (the pressure setpoints to the controllers may be omitted for the reliability analysis), ultimately ending with the carotid sinus pressure signal from the cardiovascular system.
(i) Determine the system reliability as configured.
(ii) If the system operated only via the Type I receptors-Parasympathetic system connection, determine the system reliability under these conditions.
(iii) Repeat (ii) under the conditions that the system operated only via the alternative Type II receptors-Sympathetic system connection. Which of the isolated systems is more reliable? By how much is the overall system reliability improved as a result of the parallel arrangement?
²Pottmann, M., M. A. Henson, B. A. Ogunnaike, and J. S. Schwaber, (1996). "A parallel control strategy abstracted from the baroreceptor reflex," Chemical Engineering Science, 51(6), 931-945.

21.9 In designing a safety system for a high-pressure ethylene-copolymer manufacturing process, the design engineers had to take the following factors into consideration: First, no safety system, no matter how sophisticated, can be perfect in preventing accidents; the higher the safety system reliability, the lower the risk, but the risk will never be zero. Second, attaining high system reliability is not cheap, whether it is realized with individual components with high reliability, or by multiple redundancies. Last, but not least, even though high reliability is expensive, the repercussions of a single safety catastrophe with this manufacturing process are enormous in financial terms, besides the lingering effects of bad publicity that can take decades to overcome, if ever.

Engineers designing a safety system for a specific plant were therefore faced with a difficult optimization problem: balancing the high cost of a near-perfect system against the enormous financial repercussions of a single catastrophic safety event. But an optimum solution can be obtained as follows.

Let C₀ be the cost of a reference system with "mediocre" reliability of 0.5 (i.e., a system with a 50% probability of failure). For the particular process in question, the total cost of installing a system with reliability Rₛ, say C_R, is given by:

C_R = C₀Rₛ/(1 − Rₛ)        (21.81)

Note that in the limit as Rₛ → 1, C_R → ∞. There is a point of diminishing returns, however, above which the incremental cost does not result in a commensurate increase in reliability; i.e., eking out some additional increase in reliability beyond a certain value is achieved at a disproportionately high cost.

Now let the cost of system failure be C_F, which, by definition, is a substantial sum of money. For a system with reliability Rₛ, the expected cost of failure, C̄_F, is

C̄_F = C_F(1 − Rₛ)        (21.82)

where, of course, (1 − Rₛ) is the probability of system failure. Note that, as expected, this is a monotonically (specifically, linearly) decreasing function of system reliability. We may now observe that the ideal system will be one with a reliability that minimizes the total expected costs, achieving a high enough reliability to reduce the risk of failure, but not so much that the cost of reliability is prohibitive.

(i) Determine such a reliability, R*, by minimizing the objective

φ = C̄_F + C_R = C_F(1 − Rₛ) + C₀Rₛ/(1 − Rₛ)        (21.83)

the total expected costs; show that the desired optimum reliability is given by:

R* = 1 − √(C₀/C_F) = 1 − √ρ        (21.84)

where ρ is the ratio of the base reference cost of reliability to the cost of failure. Discuss why this result makes sense by examining the prescribed R* as a function of the ratio of the two costs, C₀ and C_F.
(ii) For a particular system where C₀ = $20,000 and for which C_F = $500 million, determine R*, and the cost of the safety system whose reliability is R*.
(iii) If the cost of the catastrophe were to double, determine the new value for R*, and the corresponding cost of the recommended safety system.
(iv) If a single composite system with reliability R* determined in (ii) above is unavailable, but only a system with reliability 0.85, how many of the available systems will be required to achieve the desired reliability? How should these be configured?


21.10 For certain electronic components, survival beyond an initial period from t = 0 to t = τ is most crucial because thereafter, the failure rate becomes virtually negligible. For such cases, the hazard function (i.e., the failure rate) may be approximated as follows:

h(t) = λ(1 − t/τ),  0 < t < τ;  h(t) = 0, otherwise        (21.85)

Obtain an expression for f(t), the failure time distribution, and the corresponding cumulative distribution, F(t). From these results show that for such electronic components, the reliability function is given by:

R(t) = e^{−λt(1 − t/2τ)}

during the initial period, 0 < t < τ, and that thereafter (for t > τ), it is:

R(t) = e^{−λτ/2}
21.11 The time to failure, T, of an electronic component is known to be an exponentially distributed random variable with pdf

f(t) = λe^{−λt},  0 < t < ∞;  f(t) = 0, elsewhere

where the failure rate λ = 0.075 per 100 hours of operation.
(i) If the component reliability function Rᵢ(t) is defined as

Rᵢ(t) = P(T > t)        (21.86)

the probability that the component functions at least up until time t, obtain an explicit expression for Rᵢ(t) for this electronic component.
(ii) A system consisting of two of such components in parallel functions if at least one of them functions; again assuming that both components are identical, find the system reliability Rₚ(t) and compute Rₚ(1000), the probability that the system survives at least 1,000 hours of operation.
21.12 The failure times (in hours) for 15 electronic components are given below:

337.0   408.9   290.5   183.4   219.7
174.2   739.2   330.8   900.4   102.2
 36.7   731.4   348.6    73.5    44.7

(i) First confirm that the data are reasonably exponentially distributed and then obtain an estimate of the mean lifetime.
(ii) The company that manufactures the electronic components claims that the mean lifetime is 400 hours. Test this hypothesis against the alternative that the mean lifetime is lower. What is the conclusion of this test?
(iii) Using the estimated mean lifetime to determine the exponential population mean failure rate, λ, compute the probability that a system consisting of two of these components in parallel functions beyond 400 hours.

21.13 Refer to Problem 21.12. This time, consider that the life test was stopped, by design, after 500 hours. Repeat the entire problem and compare the results. How close to the full data results are the results from the truncated test?


Chapter 22
Quality Assurance and Control

22.1 Introduction
22.2 Acceptance Sampling
  22.2.1 Basic Principles
    Basic Characteristics of Sampling Plans
  22.2.2 Determining a Sampling Plan
    The Operating Characteristic (OC) Curve
    Approximation Techniques
    Characteristics of the OC Curve
    Other Considerations
22.3 Process and Quality Control
  22.3.1 Underlying Philosophy
  22.3.2 Statistical Process Control
  22.3.3 Basic Control Charts
    The Shewhart Xbar Chart
    The S-Chart
    Variations to the Xbar-S Chart: Xbar-R, and I & MR Charts
    The P-Chart
    The C-Chart
  22.3.4 Enhancements
    Motivation
    Western Electric Rules
    CUSUM Charts
    EWMA Charts
22.4 Chemical Process Control
  22.4.1 Preliminary Considerations
  22.4.2 Statistical Process Control (SPC) Perspective
  22.4.3 Engineering/Automatic Process Control (APC) Perspective
  22.4.4 SPC or APC
    When SPC is More Appropriate
    When APC is More Appropriate
22.5 Process and Parameter Design
  22.5.1 Basic Principles
  22.5.2 A Theoretical Rationale
22.6 Summary and Conclusions
REVIEW QUESTIONS
PROJECT ASSIGNMENTS
  1. Tracking the Dow
  2. Diabetes and Process Control
  3. C-Chart for Sports Team

"Find out the cause of this effect,
or rather say, the cause of this defect;
for this effect defective comes by cause."
William Shakespeare (1564-1616); Hamlet, II, ii, 101


Mass production, a uniquely 20th century invention, unquestionably transformed industrial productivity by making possible the manufacture of large quantities of products in a relatively short period of time. But making a product faster and making lots of it does not necessarily mean much if the product is not made well. If anything, making the product well every time and all the time became a more challenging endeavor with the advent of mass production. It is only natural, therefore, that assuring the quality of mass produced goods has since become an integral part of any serious manufacturing enterprise. At first, acceptance sampling was introduced by customers to protect themselves from inadvertently receiving products of inferior quality and only discovering these defective items afterwards. Before a manufactured lot is accepted by the consumer, the strategy calls for a sample to be tested first, with the results of the test serving as a rational basis for deciding whether to accept the entire lot or to reject it. Producers later incorporated acceptance sampling into their product release protocols to prevent sending out inferior quality products. However, such an after-the-fact strategy was soon recognized as inefficient and, in the long run, too expensive. The subsequent evolution of quality assurance through process and quality control (where the objective is to identify causes of poor quality and correct them during production) to the total quality management philosophy of "zero defects" (which requires, in addition, the design of processes and process operating parameters to minimize the effect of uncontrollable factors on product quality) was rapid and inevitable.

A complete and thorough treatment of quality assurance and control requires more space than a single chapter can afford. As such, our objective in this chapter is more modestly set at providing an overview of the key concepts underlying three primary modes of quality assurance. We discuss first acceptance sampling, from the consumer's as well as the producer's perspectives; we then discuss in some detail process and quality control, where the focus is on the manufacturing process itself. This discussion covers the usual terrain of statistical process control charts, but adds a brief section on engineering/automatic process control, comparing and contrasting the two philosophies. The final overview of "Taguchi methods" is quite brief, providing only a flavor of the concepts and ideas.

22.1 Introduction

In modern manufacturing processes with mass production capabilities, the primary objective is to make products that meet customer requirements as economically as possible. Ideally, the customer sets a specific desired target value, y*, for a particular measurable indicator of product characteristic, Y, that the manufacturer must match; for example, a "Toughness" of 140 J/m³ required of a certain polymer resin; ball bearings of 2 mm outer diameter; computer laptop batteries with a lifetime of 5 years, etc. Because of inherent and unavoidable variability, however, the customer typically specifies, in addition to the target value, y*, a tolerance limit, say ±δ, so that the product is then said to meet customer requirements when the specific measured product characteristic value, y, lies in the interval y* ± δ; otherwise it does not meet the objective.

Unavoidable variability in raw materials, in process and environmental conditions, in human operators, etc., eventually manifests as variability in the final product characteristics. The primary issue of concern to manufacturers is therefore how to ensure the final product quality given such unavoidable variabilities. Products that do not meet customer requirements are usually rejected, with adverse financial consequences to the producer; in addition, the psychological handicap of being associated with "inferior quality" can often be difficult to overcome.
Quality assurance and control is a subject matter devoted to the techniques and tools employed in modern manufacturing processes to ensure that products are manufactured to specification in spite of unavoidable variabilities. And few applications of probability and statistics have penetrated and transformed an industry to the extent that quality assurance and control has the manufacturing industry. The problems associated with assuring the quality of mass produced products may be categorized as follows:

1. Given a large lot of manufactured items, how do we assure product quality before sending them out (the producer's concern) or before accepting them (the consumer's concern)? This is the "acceptance sampling" problem, by definition, a post-production problem.

2. Given a production process and operating procedure, how do we operate the process to ensure that the resulting manufactured product meets the desired specifications? This is the "process control" problem, by definition, a during-production problem.

3. Can the production process be designed and the operating procedure formulated such that the resulting manufacturing operation is robust to the sort of intrinsic variabilities that propagate and ultimately translate into unacceptable variability in product quality? This is the "process (or parameter) design" problem, a pre-production problem.

Traditional quality assurance focused on Problem 1 then evolved to include Problem 2; more recently, Problem 3 has received greater attention as significant results from successful applications become widely publicized. In the ideal case, if Problem 3 is solved in the pre-production stage, and Problem 2 is handled effectively during production, then there will be fewer, if any, post-production rejections, and Problem 1 becomes a non-issue. This is the prevalent total quality management view of quality assurance. The rest of the chapter is organized around each of these problems in the order presented above.


22.2 Acceptance Sampling

22.2.1 Basic Principles

Acceptance sampling, as the name implies, is a procedure for sampling a batch of products to determine, objectively, whether or not to accept the whole batch on the basis of the information gathered from the sample. It has traditionally been considered as a procedure used by the customer to ascertain the quality of a procured lot before accepting it from the manufacturer; however, it applies equally well to the producer, who checks manufactured lots before sending them out to customers.

The defining characteristic is that for a large lot of items, exhaustive inspection is not only infeasible, it is also impractical and quite often unaffordable. This is especially so when the quality assurance test is destructive. The classic example is the very application that led to the development of the technique in the first place: U.S. military testing of bullets during WW II. Clearly, every bullet cannot be tested (as there will be nothing left for soldiers to use!) but without testing any at all, there is no way to ascertain which lot will perform as desired and which will not. As a result, the decision on lot quality must be based on the results of tests conducted on samples from the lot, making this a problem perfectly tailor-made for the application of statistical inference theory. A typical statement of the acceptance sampling problem is:

Let N be the total number of items in a manufactured lot for which θ is the true but unknown fraction of defective (or non-conforming) items; a sample of n items is drawn from the lot and tested; and x is the actual number of defective (or non-conforming) items found in the sample (a realization of the random variable, X). The lot is accepted if x ≤ c, a predetermined critical value, the "acceptance number"; otherwise the lot is rejected.

For a given lot size, N, the pair (n, c) determines a "sampling plan." Acceptance sampling therefore involves determining a sampling plan, implementing it to determine x, and deciding whether to accept or reject the lot on the basis of how x compares with c. How n, the sample size, and c, the acceptance number, are determined is at the heart of the theory of acceptance sampling. Before discussing how a sampling plan is generated, here are some basic concepts and terminology used in acceptance sampling.
Basic Characteristics of Sampling Plans

1. Product Characterization by Attributes and Variables

The acceptability of a tested item is either based on a discrete (binary or count) criterion or on a continuous one. For example, acceptance is based on a discrete binary criterion for a batch of electronic chips evaluated on the basis of whether the sampled items function or do not function. For a batch of silicon wafers evaluated on the basis of the number of flaws each contains, acceptance is based on a discrete count criterion. These discrete quantities are known as "attributes" and present what is known as an "acceptance sampling by attribute" problem. On the other hand, a batch of polymer resins evaluated on the basis of continuous measurements such as "Toughness" (in J/m³) or "Density" (in kg/m³) presents an "acceptance sampling by variable" problem. The batch is accepted only if the values obtained for these product variables lie in a prescribed range.

2. Acceptance and rejection criteria

A base line acceptance requirement for the fraction (or percent) of defective items in a manufactured lot is known as the Acceptable Quality Level (AQL). This is the value θ₀ such that if the fraction of defectives found in the lot after inspection, x/n, is such that x/n ≤ θ₀, then the lot is definitely acceptable. The complementary quantity is the Rejectable Quality Level (RQL), also known as the Lot Tolerance Percent Defective (LTPD): this is the value, θ₁ > θ₀, representing a high defect level that is considered unacceptable. If the fraction of defectives found in the lot is such that x/n ≥ θ₁, then the lot is definitely unacceptable.

While the lot is definitely acceptable if x/n ≤ θ₀ (i.e., the actual fraction (or percent) of defectives is less than the AQL), and while it is definitely unacceptable if x/n ≥ θ₁ (i.e., the actual fraction of defectives is greater than the RQL), in between, when θ₀ ≤ x/n ≤ θ₁, the lot is said to be barely acceptable, or of "indifferent quality."

3. Types of sampling plans

When the accept/reject decision is to be based on a single sample, the sampling plan is appropriately known as a "single sampling" plan. This is the most commonly used plan, but not always the most efficient. With this plan, if x ≤ c, the lot is accepted; if not, it is rejected. With a "double sampling" plan, a small initial sample, n₁, is taken and the number of defectives in the sample, x₁, determined; the lot is accepted if x₁/n₁ ≤ θ₀ or rejected if x₁/n₁ ≥ θ₁. If the fraction defective is in between, take a second sample of size n₂ and accept or reject the lot on the basis of the two samples. The "multiple sampling" plan is a direct extension of the double sampling plan.

The ideal sampling plan from the consumer's perspective will result in a low probability of accepting a lot for which θ ≥ θ₁. For the producer, on the other hand, the best sampling plan is one that results in a high probability of accepting a lot for which θ ≤ θ₀.

22.2.2 Determining a Sampling Plan

Let the random variable X represent the number of defective items found in a sample of size n, drawn from a population of size N whose true but unknown fraction of defectives is θ. Clearly, from Chapter 8, we know that X is a hypergeometric random variable with pdf:

f(x) = C(Nθ, x) C(N − Nθ, n − x) / C(N, n)        (22.1)

where C(a, b) denotes the binomial coefficient, "a choose b." The total number of defectives in the lot, of course, is

N_D = Nθ;  N_D = 0, 1, 2, . . . , N        (22.2)

At the most basic level, the problem is, in principle, that of estimating θ from sample data, and testing the hypothesis:

H₀: θ ≤ θ₀        (22.3)

against the alternative

Ha: θ > θ₀        (22.4)


As with all hypothesis tests, α is the probability of committing a Type I error, in this case, rejecting an acceptable lot; this is the "producer's risk." On the other hand, β, the probability of committing a Type II error, not rejecting an unacceptable lot, is the "consumer's risk." In practice, sampling plans are designed to balance out these two risks using an approach based on the operating characteristic curve, as we now discuss.
The Operating Characteristic (OC) Curve

Let A be the event that the lot under consideration is accepted. From Eq (22.1) as the probability model for the random variable, X:

P(A) = P(X ≤ c) = Σ_{x=0}^{c} f(x) = Σ_{x=0}^{c} C(Nθ, x) C(N − Nθ, n − x) / C(N, n)        (22.5)

Note that P(A) depends on N, n, θ, and c.


First, let us consider the analysis problem in which N, the lot size, is known and c, the acceptance number, is specified, and we simply wish to compute the probability of accepting the lot for various values of θ. Under these circumstances, it is customary to rewrite P(A) as P(A|θ), so that:

P(A|θ) = Σ_{x=0}^{c} f(x|θ) = Σ_{x=0}^{c} C(Nθ, x) C(N − Nθ, n − x) / C(N, n)        (22.6)

We may now note the following about this expression:

[FIGURE 22.1: OC curve for a lot size of 1000, sample size of 32, and acceptance number of 3, plotting the probability of acceptance, P(A|θ), against the lot proportion defective, θ. AQL is the acceptance quality level (θ₀ ≈ 0.044, where the acceptance probability is 1 − α = 0.95); RQL is the rejection quality level (θ₁ ≈ 0.198, where the acceptance probability is β = 0.1).]

1. Clearly if θ = 0 (no defective item in the lot), the probability of acceptance is 1; i.e.,

P(A|0) = 1        (22.7)

2. As θ increases, P(A|θ) decreases; in particular, if θ = 1,

P(A|1) = 0        (22.8)

3. A plot of P(A|θ) as a function of θ provides information regarding the probability of lot acceptance given the fraction of defectives in the lot.

4. Since θ = N_D/N and N_D = 0, 1, 2, . . . , N, then θ in 0 < θ < 1 can only actually take on values corresponding to these discrete values of N_D; i.e., P(A|θ) is defined only for θ values corresponding to N_D = 0, 1, 2, . . . , N.

Nevertheless, it is customary to connect the valid (P(A|θ), θ) ordered pairs with a smooth curve, to obtain the operating characteristic curve. An example is shown in Fig 22.1 for the case with N = 1000; n = 32; c = 3.
with a smooth curve, to obtain the operating characteristic curve. An example
is shown in Fig 22.1 for the case with N = 1000; n = 32; c = 3.
What we have presented above is the analysis problem, showing how, given
N, n and c, we can obtain the acceptance probability P (A|) as a function of
, and generate the operating characteristic (OC) curve. In actual fact, the
most important use of the OC curve is for generating sampling plans. This is
the design problem stated as follows:
Given N , the lot size, determine from P (A|), feasible values of n
and c that balance the consumers and producers risks.

940

Random Phenomena

From a strictly algebraic perspective, determining n and c requires generating


from Eq (22.6), two equations with these two quantities as the only unknowns,
and solving simultaneously. This is achieved as follows: let some value p0 be
selected as the probability of acceptance for lots with defective fraction 0 ,
and let the probability of acceptance for lots with defective fraction 1 be
selected as p1 . In principle, any arbitrary set of values selected for any of these
4 parameters (p0 , 0 ; p1 , 1 ), should give us two equations in two unknowns
that can be solved for n and c. However, it makes more sense to select these
parameters judiciously to achieve our objectives. Observe that if these values
are selected as follows:
p0
p1

=
=

1 ;

(22.9)
(22.10)

where is the producers risk, and , the consumers risk, and if we retain the
denitions given above for 0 , the AQL, and 1 , the RQL, then the following
two equations:
1 − α = Σ_{x=0}^{c} f(x|θ₀) = Σ_{x=0}^{c} C(Nθ₀, x) C(N − Nθ₀, n − x) / C(N, n)        (22.11)

β = Σ_{x=0}^{c} f(x|θ₁) = Σ_{x=0}^{c} C(Nθ₁, x) C(N − Nθ₁, n − x) / C(N, n)        (22.12)

locate two points on the OC curve such that (i) there is a probability α of rejecting a lot with true defective fraction, θ, that is less than the AQL, θ₀; and (ii) a probability β of accepting (more precisely, not rejecting) a lot with true defective fraction, θ, that is higher than the RQL, θ₁. These two equations therefore allow simultaneous consideration of both risks.
Given N, θ₀ and θ₁, the only unknowns in these equations are n and c; x is an index that runs from 0 to c. The simultaneous solution of the equations produces the sampling plan. In general, there are no closed form analytical solutions to these equations; they must be solved numerically with the computer. If the specified values for θ₀, θ₁, α and β are reasonable, such that a feasible solution of an n, c pair exists, the obtained solution can then be used to generate the OC curve for the specific problem at hand. Otherwise, the specified parameters will have to be adjusted until a feasible solution can be found.

Thus, to generate a sampling plan, one must specify four parameters: (i) θ₀, the acceptable quality level (AQL); (ii) θ₁, the rejectable quality level (RQL); along with (iii) α, the producer's risk, and (iv) β, the consumer's risk. The resulting sampling plan is the pair of values n and c, used as follows: the number of samples to take from the lot and test is prescribed as n; after testing, x, the number of defectives found in the sample, is compared with c; if x ≤ c the lot is accepted; if not, the lot is rejected.


Approximation Techniques
Before presenting an example to illustrate these concepts, we note that the computational effort involved in solving Eqs (22.11) and (22.12) can be reduced significantly by employing well-known approximations to the hypergeometric pdf. First, recall that as N → ∞ the hypergeometric pdf tends to the binomial pdf, so that for large N, Eq (22.6) becomes

$$P(A|\theta) \approx \sum_{x=0}^{c} \binom{n}{x} \theta^x (1-\theta)^{n-x} \qquad (22.13)$$

which is significantly less burdensome to use, especially for large N. This is the binomial approximation OC curve.

When the sample size, n, is large, and hence Eq (22.13) itself becomes tedious, or when the quality assessment is based not on the binary go/no-go attribute of each tested item but on the number of defects per item, the Poisson alternative to Eq (22.13) is used. This produces the Poisson OC curve. Recall that as n → ∞ and θ → 0, but in such a way that nθ = λ, the binomial pdf tends to the Poisson pdf; then under these conditions, Eq (22.13) becomes

$$P(A|\theta) \approx \sum_{x=0}^{c} \frac{(n\theta)^x e^{-n\theta}}{x!} \qquad (22.14)$$

Example 22.1: SAMPLING PLAN AND OC CURVE FOR ELECTRONIC CHIPS
An incoming lot of 1000 electronic chips is to be evaluated for acceptance on the basis of whether the chips are functioning or not. If the lot contains no more than 40 defectives, it is deemed acceptable; if it contains more than 200 defectives, it is not acceptable. Determine a sampling plan to meet these objectives with an α-risk of 0.05 and a β-risk of 0.1. Plot the resulting OC curve.
Solution:
For this problem, we are given the AQL as θ0 = 0.04 and the RQL as θ1 = 0.2, along with the standard α and β risks. The lot size of 1000 is more than large enough to justify using the binomial approximation to determine n and c. The MINITAB Quality Tools feature can be used to solve this problem as follows: the sequence Stat > Quality Tools > Acceptance Sampling by Attributes > opens a dialog box for specifying the problem characteristics. The objective is to Create a Sampling Plan (not Compare user-defined sampling plans); the measurement type is Go / no go (defective) (as opposed to number of defects); the Units for quality levels is Proportion defective (as opposed to percent defective or defectives per million). The remaining boxes are for the quartet of problem parameters: the AQL, RQL, α-risk, β-risk, and, in addition, for the lot size, N. MINITAB also provides options for generating several graphs, of which only the OC curve is selected for this problem. The MINITAB results are:

Acceptance Sampling by Attributes
Measurement type: Go/no go
Lot quality in proportion defective
Lot size: 1000
Use binomial distribution to calculate probability of acceptance

Acceptable Quality Level (AQL)            0.04
Producer's Risk (Alpha)                   0.05
Rejectable Quality Level (RQL or LTPD)    0.2
Consumer's Risk (Beta)                    0.1

Generated Plan(s)
Sample Size          32
Acceptance Number    3

Accept lot if defective items in 32 sampled <= 3; Otherwise reject.


The OC curve is actually the one shown previously in Fig 22.1.
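The numerical solution can also be sketched in a few lines of code. The following illustration (not from the text; it uses the binomial approximation of Eq (22.13) and a simple smallest-n search, and assumes SciPy is available) mirrors the kind of search MINITAB performs, and for the specifications of this example it recovers the same plan, n = 32, c = 3:

    # Sketch: search for the smallest (n, c) satisfying both risk constraints
    # of Eqs (22.11)-(22.12), with the hypergeometric pdf replaced by the
    # binomial approximation of Eq (22.13).
    from scipy.stats import binom

    def sampling_plan(theta0, theta1, alpha, beta, n_max=2000):
        for n in range(1, n_max + 1):
            for c in range(0, n + 1):
                if binom.cdf(c, n, theta0) >= 1 - alpha:   # producer's risk met
                    if binom.cdf(c, n, theta1) <= beta:    # consumer's risk met
                        return n, c
                    break  # larger c only raises P(A|theta1); try a larger n
        return None

    print(sampling_plan(0.04, 0.20, 0.05, 0.10))   # -> (32, 3)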

If the AQL and RQL specifications are changed, the sampling plan will change, as will the OC curve. For example, if in Example 22.1 the AQL is changed to 0.004 (only 4 items in 1000 are allowed to be defective for the lot to be acceptable) and the RQL is changed to 0.02 (if 20 items or more in 1000 are defective, the lot will be unacceptable), then the sampling plan changes to n = 333 while the acceptance number remains at 3; i.e., 333 samples will be selected for inspection and the lot will be accepted only if the number of defectives found in this sample is 3 or fewer. The resulting OC curve is shown in Fig 22.2, where the reader should pay close attention to the scale on the x-axis, compared to that in Fig 22.1.
Characteristics of the OC Curve
Upon some reflection, we see that the shape of the OC curve is actually indicative of the power of the sampling plan to discriminate between good and bad lots. The steeper the OC curve, the better it is at separating good lots from bad; and the larger the sample size, n, the steeper the OC curve. (Compare, for example, Figs 22.1 and 22.2 on the same scale.)

The shape of the ideal OC curve is a perfect narrow rectangle around the value of the AQL, θ0: every lot with θ ≤ θ0 will be accepted and every lot with θ > θ0 rejected. However, this is unrealistic for many reasons, not least of all being that to obtain such a curve will require almost 100% sampling. The reverse S-shaped curve is more common and more practical. Readers familiar with the theory of signal processing, especially filter design, may recognize the similarity between the OC curve and the frequency characteristics of filters: the ideal OC curve corresponds to a notch filter, while the typical OC curves correspond to low-pass, first-order filters with time constants of varying magnitude.
FIGURE 22.2: OC Curve for a lot size of 1000, generated for a sampling plan for an AQL = 0.004 and an RQL = 0.02, leading to a required sample size of 333 and acceptance number of 3. Compare with the OC curve in Fig 22.1.

Finally, we note that the discussion of Power and Sample Size

in Chapter 15 could have been framed in terms of the OC curve; and in fact
many textbooks do so.

Other Considerations

There are other issues associated with acceptance sampling, such as Average Outgoing Quality (AOQ), Average Total Inspection (ATI), the development of acceptance plans for continuous measures of quality, and the concept of the acceptance region; these will not be discussed here, however. It is important for the reader to recognize that, although important from a historical perspective, acceptance sampling is not considered to be very cost-effective as a quality assurance strategy from the perspective of the producer. It does nothing about the process responsible for making the product and has nothing to say about the capability of the process to meet the customer's quality requirements. It is an after-the-fact, post-production strategy that cannot be the primary tool in the toolbox of a manufacturing enterprise that is serious about producing good quality products.

22.3 Process and Quality Control

22.3.1 Underlying Philosophy

For a period of time, industrial quality assurance was limited to acceptance sampling: inspecting finished products and removing defective items. From the perspective of the manufacturer, however, to wait until production is complete and then rely on inspection to eliminate poor quality is not a particularly sound strategy. One cannot "inspect quality into the product."
It is far more efficient, during production, to periodically assess the product quality via quantitative measures, and to analyze the data appropriately to develop a clear picture of the status of both the process and the product. If the product quality is acceptable, then the process is deemed to be "in control" and no action is necessary; otherwise the process is deemed "out-of-control" and corrective action is taken to restore normal operation.
This is the underlying philosophy behind Process Control as a quality assurance strategy. There are two primary issues to be addressed in implementing such a strategy:
1. How should the true value of the process and/or product quality status be determined? Clearly one cannot sample and analyze all the items produced, not with discrete-parts manufacturing and definitely not in the case of continuous production of, say, specialty chemicals. It is typical to establish a product quality control laboratory where samples taken periodically from the manufacturing process are analyzed; inference can then be drawn from such measurements about the process at large.
2. How should corrective action be determined? As we discuss shortly, this is the central issue that differentiates statistical process control (SPC) from engineering/automatic process control.

22.3.2 Statistical Process Control

Statistical Process Control is a popular methodology for implementing the strategy outlined above. It is the application of statistical methods for monitoring process performance over time, enabling the systematic detection of the occurrence of "special cause" events that may require corrective action in order to maintain the process in a state of statistical control.
A process (more precisely, a process variable) is said to be in statistical control when the process variable of interest is statistically stable and on-target. By statistically stable, we mean that the true process characteristics (typically the mean, μ, and standard deviation, σ) are not changing drastically with time; by on-target, we mean that the true process variable mean value, μ, exactly equals the desired target value, μ0 (or the historical, long-term average value).
At the most fundamental level, therefore, statistical process control involves taking a representative process variable whose value, Y, is either a direct measure of the product quality of interest, or at least related to it, and assessing whether the observed value, y, is stable and not significantly different from μ0. Because of the inherent variability associated with sampling, and also with the determination of the measured value itself, this problem requires probability and statistics. In particular, observe that one can pose the question, "Is y significantly different from μ0?", in the form of the following hypothesis test:

H0: μY = μ0
Ha: μY ≠ μ0    (22.15)

a problem we are very familiar with solving, provided an appropriate probability model is available for Y. In this case, we do not reject the null hypothesis, at the significance level of α, if (YL ≤ Y ≤ YU), where the values YL and YU at the rejection boundary are determined from the sampling distribution such that:

P(YL ≤ Y ≤ YU) = 1 − α    (22.16)
This equation is the foundation of an iconic characteristic of SPC; it suggests a convenient graphical technique involving 3 lines:
1. A center line for μ0, the desired target for the random variable, Y;
2. An Upper Control Limit (UCL) line for YU; and
3. A Lower Control Limit (LCL) line for YL,
on which each acquired value of Y is plotted. Observe then that a point falling outside of these limits signals the rejection of the null hypothesis in favor of the alternative, at the significance level of α, indicating an "out-of-control" status. A generic SPC chart of this sort is shown in Fig 22.3, where the sixth data point is out of limits. Points within the control limits are said to show variability attributable to "common cause" effects; "special cause" variability is considered responsible for points falling outside the limits.
In traditional SPC, when an out-of-control situation is detected, the recommended corrective action is to "find and eliminate the problem," the practical implementation of which is obviously process-specific, so that the instruction cannot be more specific than this. But in the discrete-parts manufacturing industry, there is significant cost associated with finding and correcting problems. There is therefore significant incentive to minimize false out-of-control alarms.

FIGURE 22.3: A generic SPC chart for the generic process variable Y indicating a
sixth data point that is out of limits.

Finally, before beginning the discussion of specific charts, we note that the nature of the particular quantity Y of interest in any particular case clearly determines the probability model underlying the chart, which in turn determines how YU and YL are computed. The ensuing discussion of various SPC charts is from this perspective.

22.3.3 Basic Control Charts

Control charts are graphical (visual) means of monitoring process characteristics. They typically consist of two plots: one for monitoring the mean value of the process variable in question, the other for monitoring the variability, although the chart for the mean customarily receives more attention.
These charts are nothing but graphical means of carrying out the hypothesis tests, H0: Process Mean = Target and Process Variability = Constant, versus the alternative, Ha: Process Mean ≠ Target and/or Process Variability ≠ Constant. In practice, these tests are implemented in real-time by adding each new set of process/product data as they become available. Modern implementations involve displays on computer screens that are updated at fixed intervals of time, with alarms sounding whenever an alarm-worthy event occurs.
It is important to stress that the control limits indicated in Fig 22.3 are not specification limits; these control limits strictly arise from the sampling distribution of the process variable, Y, and are indicative of typical variability intrinsic to the process. The control limits enable us to determine if observed variability is in line with what is typical. In the language of the quality movement, these control limits therefore constitute the "voice of the process." Specification limits, on the other hand, have nothing to do with the process; they are specified by the customer, independent of the process, and therefore constitute what is known as the "voice of the customer."


A few of the various charts that exist for various process and product variables and attributes are now discussed.

The Shewhart Xbar Chart
By far the oldest, most popular and most recognizable control chart is the Shewhart chart, named for Walter A. Shewhart (1891-1967), the Bell Labs physicist and engineer credited with pioneering industrial statistical quality control. In its most basic form, it is a chart used to track the sample mean, X̄, of a process or product variable: for example, the mean outer diameter of ball bearings; the mean length of 6-inch nails; the mean liquid volume of 12-ounce cans of soda; the mean Mooney viscosity of several samples of an elastomer, etc. The generic variable Y in this case is X̄, the sample mean of the process measurements.

The data requirement is as follows: a random sample, X1, X2, . . . , Xn, is obtained from the process in question, from which the average, X̄, and standard deviation, SX, are computed. The probability model underlying the Shewhart chart is the Gaussian distribution, justified as follows. There are many instances where the variable of interest, X, is itself approximately normally distributed, in which case X̄ ~ N(μ0, σ²_X̄); but even when X is not normally distributed, for most random variables, N(μ0, σ²_X̄) is a reasonable approximate distribution for X̄, given a large enough sample (as a result of the Central Limit Theorem).

With this sampling distribution for X̄, we are able to compute the following probability:

P(−3σ_X̄ < X̄ − μ0 < 3σ_X̄) = 0.9974    (22.17)

providing the characteristic components of the Shewhart chart: the control limits are 3σ_X̄ to each side of the target value μ0 on the center line; and the confidence level is (1 − α) × 100% = 99.7%. The bounds are therefore commonly known as "3-sigma limits." The α-risk of false out-of-control alarms is thus very low, at 0.003. An example follows.
Example 22.2: X-BAR CONTROL CHART FOR 6-INCH NAILS
Every five minutes, a random sample of 3 six-inch nails is selected from a manufactured lot and measured for conformance to the specification. The data in Table 22.1 is a record of the measurements determined over the first hour of a shift. Obtain an X-bar chart and identify whether or not the manufacturing process is in-control.
Solution:
The points to be plotted are the averages of the three samples corresponding to each sample time; the center line is the target specification of 6 inches. To obtain the control limits, however, observe that we have not been given the process standard deviation. This is obtained from the data set itself, assuming that the process is in-control.

FIGURE 22.4: The X-bar chart for the average length measurements for 6-inch nails determined from samples of three measurements obtained every 5 mins (UCL = 6.1706, LCL = 5.8294).

Computer programs such as MINITAB can be used to obtain the desired X-bar chart. Upon entering the data into a worksheet in MINITAB, the sequence Stat > Control Charts > Variables Charts for Subgroups > X-bar > opens a self-explanatory dialog where the problem characteristics are entered. The result is the chart shown in Fig 22.4. Observe that the entire collection of data, the twelve average values, are all within the control limits, implying that the process appears to be in control.
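As a rough cross-check of the computer output, the 3-sigma limits can be computed by hand from the subgroup data. The sketch below (an illustration, not from the text; it estimates σ by pooling the subgroup variances, whereas MINITAB's default estimator applies an additional small-sample bias correction, so its limits differ slightly in the last decimals) shows the essential arithmetic:

    # Sketch: X-bar chart limits for the Table 22.1 data (subgroups of n = 3).
    import numpy as np

    subgroups = np.array([
        [6.05, 6.10, 6.01], [5.79, 5.92, 6.14], [5.86, 5.90, 5.86],
        [5.88, 5.98, 6.08], [6.01, 6.17, 5.90], [5.86, 6.03, 5.93],
        [6.17, 5.81, 5.95], [6.10, 6.09, 5.95], [6.06, 6.06, 6.07],
        [6.09, 6.07, 6.07], [6.00, 6.10, 6.03], [5.98, 6.01, 5.97],
    ])
    n = subgroups.shape[1]
    sigma_hat = np.sqrt(subgroups.var(axis=1, ddof=1).mean())  # pooled estimate
    target = 6.0                            # center line: target specification
    ucl = target + 3 * sigma_hat / np.sqrt(n)
    lcl = target - 3 * sigma_hat / np.sqrt(n)
    print(subgroups.mean(axis=1))           # the twelve plotted averages
    print("UCL =", ucl, " LCL =", lcl)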

The S-Chart
The objective of the original Shewhart chart is to determine the status of the process with respect to X̄, the mean value of the process/product variable of interest. But this is not the only process/product characteristic of interest. The average, X̄, may remain on target while the variability may have changed. There are cases of practical importance where the variability is the primary variable of interest, especially when we are concerned with detecting if the process variability has changed significantly.

Under these circumstances, the variable of interest, Y, is now SX, the sample standard deviation, determined from the same random sample of size n used to obtain X̄. The probability model is obtained from the fact that, for a sample size of n,

$$S_X = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}} \qquad (22.18)$$


TABLE 22.1: Measured length of samples of 6-inch nails in a manufacturing process

Length (in)  Sample | Length (in)  Sample | Length (in)  Sample
6.05         1      | 6.01         5      | 6.06         9
6.10         1      | 6.17         5      | 6.06         9
6.01         1      | 5.90         5      | 6.07         9
5.79         2      | 5.86         6      | 6.09         10
5.92         2      | 6.03         6      | 6.07         10
6.14         2      | 5.93         6      | 6.07         10
5.86         3      | 6.17         7      | 6.00         11
5.90         3      | 5.81         7      | 6.10         11
5.86         3      | 5.95         7      | 6.03         11
5.88         4      | 6.10         8      | 5.98         12
5.98         4      | 6.09         8      | 6.01         12
6.08         4      | 5.95         8      | 5.97         12

and, from previous discussions in Chapters 14 and 15, we know that

$$\frac{(n-1)S_X^2}{\sigma_X^2} \sim \chi^2(n-1) \qquad (22.19)$$

where σ²_X is the inherent process variance. It can be shown, first, that

$$E(S_X) = c(n)\sigma_X \qquad (22.20)$$

where the sample-size-dependent constant c(n) is given by

$$c(n) = \sqrt{\frac{2}{n-1}}\,\frac{\Gamma(n/2)}{\Gamma\left(\frac{n-1}{2}\right)} \qquad (22.21)$$

so that SX/c(n) is unbiased for σX, providing us with an expression for the center line.

Obtaining the control limits for SX requires a measure of the variability associated with estimating σX with SX/c(n). In this regard, it can be shown that

$$\sigma_{S_X}^2 = \sigma_X^2\left[1 - c^2(n)\right] \qquad (22.22)$$

We may now combine all these expressions to obtain:

$$P\left[-3\sigma_{S_X} < S_X - c(n)\sigma_X < 3\sigma_{S_X}\right] \approx 0.99 \qquad (22.23)$$

so that:

$$P\left[(c(n)\sigma_X - 3\sigma_{S_X}) < S_X < (c(n)\sigma_X + 3\sigma_{S_X})\right] \approx 0.99 \qquad (22.24)$$


from which the control limits for the S-chart are obtained as:

$$UCL = \sigma_X\left[c(n) + 3\sqrt{1 - c^2(n)}\right] \qquad (22.25)$$
$$LCL = \sigma_X\left[c(n) - 3\sqrt{1 - c^2(n)}\right] \qquad (22.26)$$

Thus, when the process is in-control with respect to variability, observed values of SX will lie within the 3-sigma bounds indicated by Eqs (22.25) and (22.26) approximately 99% of the time. Values outside of these bounds signal a special-cause event, at this confidence level.
And now, here are some practical considerations. First, σX is usually not available; it is typical to estimate it from process data. For example, from several samples with standard deviations S1, S2, . . . , Sj, the average,

$$\bar{S} = \frac{\sum_{i=1}^{j} S_i}{j} \qquad (22.27)$$

is used to estimate σX as S̄/c(n), where n (not the same as j) is the total number of data points employed to determine SX. Under these circumstances, Eqs (22.25) and (22.26) become:

$$UCL = \bar{S}\left[1 + 3\sqrt{\frac{1}{c^2(n)} - 1}\right] \qquad (22.28)$$
$$LCL = \bar{S}\left[1 - 3\sqrt{\frac{1}{c^2(n)} - 1}\right] \qquad (22.29)$$

Whenever the computed LCL is negative, it is set to zero, for the obvious reason that the standard deviation is non-negative. Finally, these somewhat intimidating-looking computations are routinely carried out by computer programs. For example, the S-chart for the data used in Example 22.2 is shown here in Fig 22.5. It is obtained from MINITAB using the sequence Stat > Control Charts > Variables Charts for Subgroups > S >; it shows that the process variability is itself reasonably steady.
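The constant c(n) of Eq (22.21) and the limits of Eqs (22.25)-(22.26) are easily computed directly. The sketch below (an illustration, not from the text; σX is assumed known, as in Eqs (22.25)-(22.26)) makes the computation concrete:

    # Sketch: the unbiasing constant c(n) of Eq (22.21) and the S-chart
    # limits of Eqs (22.25)-(22.26).
    from math import gamma, sqrt

    def c_n(n):
        # c(n) = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2)
        return sqrt(2.0 / (n - 1)) * gamma(n / 2.0) / gamma((n - 1) / 2.0)

    def s_chart_limits(sigma_x, n):
        c = c_n(n)
        half_width = 3.0 * sqrt(1.0 - c * c)
        ucl = sigma_x * (c + half_width)                # Eq (22.25)
        lcl = max(0.0, sigma_x * (c - half_width))      # Eq (22.26), floored
        return lcl, ucl

    print(c_n(3))                     # ~0.8862 for subgroups of size 3
    print(s_chart_limits(0.1, 3))     # illustrative sigma_x = 0.1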
It is typical to combine the X-bar and S charts to obtain the Xbar-S chart. This composite chart allows one to confirm that the process is both on-target (indicated by the Xbar component) and stable (indicated by the S component). It is possible for the process to be stable and on-target (the preferred state); stable but not on-target; not stable and not on-target; and, less likely (but not impossible), on-target but not stable. The combination Xbar-S chart allows the determination of which of these four possible states best describes the process.
FIGURE 22.5: The S-chart for the 6-inch nails process data of Example 22.2 (center = 0.0873, UCL = 0.2242, LCL = 0).

Variations to the Xbar-S Chart: Xbar-R, and I & MR Charts
Sometimes the process data sample size is not large enough to provide reasonable estimates of the standard deviation, S. In such cases, the sample

range, R (the difference between the lowest and the highest ranked observations in the sample), is used as a measure of process variability. This gives rise to the R-chart by itself, or the Xbar-R chart when combined with the Xbar chart. The same principles discussed previously apply: the chart is based on a probability model for the random variable, R; its expected value and its theoretical variance are used to obtain the control limits.

In fact, because the data for the process in Example 22.2 involve samples of size n = 3, it is questionable whether this sample size is sufficient for obtaining reliable estimates of the sample standard deviation, σ. When n < 8, it is usually recommended to use the R chart instead.
The combination Xbar-R chart for the Nails data of Example 22.2 is shown here in Fig 22.6. It is obtained from MINITAB using the sequence Stat > Control Charts > Variables Charts for Subgroups > Xbar-R >. The most important points to note are: (i) the chart still indicates that the process variability is reasonably steady; however, (ii) the nominal value for R is almost twice that for S (this is expected, given the definition of the range R and its relation to the standard deviation, S); by the same token, the control limits are also wider (approximately double); nevertheless, (iii) the general characteristics of the R component of the chart are not much different from those of the S-chart obtained earlier and shown in Fig 22.5. Thus, in this case, the S and R variables show virtually the same characteristics.
FIGURE 22.6: The combination Xbar-R chart for the 6-inch nails process data of Example 22.2 (Xbar: center = 6, UCL = 6.1706, LCL = 5.8294; R: center = 0.1667, UCL = 0.4293, LCL = 0).

In many cases, especially common in chemical processes, only individual measurements are available at each sampling time. Under these circumstances, with sample size n = 1, one can definitely plot the individual measurements against the control limits, so that this time, the variable Y is now the actual

process measurement X, not the average; but with no other means available for estimating intrinsic variability, it is customary to use the moving range, defined as:

MR_i = |X_i − X_{i−1}|    (22.30)

the difference between consecutive observations, as a measure of the variability. This combination gives rise to the I and MR chart (Individual and Moving Range). The components of this chart are also determined using the same principles as before: the individual samples are assumed to come from a Gaussian distribution, providing the probability model for the I chart, from which the control limits are obtained, given an estimate of process variability, σ. Upon assuming that individual observations are mutually independent, the expected value and theoretical variance of the moving range are used to obtain the control limits for the MR chart. We shall return shortly to the issue of the independence assumption. For now we note once more that the required computations are easily carried out with computer programs. The following example illustrates the I and MR chart for a polymer process.
Example 22.3: CONTROL CHART FOR ELASTOMER PROCESS
Ogunnaike and Ray (1994)¹ presented, in Chapter 28, hourly lab measurements of Mooney viscosity obtained for a commercial elastomer manufactured in a continuous process. The data set is reproduced here in Table 22.2. If the desired target Mooney viscosity value for this product is 50.0, determine whether or not the process is stable and on target.

¹ B.A. Ogunnaike, and W.H. Ray, (1994). Process Dynamics, Modeling and Control, Oxford, NY.

TABLE 22.2: Hourly Mooney viscosity data

Time Sequence (in hours)    Mooney Viscosity
1                           49.8
2                           50.1
3                           51.1
4                           49.3
5                           49.9
6                           51.1
7                           49.9
8                           49.8
9                           49.7
10                          50.8
11                          50.7
12                          50.5
13                          50.0
14                          50.3
15                          49.8
16                          50.8
17                          48.7
18                          50.4
19                          50.8
20                          49.6
21                          49.9
22                          49.7
23                          49.5
24                          50.5
25                          50.8

FIGURE 22.7: The combination I-MR chart for the Mooney viscosity data (I: center = 50, UCL = 51.928, LCL = 48.072; MR: center = 0.725, UCL = 2.369, LCL = 0).

Solution:
Because we only have individual observations at each sampling time, this calls for an I and MR chart. Such a chart is obtained from MINITAB using the sequence: Stat > Control Charts > Variables Charts for Individuals > I-MR >. The result is shown in Fig 22.7, which indicates that the process is in statistical control.
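The I-MR computations are simple enough to sketch directly. The following illustration (not from the text; the constants d2 = 1.128 and D4 = 3.267 are the standard tabulated values for moving ranges of span 2, which the text does not derive) reproduces the limits shown in Fig 22.7:

    # Sketch: I and MR chart limits for the Table 22.2 Mooney viscosity data.
    import numpy as np

    x = np.array([49.8, 50.1, 51.1, 49.3, 49.9, 51.1, 49.9, 49.8, 49.7, 50.8,
                  50.7, 50.5, 50.0, 50.3, 49.8, 50.8, 48.7, 50.4, 50.8, 49.6,
                  49.9, 49.7, 49.5, 50.5, 50.8])
    mr = np.abs(np.diff(x))              # Eq (22.30): MR_i = |X_i - X_{i-1}|
    sigma_hat = mr.mean() / 1.128        # sigma estimated from the average MR
    target = 50.0
    print("I:  UCL =", target + 3 * sigma_hat,     # ~51.93
          "  LCL =", target - 3 * sigma_hat)       # ~48.07
    print("MR: center =", mr.mean(),               # ~0.725
          "  UCL =", 3.267 * mr.mean())            # ~2.369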

The P-Chart
When the characteristic of interest is the proportion of defective items in a sample, the appropriate chart is known as the P-chart. If X is the random variable representing the number of defective items in a random sample of size n, then we know from Chapter 8 that X possesses a binomial distribution, in this case,

$$f(x) = \binom{n}{x}\theta^x(1-\theta)^{n-x} \qquad (22.31)$$

where θ is the true but unknown population proportion of defectives. The maximum likelihood estimator,

$$P = \frac{X}{n} \qquad (22.32)$$

is unbiased for θ. From the characteristics of the binomial random variable, we know that

E(X) = nθ    (22.33)
σ²_X = nθ(1 − θ)    (22.34)

so that

E(P) = θ    (22.35)
σ²_P = θ(1 − θ)/n    (22.36)

From data consisting of k separate samples of size n each, yielding k actual proportions of defectives, p1, p2, . . . , pk, θ is estimated as p̄, defined as:

$$\bar{p} = \frac{\sum_{i=1}^{k} p_i}{k} \qquad (22.37)$$

with the associated standard deviation,

$$\sigma_p = \sqrt{\frac{\bar{p}(1-\bar{p})}{n}} \qquad (22.38)$$

These results can be used to construct the P-chart to monitor the proportion of defectives in a manufacturing process. The center line is the traditional long-term average p̄, and the 3-sigma control limits are:

$$UCL = \bar{p} + 3\sqrt{\frac{\bar{p}(1-\bar{p})}{n}} \qquad (22.39)$$
$$LCL = \bar{p} - 3\sqrt{\frac{\bar{p}(1-\bar{p})}{n}} \qquad (22.40)$$

Once again, negative values of LCL are set to zero. The following example illustrates the P-chart.
Example 22.4: IDENTIFYING SPECIAL CAUSE IN MECHANICAL PENCIL PRODUCTION
A mechanical pencil manufacturer takes a sample of 10 every shift and tests the lead release mechanism. The pencil is marked defective if the lead release mechanism does not function as prescribed. Table 22.3 contains the results from 10 consecutive shifts during a certain week in the summer; it shows the sample size, the number of defective pencils identified and the proportion defective. Obtain a control chart for the data and assess whether or not the manufacturing process is in control.
Solution:
The P-chart obtained from these data using MINITAB (sequence: Stat > Control Charts > Attributes Charts > P >) is shown in Fig 22.8, where one point, the entry from the 9th shift, falls outside of the UCL. The MINITAB output is as follows:

Test Results for P Chart of Ndefective
TEST 1. One point more than 3.00 standard deviations from center line.
Test Failed at points: 9


TABLE 22.3: Number and proportion of defective mechanical pencils

Shift   Sample Size   Number defective   Proportion defective
1       10            0                  0.0
2       10            0                  0.0
3       10            0                  0.0
4       10            0                  0.0
5       10            2                  0.2
6       10            2                  0.2
7       10            1                  0.1
8       10            0                  0.0
9       10            4                  0.4
10      10            0                  0.0

FIGURE 22.8: P-chart for the data on defective mechanical pencils (center = 0.09, UCL = 0.3615, LCL = 0): note the 9th observation that is outside the UCL.

Upon further review, it was discovered that shift 9 (Friday morning) was when a set of new high-school summer interns were being trained on how to run parts of the manufacturing process; the mistakes made were promptly rectified and the process returned to normal by the end of shift 10.
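The arithmetic behind Fig 22.8 is a direct application of Eqs (22.37)-(22.40), as the following sketch (an illustration, not from the text) shows:

    # Sketch: P-chart center line and limits for the Table 22.3 data.
    import numpy as np

    defectives = np.array([0, 0, 0, 0, 2, 2, 1, 0, 4, 0])   # per shift
    n = 10                                                  # sample size
    p_bar = defectives.sum() / (defectives.size * n)        # Eq (22.37): 0.09
    s_p = np.sqrt(p_bar * (1 - p_bar) / n)                  # Eq (22.38)
    ucl = p_bar + 3 * s_p                                   # Eq (22.39): 0.3615
    lcl = max(0.0, p_bar - 3 * s_p)                         # Eq (22.40), floored
    print(p_bar, ucl, lcl)
    print(np.where(defectives / n > ucl)[0] + 1)            # flags shift 9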

The C-Chart
When the process/product characteristic of interest is the number of defects per item (for example, the number of inclusions on a glass sheet of given area, as introduced in Chapter 1 and revisited several times in ensuing chapters), the appropriate chart is the C-chart. This chart, like the others, is developed on the basis of the appropriate probability model, which in this case is the Poisson model. This is because X, the random variable representing the number of defects per item, is Poisson-distributed, with pdf

$$f(x) = \frac{e^{-\lambda}\lambda^x}{x!} \qquad (22.41)$$

(The tradition is to represent this attribute by C, for "count.") Observe from what we know about the Poisson random variable that

μ_C = λ    (22.42)
σ²_C = λ    (22.43)

Thus, once again, from a random sample, X1, X2, . . . , Xk, which represents the number of defects found on k separate items, one can estimate λ from the sample average:

$$\bar{\lambda} = \frac{\sum_{i=1}^{k} X_i}{k} \qquad (22.44)$$

from where the 3-sigma control limits are obtained as:

$$UCL = \bar{\lambda} + 3\sqrt{\bar{\lambda}} \qquad (22.45)$$
$$LCL = \bar{\lambda} - 3\sqrt{\bar{\lambda}} \qquad (22.46)$$

setting LCL = 0 in place of negative values.
An example C-chart for the inclusions data introduced in Chapter 1 (Table 1.2) is shown in Fig 22.9, obtained from MINITAB using the sequence Stat > Control Charts > Attributes Charts > C >. The lone observation of 5 inclusions (in the 33rd sample) is flagged as out of limit; otherwise, the process seems to be operating in control, with an average number of inclusions of approximately 1, and an upper limit of 4 (see Eq (22.45) above).
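A generic implementation of Eqs (22.44)-(22.46) is equally brief; the following sketch (an illustration, not from the text) computes the C-chart parameters from any set of per-item defect counts:

    # Sketch: C-chart center line and 3-sigma limits, Eqs (22.44)-(22.46).
    import numpy as np

    def c_chart_limits(counts):
        lam = np.mean(counts)                       # Eq (22.44)
        ucl = lam + 3 * np.sqrt(lam)                # Eq (22.45)
        lcl = max(0.0, lam - 3 * np.sqrt(lam))      # Eq (22.46), floored at 0
        return lam, lcl, ucl

    # With the inclusions data of Table 1.2 (mean count ~1.017), this gives
    # UCL ~ 4.04 and LCL = 0, matching the limits shown in Fig 22.9.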
FIGURE 22.9: C-chart for the inclusions data presented in Chapter 1, Table 1.2, and discussed in subsequent chapters (center = 1.017, UCL = 4.042, LCL = 0): note the 33rd observation that is outside the UCL; otherwise, the process appears to be operating in statistical control.

If we recall the discussion in Chapter 15, especially Example 15.15, we note
that this C-chart is nothing but a visual, graphical version of the hypothesis test carried out in that example. We concluded then that the process was on target (at that time, at the 95% confidence level); we reach the same conclusion with this chart, at the 99% confidence level.

22.3.4 Enhancements

Motivation
Basic SPC charts, as originally conceived, needed enhancing for several reasons, the three most important being:
1. Sensitivity to small shifts;
2. Serial correlation;
3. Multivariate data.
It can be truly challenging for standard SPC charts to detect small changes. This is because of the very low α-risk (α = 0.003, compared to the α = 0.05 used for hypothesis tests) chosen to prevent too many false out-of-control alarms. The natural consequence is that the β-risk of failing to identify an out-of-control situation increases. To illustrate, consider the Mooney viscosity data shown in Table 22.2; if a step increase in Mooney viscosity of 0.7 occurs after sample 15 and persists, a sequence plot of both the original data and the shifted data is shown in Fig 22.10, where the shift is clear. However, an I-chart for the shifted data, shown in Fig 22.11, even after specifying the population standard deviation as σ = 0.5 (less than the value of approximately 0.62 used for the original data in Fig 22.7), is unable to detect the shift.

FIGURE 22.10: Time series plot of the original Mooney viscosity data of Fig 22.7 and Table 22.2, and of the shifted version showing a step increase of 0.7 after sample 15.
Techniques employed to improve the sensitivity to small changes include the Western Electric Rules², the CUSUM (Cumulative Sum) chart, and the EWMA (Exponentially Weighted Moving Average) chart, all of which will be discussed shortly.

The issue of serial correlation is often a key characteristic of industrial chemical processes where process dynamics are significant. Classical SPC assumes no serial correlation in process data, and that mean shifts occur only due to infrequent special causes. The most direct way to handle this type of process variability is Engineering/Automatic Process Control. We will also introduce this briefly.

Finally, industrial processes are intrinsically multivariable, so that the data used to track process performance come from several process variables, sometimes numbering in the hundreds. These process measurements are such that if, for example, y1 = {y11, y12, . . . , y1n} represents the sequence of observations for one variable, say reactor temperature, and there are others just like it, yk = {yk1, yk2, . . . , ykn}, sequences from other variables, k = 2, 3, . . . , m (say reactor pressure, agitator amps, catalyst flow rate, etc.), then the sequences yj and yℓ with j ≠ ℓ are often highly correlated. Besides this, it is also impossible to visualize the entire data collection properly with the usual single, individual variable SPC charts. Special multivariate techniques for dealing with this type of process variability will be discussed in Chapter 23.

² Western Electric Company (1956), Statistical Quality Control Handbook (1st Edition), Indianapolis, Indiana.

FIGURE 22.11: I-chart for the shifted Mooney viscosity data (center = 50, UCL = 51.5, LCL = 48.5). Even with σ = 0.5, it is not sensitive enough to detect the step change of 0.7 introduced after sample 15.

Western Electric Rules
The earliest enhancement to the Shewhart chart came in the form of what is known as the Western Electric Rules. With the standard control limits set at ±3σ from the center line, the following is a version of these rules (with the original Shewhart condition as the first rule). A special event is triggered when:
1. One point falls outside the ±3σ limits; or
2. Two of 3 consecutive points fall outside the ±2σ limits; or
3. Four of 5 consecutive points fall outside the ±1σ limits; or
4. Eight consecutive points fall on either side of the center line.
These additional rules derive from event probabilities for random samples drawn from Gaussian distributions and have been known to improve the standard chart's sensitivity to small changes. Almost all statistical software packages include these additional detection rules as user-selected options.
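One possible coding of the version of the rules listed above is sketched below (an illustration, not from the text; x is the data sequence, mu0 the center line, and sigma the process standard deviation, with Rules 2-4 checked on each side of the center line separately):

    # Sketch: flag the indices at which any Western Electric rule triggers.
    import numpy as np

    def western_electric(x, mu0, sigma):
        z = (np.asarray(x, dtype=float) - mu0) / sigma
        alarms = set()
        for i in range(len(z)):
            if abs(z[i]) > 3:                                    # Rule 1
                alarms.add(i)
            if i >= 2 and max(np.sum(z[i-2:i+1] > 2),
                              np.sum(z[i-2:i+1] < -2)) >= 2:     # Rule 2
                alarms.add(i)
            if i >= 4 and max(np.sum(z[i-4:i+1] > 1),
                              np.sum(z[i-4:i+1] < -1)) >= 4:     # Rule 3
                alarms.add(i)
            if i >= 7 and (np.all(z[i-7:i+1] > 0) or
                           np.all(z[i-7:i+1] < 0)):              # Rule 4
                alarms.add(i)
        return sorted(alarms)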
CUSUM Charts
Instead of plotting individual observations Xi, consider a strategy based on Sn, the cumulative sum of deviations from the desired target, defined as:

$$S_n = \sum_{i=1}^{n} (X_i - \mu_0) \qquad (22.47)$$


This quantity has the following distinctive characteristics: (i) random variations around the target manifest as a "random walk," an accumulation of small, zero mean, random errors; on the other hand, (ii) if there is a shift in mean value, no matter how slight, and it persists, this event will eventually translate to a noticeable change in character: an upward trend for a positive shift, or a downward trend for a negative shift. As a result of the persistent accumulation, the slope of these trends will be related to the magnitude of the change.

CUSUM charts, of which there are two types, are based on the probabilistic characterization of the random variable, Sn. The one-sided CUSUM charts are plotted in pairs: an upper CUSUM to detect positive shifts (an increase in the process variable value), and a lower CUSUM to detect negative shifts. The control limits, UCL and LCL, are determined in the usual fashion on the basis of the appropriate sampling distribution (details not considered here). This version is usually preferred because it is easier to construct and to interpret. It is also possible to obtain a single two-sided CUSUM chart. Such a chart uses a so-called "V-mask" instead of the typical 3-sigma control limits. While the intended scope of this discussion does not extend beyond this brief overview, additional details regarding the CUSUM chart are available, for example, in Page (1961)³ and Lucas (1976)⁴.
Fig 22.12 shows the two one-sided CUSUM charts corresponding directly to the I-chart of Fig 22.11, with the standard deviation specified as 0.5 and the target as 50. (The chart is obtained from MINITAB with the sequence: Stat > Control Charts > Time-Weighted Charts > CUSUM >.) The upper CUSUM for detecting positive shifts is represented with dots; the lower CUSUM with diamonds, and the non-conforming data with squares. Note that very little activity is manifested in the lower CUSUM. This is in contrast to the upper CUSUM, where the influence of the introduced step change is identified after sample 18, barely three samples after its introduction. Where the I-chart based on individual observations is insensitive to such small changes, the amplification effect of the error accumulation implied in Eq (22.47) has made this early detection possible.

For the sake of comparison, Fig 22.13 shows the corresponding one-sided CUSUM charts for the original Mooney viscosity data, using the same characteristics as the CUSUM charts in Fig 22.12; no point is identified as non-conforming, consistent with the earlier analysis of the original data.
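In computations, the one-sided CUSUMs are usually implemented in the so-called tabular form, with a reference value k and a decision interval h (commonly k = 0.5σ and h = 4σ, which for σ = 0.5 gives the limits of ±2 seen in Figs 22.12 and 22.13). The following sketch (an illustration, not from the text, which does not derive this form) shows the recursion:

    # Sketch: tabular one-sided CUSUMs with reference value k and
    # decision interval h.
    def cusum(x, target, k, h):
        s_hi = s_lo = 0.0
        alarms = []
        for i, xi in enumerate(x):
            s_hi = max(0.0, s_hi + (xi - target) - k)   # upper CUSUM
            s_lo = min(0.0, s_lo + (xi - target) + k)   # lower CUSUM
            if s_hi > h or s_lo < -h:
                alarms.append(i)
        return alarms

    # e.g., cusum(shifted_mooney, target=50.0, k=0.25, h=2.0)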
EWMA Charts
Rather than plot the individual observation Xi, or the cumulative sum shown in Eq (22.47), consider instead the following variable, Zi, defined by:

Zi = wXi + (1 − w)Zi−1;   0 ≤ w ≤ 1    (22.48)

³ E.S. Page (1961). "Cumulative Sum Charts," Technometrics, 3, 1-9.
⁴ J.M. Lucas (1976). "The Design and Use of V-Mask Control Schemes," Journal of Quality Technology, 8, 1-12.


FIGURE 22.12: Two one-sided CUSUM charts for the shifted Mooney viscosity data (UCL = 2, LCL = −2). The upper chart uses dots; the lower chart uses diamonds; the non-conforming points are represented with the squares. With the same σ = 0.5, the step change of 0.7 introduced after sample 15 is identified after sample 18. Compare with the I-chart in Fig 22.11.

FIGURE 22.13: Two one-sided CUSUM charts for the original Mooney viscosity data using the same characteristics as those in Fig 22.12. The upper chart uses dots; the lower chart uses diamonds; there are no non-conforming points.


a filtered value of Xi. By recursive substitution, we obtain Zi as

Zi = wXi + w(1−w)Xi−1 + w(1−w)²Xi−2 + · · · + w(1−w)^(i−1)X1 + w(1−w)^i X0    (22.49)

an exponentially weighted moving average of the past values of X. Therefore Zi is simply a smoother version of the original data sequence.
Charts based on Zi are known as exponentially-weighted-moving-average (EWMA) charts because of Eq (22.49). The premise is that by choosing the weight w appropriately, small shifts in X can be detected fairly rapidly in the resulting Z sequence. The performance of EWMA charts therefore depends on the value chosen for the design parameter, w; and for certain choices, these charts are related to the Shewhart and CUSUM charts:
1. When w = 1, the EWMA chart is identical to the basic Shewhart chart;
2. For any other value, the EWMA chart provides a compromise between the Shewhart chart with no memory of past data, and the CUSUM chart with infinite memory (the entire data history being carried along in the cumulative sum). The EWMA employs w(1 − w)^(i−k) as a "forgetting factor" which determines by how much Xk influences Zi, i > k, in such a way that data farther from the current time i exert less influence on Zi than more current data.
3. The smaller the value of w, the greater the influence of historical data, and the further away from the basic Shewhart chart, and the closer to the CUSUM chart, the EWMA chart becomes.
4. It can be shown that, specifically for w = 0.4, the EWMA closely approximates the Shewhart chart in combination with the Western Electric rules.
As with other charts, the characteristics of the EWMA chart, especially the control limits, are determined from the sampling distribution of the random variable, Zi. For additional details, the reader is referred to Lucas and Saccucci (1990)⁵.

For purposes of illustration, Fig 22.14 shows the EWMA chart for the shifted Mooney viscosity data, corresponding directly to the I-chart of Fig 22.11. The value chosen for the design parameter is w = 0.2; the standard deviation is specified as 0.5, and the target as 50 (as in the I-chart and the CUSUM chart). The EWMA chart is obtained from MINITAB with the sequence: Stat > Control Charts > Time-Weighted Charts > EWMA >. Note the distinctive staircase shape of the control limits: tighter at the beginning, becoming wider as more data become incorporated into the exponentially weighted moving average. As with the CUSUM chart, the shift is detected after sample 18. For comparison, the corresponding EWMA chart for the original Mooney viscosity data is shown in Fig 22.15; as expected, no non-conforming points are identified.

⁵ Lucas, J.M., and M.S. Saccucci (1990). "Exponentially Weighted Moving Average Control Schemes: Properties and Enhancements," Technometrics, 32, 1-29.

FIGURE 22.14: EWMA chart for the shifted Mooney viscosity data, with w = 0.2 (UCL = 50.500, LCL = 49.500). Note the staircase shape of the control limits for the earlier data points. With the same σ = 0.5, the step change of 0.7 introduced after sample 15 is detected after sample 18. The non-conforming points are represented with the squares. Compare with the I-chart in Fig 22.11 and the CUSUM charts in Fig 22.12.
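The EWMA recursion and its time-varying limits can likewise be sketched compactly. The illustration below (not from the text; it uses the standard result Var(Zi) = σ²(w/(2−w))(1−(1−w)^(2i)), which the text does not derive, and initializes Z0 at the target) reproduces the staircase limits of Figs 22.14 and 22.15, whose asymptotic half-width for w = 0.2 and σ = 0.5 is 3(0.5)√(0.2/1.8) = 0.5:

    # Sketch: EWMA statistic of Eq (22.48) with 3-sigma staircase limits.
    import numpy as np

    def ewma_chart(x, target, sigma, w=0.2):
        z_vals, ucl, lcl = [], [], []
        z = target                                   # Z_0 = target
        for i, xi in enumerate(x, start=1):
            z = w * xi + (1 - w) * z                 # Eq (22.48)
            half = 3 * sigma * np.sqrt((w / (2 - w)) * (1 - (1 - w) ** (2 * i)))
            z_vals.append(z)
            ucl.append(target + half)
            lcl.append(target - half)
        return np.array(z_vals), np.array(ucl), np.array(lcl)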

22.4 Chemical Process Control

22.4.1 Preliminary Considerations

For chemical processes, where what is to be controlled are process variables such as temperature, pressure, flow rate, and liquid level, in addition to product characteristics such as polymer viscosity, co-polymer composition, mole fraction of light material in an overhead distillation product, etc., the concept of process control takes on a somewhat different meaning. Let us begin by representing the variable to be controlled as:

y(k) = η(k) + e(k)    (22.50)

where η(k) is the true but unknown value of the process variable; e(k) is noise, consisting of measurement error, usually random, and other unpredictable components; and k = 1, 2, 3, . . . is a sampling time index.

The objective in chemical process control is to maintain the process variable as close as possible to its desired target value yd(k), in the face of possible
FIGURE 22.15: The EWMA chart for the original Mooney viscosity data using the same characteristics as in Fig 22.14 (UCL = 50.500, LCL = 49.500). There are no non-conforming points.
systematic variations in the true value η(k), and inherent random variations in the observed measurements y(k). There are two fundamentally different perspectives of this problem, each leading naturally to a different approach philosophy:
1. The Statistical Process Control (SPC) perspective; and
2. The Engineering (or Automatic) Process Control (APC) perspective.

22.4.2 Statistical Process Control (SPC) Perspective

The SPC techniques discussed previously take the following view of the control problem: η(k) tends to be constant, on target, with infrequent abrupt shifts in value; the shifts are attributable to special causes to be identified and eliminated; the noise term, e(k), tends to behave like independently and identically distributed zero mean, Gaussian random variables. Finally, taking control action is costly and should be done only when there is sufficient evidence in support of this need for action.

Taken to its logical conclusion, the implication of such a perspective is the following approach philosophy: observe each y(k) and analyze for infrequent shifts; take action only if a shift is detected with a pre-specified degree of confidence. It is therefore not surprising that the tools of SPC are embodied in the charts presented above: Shewhart, CUSUM, EWMA, etc. But it must be remembered that the original applications were in the discrete-parts manufacturing industry, where the assumptions regarding the problem components as viewed from the SPC perspective are more likely to be valid.

22.4.3 Engineering/Automatic Process Control (APC) Perspective

The alternative APC perspective is that, left unattended, η(k) will wander and not remain on-target because of frequent, persistent, unavoidable and unmeasured/unmeasurable external disturbances that often arise from unknown sources; and that, regardless of the underlying statistics, the contribution of the randomly varying noise term, e(k), to the observation, y(k), is minor compared to the contributions due to natural process dynamics and uncontrollable external disturbance effects. For example, the effect of outside temperature variations on the temperature profile in a refinery's distillation column that rises 130 ft into the open air will swamp any variations due to random thermocouple measurement errors. Finally, there is little or no cost associated with taking control action. For example, in response to an increase in the summer afternoon temperature, it is not costly to open a control valve to increase the cooling water flow rate to a distillation column's condenser; what is costly is not increasing the cooling and therefore allowing expensive light overhead material to be lost in the vent stream.
The natural implication of such a perspective is that every observed deviation of each y(k) from the desired yd(k) is considered significant; as a result of which control action is implemented automatically at every sampling instant, k, according to a pre-designed control equation,

u(k) = f(ε(k))    (22.51)

where

ε(k) = yd(k) − y(k)    (22.52)

is the feedback error indicating the discrepancy between the observation y(k) and its desired target value, yd(k), and f(·) is a control law that is based on specific design principles. For example, the standard (continuous time) Proportional-Integral-Derivative (PID) controllers operate according to

$$u(t) = K_c\left[\epsilon(t) + \frac{1}{\tau_I}\int_0^t \epsilon(v)\,dv + \tau_D\frac{d\epsilon(t)}{dt}\right] \qquad (22.53)$$

where Kc, τI and τD are controller parameters chosen to achieve desired controller performance (see, for example, Ogunnaike and Ray, (1994) referenced earlier).
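For completeness, a discrete-time counterpart of Eq (22.53), in the commonly used "velocity" (incremental) form, is sketched below (an illustration, not from the text; dt is the sampling interval, and the form shown is one standard finite-difference discretization):

    # Sketch: discrete-time PID controller in velocity (incremental) form.
    def make_pid(Kc, tau_I, tau_D, dt):
        e1 = e2 = 0.0      # the two previous feedback errors
        u = 0.0            # current controller output
        def step(e):
            nonlocal e1, e2, u
            du = Kc * ((e - e1)                             # proportional change
                       + (dt / tau_I) * e                   # integral term
                       + (tau_D / dt) * (e - 2 * e1 + e2))  # derivative term
            u += du
            e2, e1 = e1, e
            return u
        return step

Each call to step() receives the current feedback error ε(k) = yd(k) − y(k) of Eq (22.52) and returns the updated control action u(k).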
The engineering/automatic control philosophy therefore is to "transfer variability from where it will hurt to where it will not," and as quickly as possible. This is to be understood in the sense that variability transmitted to the process variable, y(k) (for example, the distillation column's temperature profile), if allowed to persist, will adversely affect ("hurt") product quality; by adjusting the manipulated variable u(k) (for example, cooling water flow rate) in order to restore y(k) to its desired value, the variability that would have been observed in y(k) is thus transferred to u(k), which usually is not a problem. (Variation in cooling water flow rate is typically not as much of a concern as, for instance, variation in a distillation column's overhead product quality resulting from variations in the column temperature profile.)
Designing automatic controllers that are able to adjust manipulated variables u(k) to achieve engineering control objectives effectively is discussed in many textbooks on process control. On the surface, it appears as if the SPC and APC approaches are diametrically opposed. However, a discussion in Chapter 28 of Ogunnaike and Ray (1994) puts the two approaches in perspective and shows how, given certain statistical models of chemical processes, the classical automatic controllers on the one hand, and the statistical process control charts on the other, are in fact optimal stochastic controllers. For example:

1. For a pure gain process, for which

η(k) = Ku(k)    (22.54)

and with e(k) ~ N(0, σ²), the minimum variance control law is shown to be equivalent to the Shewhart charting paradigm;

2. If, instead of the zero mean Gaussian noise model for e(k) above, the disturbance model is

d(k) = d(k − 1) + e(k) − θe(k − 1)    (22.55)

(known in time-series analysis as an integrated moving average, IMA(1,1), model), then the minimum variance controller is shown to be equivalent to using the EWMA chart to implement control action;

3. Finally, if the dynamic process model is first order, i.e.,

η(k) = a1η(k − 1) + b1u(k) + d(k)    (22.56)

or second order,

η(k) = a1η(k − 1) + a2η(k − 2) + b1u(k) + b2u(k − 2) + d(k)    (22.57)

where the disturbance model is as in Eq (22.55) above, then the minimum variance controller is shown to be exactly the discrete time version of the PID controller shown in Eq (22.53).

Additional details concerning these matters lie outside the intended scope of this chapter, and the interested reader may consult the indicated reference. One aspect of that discussion that will be summarized here briefly concerns deciding when to choose one approach or the other.

22.4.4 SPC or APC

The following three basic process and control attributes play central roles in choosing between the SPC approach and the APC approach:

1. Sampling Interval: This refers to how frequently the process output variable is measured; it is a process attribute best considered relative to the natural process response time. If a process with natural response time on the order of hours is sampled every minute, the dynamic characteristics of the process will be evident in the measurements, and such measurements will be correlated in time. On the other hand, data sampled every hour from a process with a natural response time of minutes will most likely not show any dynamic characteristics, and the observations are more likely to be uncorrelated.

2. Noise (or disturbance) character: This refers to the random or unexplainable part of the data. Is this due mainly to purely random variation with an occasional special cause shift, or is it due to systematic drifts and frequent special cause shifts?

3. Cost of implementing control action: Are control adjustments costly or mostly cost-free? This is the difference between shutting down a wafer polishing machine to adjust its settings versus opening and closing a control valve on a cold water supply line to a reactor's cooling jacket.

When SPC is More Appropriate

The statistical process control approach is more appropriate when the three process and control attributes have the following characteristics:

1. Sampling Interval: When the sampling interval is large relative to the natural process response time, the assumption that the process mean is essentially constant and free of dynamics is reasonable. This may even allow the operator time to find and eliminate the assignable causes.

2. Noise (or disturbance) character: When the process is not subject to significant disturbances, the observed variability will then essentially be due to random events with infrequent special cause shifts.

3. Cost of implementing control action: When the cost of making control adjustments is significant, changes should be made only after there is conclusive evidence of a real need for the adjustment.

When APC is More Appropriate

The automatic process control approach is more appropriate when the process and control attributes are characterized as follows:

1. Sampling Interval: When the sampling interval is small relative to the natural process response time, the natural process dynamics will be in evidence and the data will be serially correlated, contradicting a fundamental SPC assumption; uncorrected deviations from target will tend to persist; and there will be too many "assignable" causes, many of which cannot be corrected. (There is nothing, for example, that a refinery operator can do in the summer to "eliminate" the increasing outside temperature and prevent it from affecting the temperature profile in a distillation column that is exposed to the atmosphere.)

2. Noise (or disturbance) character: When the process is subject to persistent, significant disturbances, the observed variability will be due more to the effect of frequent special cause shifts than to purely random events that require no action.

3. Cost of implementing control action: When the cost of making control adjustments is negligible, control action should be taken according to well-designed control laws.

22.5 Process and Parameter Design

If acceptance sampling is taking action to rectify quality issues after manufacturing is complete, and process control is taking action during manufacturing, then process and parameter design is concerned with taking action preemptively, before manufacturing. According to this paradigm, wherever possible (and also economically feasible), operating parameters should be chosen to minimize the effects of uncontrollable factors that influence product quality variability. (If it is avoidable and the cost is not prohibitive, why expose a distillation column to the elements?)

22.5.1 Basic Principles

The primary ideas and concepts, due to Genichi Taguchi (born 1924), a Japanese engineer and statistician, involve using design of experiments to improve process operation ahead of time, by selecting those process parameters and operating conditions that are most conducive to robust process operation. With this paradigm, process variables are classified as follows:

1. Response variables: The variables of interest; quality indicators;

2. Control factors: The variables that affect the response; their levels can be decided by the experimenter, or process operator. (The reader familiar with chemical process control will recognize these as the manipulated variables);

3. Noise factors: The variables that affect the response but are not controllable. (Again, readers familiar with chemical process control will recognize these as the disturbance variables.)

Taguchi techniques are concerned with answering the question: At what level of the control factors is the process response least susceptible to the effects of the noise factors?

This question is answered by conducting experiments using what are now known as Taguchi designs. These designs are structurally similar to the fractional factorial designs discussed in Chapter 19: they are orthogonal, and are all Resolution III designs. As in response surface designs, the primary objective is to find optimum settings for the control factors; but the optimization criteria are related not so much to the level of the response as, more importantly, to its variability as well. The designs are therefore based on actively co-opting noise factors and using them to determine, quantitatively, the most robust levels of the control factors.

We are unable to accommodate any detailed discussion of these designs and methods within the scope of this single section; interested readers may consult the research monograph by Taguchi and Konishi (1987)⁶, and/or the textbook by Ross⁷. What we are able to present is a brief theoretical rationale of the Taguchi concept of parameter design, to show how it complements SPC and APC.

22.5.2 A Theoretical Rationale

Let Y represent the response of interest (for example, a product quality variable such as the Mooney viscosity of a commercial elastomer). The expected value and variance of this random variable are the usual μY and σ²_Y, respectively. If y* is the specified desired target value for Y, define

D = Y − y*    (22.58)

⁶ Taguchi, G. and Konishi, S., (1987), Orthogonal Arrays and Linear Graphs, Dearborn, MI, ASI Press.
⁷ Ross, P.J. (1996). Taguchi Techniques for Quality Engineering, McGraw Hill, NY.

Quality Assurance and Control

971

the deviation of this random variable from the target. D is itself a random
variable with expected value, , i.e.,
E(D) =

(22.59)

= E(D) = E(Y ) y = Y y

(22.60)

Observe that by denition,

Let us now consider the following specic loss function:


L(y) = E(D2 )

(22.61)

as the loss incurred by Y deviating from the desired target (in this case, a
squared error, or quadratic, loss function). By introducing Eq (22.58) into Eq
(22.61), we obtain:
"
!
L(y) = E (Y y )2


2
= E [(Y Y ) + (Y y )]
=

Y2 + 2

(22.62)

This can be shown to represent an orthogonal decomposition of the squared


deviation of the quality variable Y , from its target, y ; i.e.,
"
!
(22.63)
L(y) = Y2 + (Y y )2
Inherent Process + Process Operating
Variability

Bias

Traditional SPC and/or engineering APC is concerned with, and can only
deal with, the second term, by attempting to drive Y to y and hence eliminate the bias. Even if this can be achieved perfectly, the rst term, Y2 , still
persists. Alternatively, this decomposition shows that the minimum achievable value for the loss function (achieved when Y = y ) is Y2 . Thus even
with a perfect control scheme, L(y) = Y2 . Taguchis methodology is aimed at
nding parameter settings to minimize this rst term, by design. When the
process is therefore operated at these optimum conditions, we can be sure that
the minimum loss achieved with eective process control is the best possible.
Without this, the manufacturer, even with the best control system, will incur
product quality losses that cannot be compensated for any other way.

22.6

Summary and Conclusions

As with the previous chapter, this chapter has also been primarily concerned with showcasing one more application of probability and statistics that

972

Random Phenomena

has evolved into a full-scale subject matter in its own right. In this particular
case, it is arguable whether the unparalleled productivity enjoyed by modern manufacturing will be possible without the tools of quality assurance and
control, making the subject matter of this chapter one of the most inuential
applications of probability and statistics of the last century.
The presentation in three distinct units was a deliberate attempt to place
in some historical perspective, the techniques that make up the core of quality
assurance and control. Yet, fortuitously (or not) this demarcation also happens
to coincide precisely with where along the manufacturing process time-line the
technique in question is applicable. Thus, what we discussed rst, acceptance
sampling, with its post-production focus and applicability, is almost entirely
a thing of the past (at least as a stand-alone quality assurance strategy). Our
subsequent discussion of process control, the during-production strategy,
covered the historical and modern incarnation of control charts, augmented
with a brief summary of the automatic control paradigmwhich included
a discussion of how the two apparently opposite philosophies are really two
perspectives of the same problem. In the nal unit, we only provided the
briefest of glances at the Taguchi techniques of parameter design, the preproduction strategy, choosing instead to emphasize the basic principles and
rationale behind the techniques. Regardless of the successes of pre-production
designs, however, process control will forever be an intrinsic part of manufacturing; the implementation ideas and techniques may advance, but the basic
concept of monitoring process performance and making real-time, in-process
adjustments to maintain control in the face of unavoidable, unpredictable, and
potentially destabilizing variability, will always be a part of modern manufacturing.
We note, in closing, that because these quality control techniques arose
from industrial needs, and were therefore developed exclusively for industrial
manufacturing processes, they are so completely enmeshed with industrial
practice that acquiring a true practical appreciation outside of the industrial
environment is virtually impossible. To approximate the industrial experience
of applying these techniques (especially to experience, rst-hand, the realtime, sequential-in-time data structure that is intrinsic to these methods)
we oer a few project assignments here in place of the usual exercises and
applications problems.

REVIEW QUESTIONS
1. To what objective is the subject matter of quality assurance and control devoted?
2. What characteristic of manufacturing processes makes quality assurance and control mandatory?
3. What are the three problems associated with assuring the quality of mass produced products? Which one did traditional quality assurance focus on?

Quality Assurance and Control

973

4. What is the prevalent total quality management view of quality assurance?


5. What is acceptance sampling, and what is its dening characteristic?
6. What does acceptance sampling involve?
7. What is an acceptance sampling by attribute problem as opposed to an acceptance sampling by variable problem?
8. What is the Acceptable Quality Level (AQL)?
9. What is the Rejectable Quality Level (RQL), and by what alternative term is it
also known?
10. What does it mean that a lot is of indierent quality?
11. What is a single sampling plan as opposed to a double or multiple sampling
plan?
12. What is the ideal sampling plan from a consumers perspective as opposed to
what is ideal from a producers perspective?
13. Why is , the risk of committing a Type I error in hypothesis testing, also known
as a producers risk?
14. Why is , the risk of committing a Type II error in hypothesis testing, also
known as a consumers risk?
15. What are sampling plans designed to do about the and risks?
16. What is the operating characteristic (OC) curve and what is it mostly used for?
17. To generate a sampling plan, what four parameters must be specied?
18. What is the Binomial approximation OC curve and how is it dierent from the
Poisson OC curve?
19. What is the shape of the ideal OC curve?
20. Why is acceptance sampling not considered a const-eective quality assurance
strategy?
21. What is the underlying philosophy behind process control as a quality assurance
strategy? What primary issues are to be addresses in implementing such a strategy?
22. What is statistical process control (SPC)?
23. When is a process (or process variable) said to be in statistical control?

974

Random Phenomena

24. What is special cause variability as opposed to common cause variability?


25. In what way is SPC like hypothesis testing?
26. What are control charts, and what three lines are found on all SPC charts?
27. In traditional SPC what is the recommended corrective action for an out-ofcontrol situation?
28. What is the dierence between control limits and specication limits?
29. What is the Shewhart Xbar chart used for?
30. What is the sampling distribution underlying the Shewhart chart?
31. What are the characteristic components of the Shewhart chart?
32. What is the S-chart used for and what are its main characteristics?
33. What is the Xbar-R chart and how is it dierent from the I & MR chart?
34. What is the P-chart used for and what are its main characteristics?
35. What is the C-chart used for and what are its main characteristics?
36. Why do basic SPC charts need enhancing?
37. What are the Western Electric rules?
38. What is the CUSUM chart and what are the principles behind it?
39. What is the EWMA chart?
40. What is the objective in chemical process control, and what are the two fundamentally dierent perspectives of the problem?
41. What is the engineering/automatic process control perspective of the control
problem?
42. What is the engineering/automatic process control philosophy about transferring variability?
43. What are the three basic process and control attributes to be used in choosing
between statistical process control (SPC) and automatic process control (APC)?
44. When is SPC the more appropriate approach to chemical process control?
45. When is APC the more appropriate approach to chemical process control?

Quality Assurance and Control

975

46. If acceptance sampling is an after-production strategy, and process control is


a during-production strategy, what sort of strategy is parameter design?
47. In process and parameter design, what distinguishes control factors from
noise factors?
48. Taguchi techniques are concerned with answering what question?
49. What are Taguchi designs?
50. What is the squared error loss function?
51. What are the two components of the squared error loss function and how are
they connected to process control on one hand and parameter design on the other?

PROJECT ASSIGNMENTS
1. Tracking the Dow
Even though many factorssome controllable, some not; some known, some
unknowncontribute to the daily closing value of the Dow Jones Industrial Average
(DJIA) index, it has been suggested that at a very basic level, the change in closing
value from day to day in this index is distributed approximately as a zero mean
random variable. Precisely what the distribution ought to be remains a matter of
some debate.
Develop an SPC chart to track (k) = )k) (k 1), where (k) is the closing
value of the Dow average on day k, with (k 1) as the previous days value. From
historical data during a typical period when the markets could be considered stable, determine base values for and , or else assume that = 0 theoretically so
that the chart will be used to identify any systemic departure from this postulated
central value. Use the value estimated for (and a postulated probability model)
to set the control limits objectively; track (k) for 2 months with this chart. Should
any point fall out of limits during this period, determine assignable causes where
possible or postulate some. Present your results in a report.
Here are some points to consider about this project:
1. The I & MR and other similar charts are based on an implicit Gaussian
distribution assumption. There have been arguments that nancial variables
such as (k) are better modeled by the heavier-tailed Cauchy distribution (see
Tanaka-Yamawaki (2003)8 and Problem 18.16 in Chapter 18). Consider this
in setting control limits and in using the limits to decide which deviations
are to be considered as indicative of out-of-control uctuation in the Dow
average.
2. A possible alternative to consider is the EWMA chart which, as a moving
average, is more likely to dampen excessive, but still typical uctuations.
3. The DJIA is not the only index of the nancial markets; in fact, many analysis
8 Mieko Tanaka-Yamawaki, (2003). Two-phase oscillatory patterns in a positive feedback
agent model Physica A 324, 380387

976

Random Phenomena
argue that it is too narrow; that the Standard and Poors (S & P) 500 index
provides a better gauge on the market. If time permits, consider a second
chart simultaneously for the S & P 500 and assess whether or not the two
indexes exhibit similar characteristics.

2. Diabetes and Process Control


If you or a family member or a friend has diabetes, and the treatment procedure
requires determining blood glucose level periodically, followed by self-administration
of insulin as needed, consider a systematic process control approach to augment what
is currently done.
Conduct a literature search on engineering control approaches for manual administration of insulin (see, e.g., Bequette and Desemone, (2004)9 , and Zisser, et al.,
(2005)10 ). Seek the assistance of a process control expert, if necessary. Determine a
measurement protocol, appropriate control limits for the glucose level, and an appropriate control strategy. Implement this strategy over a period of 1 month. At the
conclusion of the project, write a report on the planning, execution and the results.
If possible, compare the performance of the systematic process control strategy with
the strategy employed previously.

3. C-Chart for Sports Team


The number of points or goals scored by a sports team in any particular game is a
randomly varying quantity that depends on many largely uncontrollable factors. Use
a C-chart to track the performance of your favorite team over a season. Determine
base performance levels from historical data and use these to set the control limits.
At the end of the season assess the teams performance from the perspective of the
chart and write a brief report.

9 Bequette, B. W. and J. Desemone, (2004). Intelligent Dosing System: Need for Design
and Analysis Based on Control Theory, Diabetes Technology & Therapeutics, 6(6): 868-873
10 Zisser, H., Jovanovic L., Doyle, III F. J., Ospina P., Owens C. (2005), Run-to-run
control of mealrelated insulin dosing, Diabetes Technol Ther ; 7(1):48-57

Chapter 23
Introduction to Multivariate Analysis

23.1 Multivariate Probability Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


23.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23.1.2 The Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23.1.3 The Wishart Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23.1.4 Hotellings T -Squared Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23.1.5 The Wilks Lambda Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23.1.6 The Dirichlet Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23.2 Multivariate Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23.3 Principal Components Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23.3.1 Basic Principles of PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Data Preconditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Determining the Principal Components and Scores . . . . . . . . . . . .
23.3.2 Main Characteristics of PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Some important results and implications . . . . . . . . . . . . . . . . . . . . . . .
Properties of PCA Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23.3.3 Illustrative example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Problem Statement and Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
PCA and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23.3.4 Other Applications of PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Multivariate Process Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Model Building in Systems Biology . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23.4 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
REVIEW QUESTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
PROJECT ASSIGNMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Principal Components Analysis of a Gene Expression Data Set

978
978
979
980
982
982
983
984
985
985
986
986
987
989
990
990
991
991
991
999
999
1000
1002
1003
1004
1004

The people are a many-headed beast


Horace (658 BC)

To be sure, many practical problems involving randomly varying phenomena


often manifest in the form of the single, isolated random variable, X, we have
been studying thus far. Such single random variablesbe they continuous
like the yield obtainable from a chemical process, and the reliability of a microwave oven, or discrete like the number of aws found on a manufactured
sheet of glassare mathematically characterized in terms of the probability
models developed phenomenologically in Chapters 8 and 9, and via optimization in Chapter 10. After nalizing the discussion on probability models in
those chapters, we have since focussed on how the models are used to solve
problems of interest. However, it is also true that in a good number of practical problems, the phenomenon of interest involves multiple jointly distributed
random variables, presenting a new set of challenges.
977

978

Random Phenomena

Conceptually, the principles that served us well with univariate problems


remain unchanged: obtain a mathematical characterization (in the form of
an appropriate probability model) and use it to solve the problem at hand.
However, in the multivariate case, things are a bit more complicated. While
the topic of multidimensional random variable characterization was broached
in Chapter 5, beyond the quiet, unannounced appearance of the multinomial
distribution in Chapter 8, not much has been said thus far about multivariate probability models and about intrinsically multivariate problems. That is
about to change in this chapter. But multivariate analysis is a very rich and
very broad topic; no single chapter can do it adequate justice. Therefore, the
objective in this chapter is simply to alert the reader to the existence of these
methods, and to demonstrate in a brief overview, how to generalize what has
been presented in earlier chapters about probability models and statistical
analysis to the multivariate case. We intend to present no more than a few
important, hand-picked multivariate probability models (pdfs of vector-valued
random variables), indicating their applications primarily by analogy to their
univariate counterparts. (The reader will not be surprised to nd that the role
of vectors and matrices in facilitating the scaling up of scalar algebra to multidimensional linear algebra is reprised here, too.) We also introduce principal
components analysis (PCA) as a representative methodology for multivariate exploratory data analysis, illustrating the technique and presenting some
applications.

23.1
23.1.1

Multivariate Probability Models


Introduction

A multivariate probability model is the joint pdf of the multivariate random variable (or random vector) X = (X1 , X2 , . . . , Xn ). Conceptually, it is
a direct extension of the single variable pdf, f (x). As dened in Chapter 5,
each component Xi of the random vector is itself a random variable with its
own marginal pdf, fi (xi ); and in the special case when these component random variables are independent, the joint pdf is obtained as a product of these
individual marginal pdfs, i.e.,
f (x) =

n


fi (xi )

(23.1)

i=1

as we saw in Chapters 13 and 14 while discussing sampling and estimation theory. In general, however, these constituent elements Xi are not independent,
and the probability model will be more complex.
Here are some practical examples of multivariate random variables:

Introduction to Multivariate Analysis

979

1. Student SAT scores: The SAT score for each student is a triplet of numbers: V , the score on the verbal portion, Q, the score on the quantitative
portion, and W , the score on the writing portion. For a population of
students, the score obtained by any particular student is therefore a
three-dimensional random variable with components X1 , X2 and X3
representing the individual scores on the verbal, quantitative and writing portions of the test respectively. Because students who do well in
the verbal portion also tend to do just as well in the writing portion, X1
will be correlated with X3 . The total score, T = X1 + X2 + X3 is itself
also a random variable.
2. Product quality characterization of glass sheets: Consider the case where
the quality of manufactured glass sheets sold specically into certain
markets is characterized in terms of two quantities: X1 , representing the
number of inclusions found on the sheet (see Chapter 1); X2 , representing warp, the extent to which the glass sheet is not at (measured as
an average curvature angle). This product quality variable is therefore a
two-dimensional random variable. Note that one of the two components
is discrete while the other is continuous.
3. Market survey: Consider a market evaluation of a several new products
against their respective primary incumbent competitors: each subject
participating in the market survey compares two corresponding products
and gives a rating of 1 to indicate a preference for the new challenger, 0 if
indierent, and 1 if the incumbent is preferred. The result of the market
survey for each new product is the three-dimensional random variable
with components X1 , the number of preferences for the new product,
X2 , the number of indierents, and X3 , the number of preferences for
the incumbent.
The multinomial model is an example of a multivariate probability model; it
was presented in Chapter 8, as a direct extension of the binomial probability
model. We now present some other important multivariate probability models.
A more complete catalog is available in Kotz et al. (2000)1

23.1.2

The Multivariate Normal Distribution

The joint pdf of the p-dimensional random variable, X = (X1 , X2 , . . . , Xp )T


for which each component random variable, Xi , is normally distributed, i.e.,
Xi N (i , i2 ), is given by:
f (x) =

1
(2)p/2 ||1/2


1
exp (x )T 1 (x )
2

(23.2)

1 S. Kotz, N. Balakrishnan, and N. L. Johnson, (2000). Continuous Multivariate Distributions, Wiley, New York.

980

Random Phenomena

where and are, respectively, the mean vector and the covariance matrix
of the random vector, X, dened by:
= E(X)

(23.3)

!
"
= E (X )(X )T

(23.4)

This is a direct extension of the Gaussian pdf given in Eq (9.125), with


the vector taking the place of the single population mean, , and taking
the place of the population variance, 2 . Thus, the important role of the
Gaussian distribution in univariate analysis is played in multivariate analysis
by the multivariate normal distribution. This particular k-dimensional vector
of random variables is then said to possess a multivariate normal distribution,
which we will represent as X Np (, ).
In the special case where p = 2, so that X = (X1 , X2 )T with X1
N (1 , 12 ) and X2 N (2 , 22 ), the result is the important bivariate normal
distribution, N2 (, ), for which

=
and


=

1
2

12
21


(23.5)

12
22


(23.6)

Because of symmetry, 12 = 21 , and


12 = 1 2

(23.7)

with as the correlation coecient. Eq (23.2) can be written out explicitly in


this case to give:
f (x1 , x2 ) =

1

exp{U }
21 2 1 2

(23.8)

with


(x1 1 )2
1
(x2 2 )2
(x1 1 )(x2 2 )
U=
+
2
2(1 )2
12
22
1 2

(23.9)

Fig 23.1 shows plots of the bivariate Gaussian distribution for = 0 (top) and
= 0.9 (bottom), respectively. When = 0, the two random variables are
uncorrelated and the distribution is symmetric in all directions; when = 0.9,
the random variables are strongly positively correlated and the distribution is
narrow and elongated along the diagonal.

Introduction to Multivariate Analysis

981

U 0
X2



X1

U 0.9
X2

X1


FIGURE 23.1: Examples of the bivariate Gaussian distribution where the two random
variables are uncorrelated ( = 0) and strongly positively correlated ( = 0.9).

23.1.3

The Wishart Distribution

Let X be the n p matrix consisting of a random sample of size n drawn


from a p-dimensional (i.e., p-variate) normal distribution with zero mean and
covariance matrix , i.e., XTi Np (0, ); i = 1, 2, . . . , n.
The elements of the p p dispersion matrix, V, dened by:
V = XT X =

n


Xi XTi

(23.10)

i=1

follow the Wishart distribution, Wp (, n), with n degrees of freedom, named


for the Scottish statistician, John Wishart (1898-1956), who rst developed
the probability model. The pdf is:
f (V) =

|V|(np1)/2 exp{T r(V1 )}




*
2np/2 p(p1)/4 ||n/2 pi=1 ni
2

(23.11)

where (.) is the Gamma function, || indicates the matrix determinant, and
T r() the trace of the matrix.
The Wishart distribution is a multivariate generalization of the 2 distribution. For the single variable case, where p = 1 and = 1, the Wishart
distribution reduces to the 2 (n) distribution. The expected value of V is:
E(V) = n

(23.12)

so that if S is the sample covariance matrix for each of the p-dimensional


normal random variables Xi , then the distribution of S is n1 Wp (, n 1).
Therefore, the role played by the general 2 (r) distribution in univariate

982

Random Phenomena

statistical analysis is played by the Wishart distribution in multivariate analysis. Additional information about this distribution is available in Mardia, et
al. (1979)2 .

23.1.4

Hotellings T -Squared Distribution

Let the vector x be a realization of a p-variate Gaussian random variable, M Np (, ), and let S be the sample covariance matrix obtained from
n samples of the p elements of this vector; i.e.,
1 
)(xi x
)T
(xi x
n 1 i=1
n

S=

(23.13)

where the vector average is dened as


1
xi
n i=1
n

=
x

Then from the previous section, we know that S

(23.14)
1
n Wp (, n1). The

T 2 = n(
x )T S1 (
x )

statistic
(23.15)

has the Hotellings T -squared distribution T 2 (p, n 1). This distribution is


a direct extension of the Students t-distribution to the multivariate case;
it is named for Harold Hotelling (18951973), the American mathematical
statistician and econometrician who rst developed the mathematical model3 .
Furthermore, it can be shown that:
np 2
T (p, n 1) F (p, n p)
p(n 1)

(23.16)

where F (p, n p) is the F -distribution with degrees of freedom p and n


p. Thus, the roles played by the Students t-distribution and T -statistic in
univariate analysis are played in multivariate analysis by the Hotellings T 2
distribution and statistic.

23.1.5

The Wilks Lambda Distribution

The multivariate generalization of the F -distribution is the Wilks Lambda


distribution, named for the Princeton University mathematical statistician,
Samuel S. Wilks (19061964). By direct analogy with its univariate counterpart, let the dispersion matrices U and V be independent and have the
2 K. V. Mardia, Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis, Academic
Press, Duluth, London.
3 H. Hotelling (1931). The generalization of Students ratio, Ann. Math. Statist. 2,
360378.

Introduction to Multivariate Analysis

983

respective Wishart distributions, Wp (I, n) and Wn (I, m); i.e., each associated
covariance matrix is the identity matrix, I. Then the ratio , dened as:
=

|U|
|U| + |V|

(23.17)

has the Wilks (p, n, m) distribution. It can be shown (see the Mardia et
al., reference (Reference 2)) that this distribution can be obtained as the
distribution of a product of independent beta B(, ) random variables i
where
p
m+ip
;=
(23.18)
=
2
2
for m p, i.e., if i B(, ), then
m


i (p, n, m)

(23.19)

i=1

What the F -distribution is to the Students t-distribution in univariate analysis, the Wilks distribution is to Hotellings T 2 distribution.

23.1.6

The Dirichlet Distribution

The Dirichlet distribution, named for the German mathematician Johann


Peter Gustav Lejeune Dirichlet (18051859), is the multivariate extension of
the univariate Beta distribution. It arises as follows: Let Y1 , Y2 , . . . , Yk+1 be
mutually independent random variables each with marginal Gamma(1 , 1)
pdfs, i.e.,
1
y i 1 eyi
(23.20)
fi (yi ) =
(i ) i
Dene k ratios,
Yi
(23.21)
Xi = k+1 ; i = 1, 2, . . . , k
j=1 Yj
It can be shown that the joint pdf for the k-dimensional random variable
X = (X1 , X2 , . . . , Xk ) is given by:
(1 + 2 + + k ) 1 1
k1 1
x
xk1
(1 x2 x2 xk1 )k 1
(1 )(2 ) (k ) 1
(23.22)
where, by denition, 0 < xi < 1 and

f (x) =

x1 + x2 + + xk < 1

(23.23)

Clearly, for k = 1, the pdf reduces to the beta distribution.


The most important characteristics of the distribution are as follows: The
elements i of the mean vector, , are given by:
i
i
(23.24)
=
i = k

i=1 i

984

Random Phenomena

The diagonal elements, i2 , of the covariance matrix, , are given by:


i2 =

i ( i )
2 ( + 1)

(23.25)

while the symmetric, o-diagonal elements are given by:


2
2
ij
= ji
=

i j
2 ( + 1)

(23.26)

This distribution has found application, for example, in wildlife population


studies in which n species (animals, plants, etc) reside in a given geographical
area, the proportion of the area occupied by each species, 1 , 2 , . . . , n tend
to follow a symmetric Dirichlet distribution. In engineering, the distribution
has been used to model the activity times in a PERT (Program Evaluation
and Review Technique) network. In particular, a Dirichlet distribution for the
entire network can be used to derive an upper bound for a projects completion time4 . It is also used as the conjugate prior distribution for Bayesian
estimation of multinomial distribution parameters. The Kotz et al., reference
(Reference 1) contains additional details about applications of the Dirichlet
distribution.

23.2

Multivariate Data Analysis

In what follows, unless stated otherwise, we will consider data consisting


of n variables, (j = 1, 2, . . . , n) and m samples of each variable, to give a data
matrix X that is of dimension m n, i.e.,

X1
x11 x12 . . . x1n
X2 x21 x22 . . . x2n

(23.27)
.. = ..
..
..
..
. .
.
.
.
Xm

xm1

xm2

. . . xmn

The variables can be correlated; furthermore, we impose no distributional


assumptions on the random variables (at least for now).
In general, it is also possible to have another data block, Y, (responses),
consisting of ny variables and m samples, where Y, is related to X (predictors).
Such data arise in various applications, from astrophysics to computer
network trac to chemical processes and molecular biology. The typical objectives of the analysis of such data include, but are not limited to:
4 Monhor, D.,(1987). An approach to PERT: Application of Dirichlet distribution, Optimization, 18 113118.)

Introduction to Multivariate Analysis

985

1. Identication, extraction and quantication of essential features in the


data blocks;
2. Data Compression: Reducing data dimensionality optimally to a
smaller number of essential (independent) components, and subsequent
analysis from the perspective of the reduced data space;
3. Modeling: obtaining the best possible linear relationship between Y
(responses) and X (predictors).
The typical applications in the chemical process industry include multivariate
calibration and classication in analytical chemistry, process monitoring, data
visualization and fault detection in manufacturing processes. Some applications in molecular biology include data-driven modeling and analysis of signal
transduction networks; and characterization of the conformational space of
proteins.
Of the various techniques of multivariate data analysis, including Multiple
Linear Regression (MLR), Principal Component Regression (PCR), Partial
Least Squares Regression (PLSR), we shall focus attention in the rest of the
chapter only on Principal Components Analysis (PCA), since it provides most
of the foundational elements of multivariate data analysis, from which the
interested reader can then launch forays into the other aspects.

23.3

Principal Components Analysis

Principal components analysis (PCA), rst proposed in 1901 by the British


mathematical statistician, Karl Pearson (18571936)5, is a technique for transforming a data set consisting of a (large) number of possibly correlated variables from the original coordinate system into a new and usually more informative one. This new coordinate system consists of a smaller number of
uncorrelated variables called principal components. The new set of variables
forming the new coordinate system are not only mutually uncorrelated, they
also represent a useful redistribution of the information in the data. This is
because the new set of variables are ordered such that the greatest amount
of variability is now captured along the rst axis, the next greatest by the
second, and on down to the last few axes with little or no information left to
capture. PCA is therefore useful for data simplication (dimensionality reduction) because it allows the data to be described with only the rst few truly
informative component axes; it is also therefore useful for exposing hidden
(latent) structures contained in the original data set.
5 K. Pearson, (1901) On Lines and Planes of Closest Fit to Systems of Points in Space.
Phil. Magazine 2 (6) 559572. http://stat.smmu.edu.cn/history/pearson1901.pdf.

986

23.3.1

Random Phenomena

Basic Principles of PCA

Data Preconditioning
PCA is scale-dependent, with larger numerical values naturally accorded
more importance, whether deserved or not. To eliminate any such undue inuence (especially those arising from dierences in the units in which dierent
variables are measured), each data record can mean-centered and scaled
(i.e., normalized) prior to carrying out PCA. But this is problem dependent.
Let the original data set consist of n-column vectors x1 , x2 , . . . xn , each
containing m samples, giving rise to the raw data matrix X . If the mean and
standard deviation for each column are x
i , si , then each variable i = 1, 2, . . . , n,
is normalized as follows:
x x
i
xi = i
(23.28)
si
The resulting m n pre-treated data matrix, X, consisting of columns of data
each with zero mean and unit variance.
Problem Statement
We begin by contemplating the possibility of an orthogonal decomposition
of X by expanding in a set of n-dimensional orthonormal basis vectors, p1 ,
p2 , p3 , . . ., pn , according to:
X = t1 pT1 + t2 pT2 + . . . + tn pTn
in such a way that


pTi pj =

along with


tTi tj =

1 i=j
0 i=
 j
i
0

i=j
i = j

(23.29)

(23.30)

(23.31)

where ti (i = 1, 2, . . . , n), like xi , are m-dimensional vectors.


In particular, if the k-term truncation of the complete n-term expansion,
X = t1 pT1 + t2 pT2 + . . . + tk pTk + E

(23.32)

is such that the resulting residual matrix E,


E=

n


ti pTi

(23.33)

i=k+1

contains only random noise, such an expansion would have then provided a kdimensional reduction of the data. It implies that k components are sucient
to capture all the useful variation contained in the data matrix.
The basis vectors, p1 , p2 , p3 , . . ., pk , are commonly referred to as loading
vectors; they provide an alternative coordinate system for viewing the data.

Introduction to Multivariate Analysis

987

The associated m-dimensional weighting vectors, t1 , t2 , t3 , . . ., tk , are the


corresponding score vectors of the data matrix; they represent how the data
will now appear in the principal component loading space, viewed from the
perspective of the induced new coordinate system.
Together, ti , pi , and k tell us much about the information contained in the
data matrix X. Principal components analysis of the data matrix, X, therefore
involves the determination of ti , pi , and k.
Determining the Principal Components and Scores
Unlike other orthogonal transforms (e.g., Fourier transforms) where the
basis functions are known `
a-priori (sines and cosines for Fourier transforms),
the basis vectors for PCA are not given ahead of time; they must be determined from the data itself. Thus, with PCA, both the basis function set as well
as the coecients of the transform are to be computed simultaneously. The
following is one of several approaches for handling this problem; it is based
on vector optimization.
We begin from
X = t1 pT1 + t2 pT2 + . . . + tk pTk + E

(23.34)

and before engaging in any optimization, we dene the nn symmetric matrix


n = E T E

(23.35)

and also the m m symmetric matrix


m = EET

(23.36)

We now seek the ti and pi vectors that minimize the squared norm of the
appropriate matrix n or m . Which matrix is appropriate depends on the
dimensionality of the vector over which we are carrying out the minimization.
First, to determine each n-dimensional vector pi , since
n = (Xt1 pT1 t2 pT2 . . .tk pTk )T (Xt1 pT1 t2 pT2 . . .tk pTk ) (23.37)
we may dierentiate with respect to the n-dimensional vector pTi obtaining:
n
= 2tTi (X t1 pT1 t2 pT2 . . . tk pTk )
pTi

(23.38)

which, upon setting to the n-dimensional row vector of zeros, 0T , and simplifying, yields
n
= tTi X i pTi = 0T
(23.39)
pTi
where the simplication arises from the orthogonality requirements on ti (see
Eq (23.31)). The solution is:
tT X
(23.40)
pTi = i
i

988

Random Phenomena

Note that this result is true for all values of i, and is independent of k; in
other words, we would obtain precisely the same result regardless of the chosen
truncation. This property is intrinsic to PCA and common with orthogonal
decompositions; it is exploited in various numerical PCA algorithms. The real
challenge at the moment is that Eq (23.40) requires knowledge of ti , which
we currently do not have.
Next, to determine the m-dimensional vectors ti , we work with m ,
m = (Xt1 pT1 t2 pT2 . . .tk pTk )(Xt1 pT1 t2 pT2 . . .tk pTk )T (23.41)
and dierentiate with respect to the m-dimensional vector ti to obtain:
m
= 2(X t1 pT1 t2 pT2 . . . tk pTk )pi
ti

(23.42)

Equating to the m-dimensional column vector of zeros, 0, and simplifying, the


result is:
m
= Xpi ti pTi pi = 0
(23.43)
ti
or,
Xpi
ti = T
= Xpi
(23.44)
pi pi
where the simplication arises again from the orthogonality requirements on
pi (see Eq (23.30)).
We now have two self-referential expressions: one for determining pi if ti
is known, Eq (23.40); the other for determining ti if pi is known, Eq (23.44).
But neither is currently known. To resolve this conundrum, we start from Eq
(23.40) and substitute Eq (23.44) for ti to eliminate it and obtain:
pTi =

pTi XT X
i

(23.45)

and if we let R= XT X, then Eq (23.45) simplies to


pTi R
or pTi (R i I)

= i pTi
= 0

(23.46)

and nally, because both R and I are symmetric, we have


(R i I)pi = 0

(23.47)

This equation is immediately recognized as the eigenvalue-eigenvector equation, with the following implications:

The loading vectors pi are the eigenvectors of the matrix R= XT X,


with the eigenvalues given by i = tTi ti (see eqn (23.31)).

Introduction to Multivariate Analysis

989

Thus, to carry out PCA:


1. Optionally mean-center and normalize the original data matrix X to
obtain the data matrix X;
2. Obtain the matrix R= XT X; (if the data is mean-centered and normalized, this matrix is related to the correlation matrix; if the data is
mean-centered only, then R is related to the covariance matrix , which
1
R);
is m1
3. Obtain the eigenvalues and corresponding eigenvectors of R; arrange
the eigenvalues in descending order such that 1 > 2 > . . . > n ; the
corresponding eigenvectors are the loading vectors pi ; i = 1, 2, . . . , n;
4. Obtain the corresponding scores from Eq (23.44) by projecting the data
matrix onto the loading vector pi .
Even though determining the truncation k is somewhat subjective, some
methods for choosing this parameter are available, for example, in Malinowski
(191)6 and Kritchman and Nadler (2008)7 . For a chosen truncation, k < n
from Eq (23.44), obtain:
[t1 t2 . . . tk ] = X[p1 p2 . . . pk ]

(23.48)

T = XP

(23.49)

or, simply

as the principal component transform of the data matrix X. The k transformed variables T are called the principal components scores.
The corresponding inverse transform is obtained from (23.29) as:
= TPT
X

(23.50)

= X only if k = n; otherwise X
is a cleaned up, lower-rank version
with X
of X. The dierence,

E = XX
(23.51)
is the residual matrix; it represents the residual information contained in the
portion of the original data associated with the (n k) loading vectors that
were excluded from the transformation.
6 Malinowski,

E. R. (1991). Factor Analysis in Chemistry, John Wiley & Sons


S. and B. Nadler (2008). Determining the number of components in a
factor model from limited noisy data. Chemometrics and Intelligent Laboratory Systems
94(1): 19-32.
7 Kritchman,

990

Random Phenomena

23.3.2

Main Characteristics of PCA

Some important results and implications


The determination of P is an eigenvalue-eigenvector problem; the numerical computation is therefore typically carried out via singular value decomposition. The following expressions hold for P:
PT RP =
or R = PPT

(23.52)
(23.53)

with the following implications:


1. Tr(R), the trace of the matrix R, the sum of the diagonal elements of
the matrix R), is equal to Tr(), the sum of the eigenvalues; i.e.,
T r(R) = T r() =

n


(23.54)

i=1

2. Tr(R) is a measure of the total variance in the data block X. From Eq


(23.54), this implies that the sum of the eigenvalues is also a measure of
the total variance in the data block. The fractional contribution of the
j th principal component
to the overall variance in the data is therefore
n
given by j /( i=1 i ).
3. Similarly,
|R| =

n


(23.55)

i=1

If, therefore, the matrix XT X = R is of rank r < n, then only r eigenvalues are non-zero; the rest are zero, and the determinant will be zero.
The matrix will therefore be singular (and hence non-invertible).
4. When an eigenvalue is not precisely zero, just small, the cumulative contribution of its corresponding principal component to the overall variation in the data will likewise be small. Such component may then be
ignored. Thus, by dening the cumulative contribution of the j th principal component as:
j
i
j = i=1
(23.56)
n
i=1 i
one may choose k such that k+1 does not add much beyond k .
Typically a plot of j versus j, known as a Scree plot, shows a knee
at or after the point j = k (see Fig 23.3).

Introduction to Multivariate Analysis

991

Properties of PCA Transformation


As a technique for transforming multivariate data from one set of coordinates into a new, more informative set, here are some of the key properties of
PCA:
1. Each loading vector pTi is seen from Eq (23.40) to be a linear combination of the columns of the data matrix.
2. Each score, ti , is seen from Eq (23.44) as the projection of the data
onto the basis vector pi . By choosing k < n, the data matrix X is
projected down to a lower dimensional space using pTi as the basis for
the projection.
3. For all intents and purposes, PCA replaces the original m n data
matrix X with a better conditioned m k matrix T in a dierent set of
coordinates in a lower dimensional sub-space of the original data space.
The data may be recovered in terms of the original coordinates after
is now
eliminating the extraneous components from Eq (23.50), where X
an m k matrix.
4. Any collinearity problem in X is solved by this transformation because
the resulting matrix T is made up of orthogonal vectors so that TT T
not only exists, it is diagonal.
The principles and results of PCA are now illustrated with the following example.

23.3.3

Illustrative example

Even to veterans of multivariate data analysis, the intrinsic linear combinations of variables can make PCA and its results somewhat challenging to
interpret. This is in addition to the usual plotting and visualization challenge
arising from the inherent multidimensional character of such data analysis.
The problem discussed here has been chosen therefore specically to demonstrate what principal components, scores and loadings mean in real applications, but in a manner somewhat more transparent to interpret8 .
Problem Statement and Data
The problem involves 100 samples obtained from 16 variables, Y1 , Y2 , . . . , Y16 ,
to form a 100 16 raw data matrix, a plot of which is shown in Fig 23.2. The
primary objective is to analyze the data to see if the dimensionality could
be reduced from 16 to a more manageable number; and to see if there are
any patterns to be extracted from the data. Before going on, the reader is
encouraged to take some time and examine the data plots for any visual clues
regarding the characteristics of the complete data set.
8 The

problem has been adapted from an example communicated to me by M. J. Piovoso.

992

Random Phenomena

0
y1

100

y2

0.0

-0.8

-1.6

50

-3

y5

y3

50

100

y4

0
0

-3
y6

-6

-4

y7

y8

0.5

0.0

-1.0

-0.8

-1

-2.5

y9

y10

0.5

0
-2

-1.6

y11

y12
0.5

-0.5

-1

y13

y14
5

0.5
0

-0.5
-1.5

-5
0

50

-1.0

-1.5

-2

100

-2.5

y15

y16

0.5

0.0

-0.5
-3
0

50

100

Index

FIGURE 23.2: Plot of the 16 variables in the illustrative example data set.

PCA and Results


The computations involved in PCA are routinely done with computer
software; specically with MINITAB, the sequence, Stat > Multivariate >
Principal Components > opens a dialog box where the problem characteristics are specied. In particular, the data columns are selected (all 16 of them),
along with the number of principal components to be computed (we select 6 to
start with). The type of data matrix form to use is selected as Correlation
(this is the scaled, mean-centered form; the alternative, Covariance, is
mean-centered only). We chose to store the scores and the loadings (Eigenvalues). MINITAB also has several plotting options. The MINITAB results are
shown below, rst the eigenvalues and their respective contributions:
Eigenanalysis of the Correlation Matrix
Eigenvalue 7.7137 4.8076 0.7727 0.5146
Proportion 0.482 0.300 0.048 0.032
Cumulative 0.482 0.783 0.831 0.863
Eigenvalue 0.4517 0.3407 0.2968 0.2596
Proportion 0.028 0.021 0.019 0.016
Cumulative 0.891 0.913 0.931 0.947

Introduction to Multivariate Analysis

993

Eigenvalue 0.2419 0.1608 0.1248 0.1021


Proportion 0.015 0.010 0.008 0.006
Cumulative 0.962 0.973 0.980 0.987
Eigenvalue 0.0845 0.0506 0.0437 0.0342
Proportion 0.005 0.003 0.003 0.002
Cumulative 0.992 0.995 0.998 1.000
The principal components (the loading vectors) are obtained as follows:
Variable
y1
y2
y3
y4
y5
y6
y7
y8
y9
y10
y11
y12
y13
y14
y15
y16

PC1
0.273
0.002
0.028
-0.016
0.306
-0.348
0.285
0.351
0.326
0.346
0.019
0.344
0.347
0.036
-0.007
-0.199

PC2
0.005
-0.384
0.411
0.381
-0.031
0.001
0.002
-0.018
0.027
-0.020
0.427
-0.025
-0.014
0.427
-0.412
0.030

PC3
-0.015
0.151
0.006
0.155
-0.180
0.055
-0.247
-0.030
-0.014
-0.066
-0.002
0.000
0.011
0.066
0.023
-0.920

PC4
0.760
-0.253
-0.111
0.239
-0.107
0.119
0.312
-0.029
-0.252
-0.063
-0.098
-0.117
-0.193
-0.116
0.142
-0.068

PC5
0.442
0.396
0.125
0.089
0.407
0.009
-0.631
0.019
-0.100
-0.028
0.042
0.029
-0.027
0.124
-0.014
0.175

PC6
0.230
-0.089
0.151
-0.833
0.077
0.172
0.054
0.007
-0.025
-0.037
0.238
-0.198
-0.047
0.070
-0.219
-0.175

These results are best appreciated graphically. First, Fig 23.3 is a Scree
plot, a straightforward plot of the eigenvalues in descending order. The primary characteristic of this plot is that it shows graphically how many eigenvalues (and hence principal components) are necessary to capture most of the
variability in the data. This particular plot shows that after the rst two components, not much else is important. The actual numbers in the eigenanalysis
table show that almost 80% of the variability in the data is captured by the
rst two principal components; the third principal component contributes less
than 5% following the 30% contribution from the second principal component.
This is reected in the very sharp knee at the point k + 1 = 3 in the Scree
plot. The implication therefore is that the information contained in the 16
variables can be represented quite well using two principal components, PC1
and PC2, shown in the MINTAB output table.
If we now focus on the rst two principal components and their associated
scores and loadings, the rst order of business is to plot these to see what
insight they can oer into the data. The individual scores and loading plots
are particularly revealing for this particular problem. Fig 23.4 shows such a
plot for the rst principal component. It is important to remember that the
scores indicate what the new data will look like in the transformed coordinates;
in this case, the top panel indicates that in the direction of the rst principal

994

Random Phenomena
Scree Plot
8
7

Eigenvalue

6
5
4
3
2
1
0
1

7
8
9
10 11
Component Number

12

13

14

15

16

FIGURE 23.3: Scree plot showing that the rst two components are the most important.

component, the data set is essentially linear with a positive slope. The loading
plot indicates in what manner this component is represented in (or contributes
to) each of the original 16 variables.
The corresponding plot for the second principal component is shown in Fig
23.5 where we observe another interesting characteristic: the top panel (the
scores) indicates that in the direction of the second principal component, the
data is a downward pointing quadratic; the bottom panel, the loadings associated with each variable, indicates how this quadratic component contributes
to each variable.
Taken together, these plots indicate that the data consists of only two primary modes: PC1 is linear, and the more dominant of the two; the other, PC2,
is quadratic. Furthermore, the loadings associated with PC1 indicate that the
linear mode manifests negatively in two variables, Y6 and Y16 ; the variables
for which the loadings are strong and positive (i.e., Y1 , Y5 , Y7 , Y8 , Y9 , Y10 , Y12
and Y13 ) contain signicant amounts of the linear mode at roughly the same
level. For the other remaining 6 variables, the indication of Fig 23.4 is that
they do not contain any of the linear mode. The story for PC2 is similar but
complementary: the quadratic mode manifests negatively in two variables Y2
and Y15 , and positively in four variables, Y3 , Y4 , Y11 and Y14 . The quadratic
mode contributes virtually nothing to the other variables.
It is now interesting to return to the original data set in Fig 23.2 to compare
the raw data with the PCA results. It is now obvious that other than noise,
the data sets consists of linear and quadratic trends only, some positive, some
negative. The principal components reect these precisely. For example, 10 of
the 16 variables show the linear trends; the remaining 6 show the quadratic
trend. The rst principal component, PC1, reects the linear trend as the

Introduction to Multivariate Analysis

995

5.0

Scores 1

2.5

0.0

-2.5

-5.0
1

10

20

30

40

50
Index

60

70

80

90

100

Loadings on PC1
0.4
0.3
0.2

PC1

0.1
0.0
-0.1
-0.2
-0.3
-0.4
y1

y2

y3

y4

y5

y6

y7

y8 y9 y10 y11 y12 y13 y14 y15 y16


Variable

FIGURE 23.4: Plot of the scores and loading for the rst principal component. The
distinct trend indicated in the scores should be interpreted along with the loadings by
comparison to the full original data set in Fig 23.2.

996

Random Phenomena

4
3
2

Scores 2

1
0
-1
-2
-3
-4
-5
1

10

20

30

40

50
Index

60

70

80

90

100

Loadings on PC2
0.5
0.4
0.3
0.2
PC2

0.1
0.0
-0.1
-0.2
-0.3
-0.4
y1

y2

y3

y4

y5

y6

y7

y8 y9 y10 y11 y12 y13 y14 y15 y16


Variable

FIGURE 23.5: Plot of the scores and loading for the second principal component. The
distinct trend indicated in the scores should be interpreted along with the loadings by
comparison to the full original data set in Fig 23.2.

Introduction to Multivariate Analysis

997

more dominant and the variables with the linear trends are all identied in
the loadings of the PC1; no variable showing a quadratic trend is included
in this set of loadings. Furthermore, among the variables showing the linear
trend, the slope is negative in Y6 and in Y16 ; it is positive in the others.
This is reected perfectly in the loadings for PC1 where the values associated
with the variables Y6 and Y16 are negative, but positive for the others. In
the same manner the component capturing the downward pointing quadratic
trend, PC2, is associated with two groups of variables: (i) the variables Y2 and
Y15 , whose raw observations show an upward pointing quadratic (hence the
negative values of the loadings associated with PC2); and (ii) the variables,
Y3 , Y4 , Y11 and Y14 , which all show downward pointing quadratic trends; these
all have positive values in the PC2 loadings.
Finally, we show in Fig 23.6, the two-dimensional score and loading plots
for the rst component versus the second. Such plots are standard fare in
PCA. They are designed to show any relationship that might exist between
the scores in the new set of coordinates, and also how the loading vectors of
the rst two principal components are related. For this specic example, the
2-D score plot indicates a distinct quadratic relationship between t1 and t2 .
To appreciate the information encoded in this plot, observe that the scores
associated with PC1 (shown in Fig 23.4) appear linear in the form t1 = a1 x
where x represents the independent variable (because the data matrix has
been mean-centered, there is no need for an intercept). Likewise, the second
set of scores appear quadratic, in the form t2 = a2 x2 (where the exponent is
to be understood as a term-by-term squaring of the elements in the vector x)
so that indeed t2 = bt21 where b = a2 /a21 . This last expression, the relationship
between the two scores, is what the top panel of Fig 23.6 is encoding.
The 2-D loading plot reveals any relationships that might exist between
the new set of basis vectors constituting PC1 and PC2; it invariably leads
to clustering of the original variables according to patterns in the data. In
this particular case, this plot shows several things simultaneously: rst, its
North-South/West-East alignment indicates that in terms of the original data,
these two principal components are pure components: the linear component in
PC1 is pure, with no quadratic component; similarly, PC2 contains a purely
quadratic component. The plot also indicates that the variables Y6 and Y16 ,
cluster together, lying on the negative end of PC1; Y1 , Y5 , Y7 , Y8 , Y9 , Y10 , Y12
and Y13 also cluster together at the positive extreme of the pure component
PC1. The reader should now be able to interpret the vertical segregation and
clustering of the variables showing the quadratic trends.
To summarize, PCA has provided the following insight into this data set:
It contains only two modes: linear (the more dominant) and quadratic;
The 16 variables each contain these modes in pure form: the ones showing
the linear trend do not show the quadratic trend, and vice versa;
The variables can be grouped into four distinct categories: (i) Negative linear (Y6 , Y16 ); (ii) Positive linear (Y1 , Y5 , Y7 , Y8 , Y9 , Y10 , Y12 and

998

Random Phenomena

Score Plot
4
3

Second Component

2
1
0
-1
-2
-3
-4
-5
-5.0

-2.5

0.0
First Component

2.5

5.0

Loading Plot
0.5
y4

0.4

y 11 y 14
y3

Second Component

0.3
0.2
0.1

y 16

y6

0.0

y 1y 7

y9
yy10
13
8
y 5 y12

-0.1
-0.2
-0.3

y2
y 15

-0.4
-0.4

-0.3

-0.2

-0.1
0.0
0.1
First Component

0.2

0.3

0.4

FIGURE 23.6: Scores and loading plots for the rst two components. Top panel: Scores
plot indicates a quadratic relationship between the two scores t1 and t2 ; Bottom panel:
Loading vector plot indicates that in the new set of coordinates, the original variables
contain mostly pure components PC1 and PC2 indicated by a distinctive North/South
and West/East alignment of the data vectors, with like variables clustered together
according to the nature of the component contributions. Compare to the full original
data set in Fig 23.2.

Introduction to Multivariate Analysis

999

Y13 ); (iii) Negative quadratic (Y2 and Y15 ); and (iv) Positive quadratic
(Y3 , Y4 , Y11 and Y14 ).
It is of course rare to nd problems for which the principal components
are as pure as in this example. Keep in mind, however, that this example is a
deliberate attempt to give the reader an opportunity to see rst a transparent
case where the PCA results can be easily understood. Once grasped, such
understanding is then easier to translate to less transparent cases. For example, had one of the variables contained a mixture of the linear and quadratic
trends, the extent of the mixture would have been reected in the loadings for
each of the scores: the length of the bar in Fig 23.4 would have indicated how
much of the linear trend it contains, with the corresponding bar in Fig 23.5
indicating the corresponding relative amount of the quadratic trend. The 2-D
loading vector for this variable will then lie at an angle (no longer horizontal
or vertical) indicative of the relative contribution of the linear PC1 and that
of PC2 to the raw data.
Additional information especially about implementing PCA in practice
may be found in Esbensen (2002)9 , Naes et al., (2002)10 , Brereton (2003)11 ,
and in Massart et al., (1988)12 .

23.3.4

Other Applications of PCA

Multivariate Process Monitoring


When the task of ascertaining that a process is in control requires simultaneous monitoring of several process variables, the single-variable charts discussed in the previous chapter will no longer work because of potential correlation between variables. PCA can be, and has been, used to handle this problem. Carrying out PCA on a "training" data set, X, produces loading vectors, pi, the matrix of eigenvalues, Λ, and the sample covariance matrix, S, for typical operation. This exercise achieves two important objectives: (i) it takes care of any correlations in the variables by reordering the data matrix in terms of scores and orthogonal PC loadings; but more importantly, (ii) it provides a data-based model of what is considered normal operation. Each new sample can then be evaluated against normal operation by projecting it onto the loading vectors and analyzing the result in the score space.
9 K. H. Esbensen, (2002). Multivariate Data Analysis - In Practice (5th Edition), Camo Process AS.
10 Naes, T., T. Isaksson, T. Fearn and T. Davies (2002). A User-Friendly Guide to Multivariate Calibration and Classification. Chichester, UK, NIR Publications.
11 Brereton, R. G. (2003). Chemometrics: Data Analysis for the Laboratory and Chemical Plant, Wiley & Sons.
12 Massart, D. L., B. G. M. Vandeginste, S. N. Deming, Y. Michotte and L. Kaufman (1988). Chemometrics: A Textbook. Amsterdam, Netherlands, Elsevier.

1000

Random Phenomena

according to:
= t1 pT + t2 pT + . . . + tk pT + E

X
1
2
k

(23.57)

Two statistics are used to assess the new data set against normal operation. The Q statistic, computed from $\tilde{e}_i$, the ith row of $\tilde{E}$,

$$Q_i = \tilde{e}_i^T \tilde{e}_i \qquad (23.58)$$

is the error sum of squares for the ith sample (also known as the lack-of-model-fit statistic); it provides a measure of how well the ith sample conforms to the PCA model and represents the distance between the sample and its projection onto the k-dimensional principal components space. A large value implies that the new data does not fit well with the correlation structure captured by the PCA model.
The second statistic was actually introduced earlier in this chapter: the Hotelling $T^2$ statistic,

$$T_i^2 = \tilde{x}_i^T S^{-1} \tilde{x}_i = \tilde{t}_i^T \Lambda^{-1} \tilde{t}_i \qquad (23.59)$$

in terms of the original data and, equivalently, in terms of the PCA scores and eigenvalues; it provides a measure of the variation within the new sample relative to the variation within the model. A large $T_i^2$ value indicates that the data scores are much larger than those from which the model was developed; it provides evidence that the new data is located in a region different from the one captured in the original data set used to build the PCA model. These concepts are illustrated in Fig 23.7, adapted from Wise and Gallagher, (1996)13.
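For readers who wish to experiment with these quantities, the following sketch shows one way of computing the projected scores, Q, and T² of Eqs (23.57)-(23.59) directly with basic linear algebra (in Python with numpy; the function name and the illustrative data are our own assumptions, not part of any standard monitoring package):

```python
import numpy as np

def pca_monitoring_stats(X, X_new, k):
    """Build a k-component PCA model from training data X (rows are samples)
    and evaluate each row of X_new against it, returning the projected scores
    (Eq 23.57), the Q statistic (Eq 23.58) and T^2 (Eq 23.59)."""
    mu = X.mean(axis=0)
    S = np.cov(X - mu, rowvar=False)           # sample covariance matrix, S
    lam, P = np.linalg.eigh(S)                 # eigenvalues, loading vectors
    order = np.argsort(lam)[::-1]              # sort by decreasing eigenvalue
    lam, P = lam[order[:k]], P[:, order[:k]]   # retain Lambda and p_1,...,p_k

    Xc = X_new - mu                            # center new data the same way
    T = Xc @ P                                 # scores of the new samples
    E = Xc - T @ P.T                           # error matrix
    Q = np.sum(E**2, axis=1)                   # Q_i = e_i' e_i
    T2 = np.sum(T**2 / lam, axis=1)            # T_i^2 = t_i' Lambda^{-1} t_i
    return T, Q, T2

# Illustrative use: random "normal operation" data, plus deliberately shifted samples
rng = np.random.default_rng(1)
X_train = rng.standard_normal((200, 5))
scores, Q, T2 = pca_monitoring_stats(X_train, X_train[:3] + 0.8, k=2)
```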
To determine when large values of these statistics are significant, control limits must be developed for each one, but this requires making some distributional assumptions. Under normality assumptions, confidence limits for the T² statistic are obtained from the F-distribution, as indicated in Eq (23.16); for Q, the situation is a bit more complicated, but the limits are still easily computed numerically (see e.g., Wise and Gallagher (1996)). Points falling outside of the control limits then indicate an out-of-control multivariate process in precisely the same manner as with the univariate charts of the previous chapter. These concepts are illustrated in Fig 23.8 for process data represented with 2 principal components.
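By way of illustration, one commonly quoted form of the F-based limit is T²_lim = k(n-1)/(n-k)·F_α(k, n-k) for k retained components and n training samples; the exact expression used in this text should be confirmed against Eq (23.16). A one-function sketch (in Python with scipy, an assumption; any package with F quantiles will do):

```python
from scipy import stats

def t2_control_limit(n, k, alpha=0.01):
    # One commonly used F-distribution-based upper control limit for T^2
    # with k retained components and n training samples; confirm the exact
    # form against Eq (23.16) before relying on it in practice.
    return k * (n - 1) / (n - k) * stats.f.ppf(1 - alpha, k, n - k)

print(t2_control_limit(n=200, k=2))   # e.g., a 99% limit for the sketch above
```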
The Wise and Gallagher reference contains, among other things, additional
discussions about the application of PCA in process monitoring.
Model Building in Systems Biology
PCA continues to find application in many non-traditional areas, with its infiltration into biological research receiving increasing attention. For example, the applications of PCA in regression and in building models have been extended to signal transduction models in Systems Biology. From multivariate
13 Wise, B. M. and N. B. Gallagher, (1996). "The process chemometrics approach to process monitoring and fault detection," J. Process Control, 6 (6) 329–348.

FIGURE 23.7: Principal component model for a 3-dimensional data set described by two principal components on a plane, showing a point with a large Q and another with a large T² value.

FIGURE 23.8: Control limits for Q and T² for process data represented with two principal components.


measurements of various aspects of intracellular signalling molecules, PCA has been used to find, among other things, fundamental dimensions that appear to constitute molecular "basis axes" within the signaling network that the cell uses for the apoptotic program14. An application of PCA to the analysis of protein conformational space can also be found in Tendulkar, et al. (2005)15. Upon visualizing the distribution of protein conformations in the space spanned by the first four PCs as a set of conditional bivariate probability distribution plots, the peaks are immediately identified as corresponding to the preferred protein conformations.
A brief primer on PCA, especially from the perspective of analyzing molecular biology data, is available in Ringnér (2008)16. A review of other applications of PCA and PLSR in deriving biological insights from multivariate experimental data, along with a discussion of how these techniques are becoming standard tools for systems biology research, is contained in Janes and Yaffe (2006)17.

23.4 Summary and Conclusions

Strictly speaking, this final chapter in the trilogy of chapters devoted to showcasing substantial, subject-matter-level applications of probability and statistics is different from the other two. While Chapters 21 and 22 cover true applications of probability and statistics, what this chapter covers is not so much an application as it is an extension to higher dimensions. In this sense, what Chapters 8–20 are to Chapter 4 (on the univariate random variable), this chapter is, albeit in condensed, miniaturized form, to Chapter 5. The coverage of multivariate probability models was brief, and even then, only those whose scalar analogs play significant roles in univariate analysis were featured. This device made it easy to place the roles of the multivariate distributions in multivariate analysis in context quickly and efficiently.
In this era of facilitated acquisition and warehousing of massive data sets, the well-trained scientist and engineer (especially those working in the chemical and other manufacturing industries) must be aware of the intrinsic nature of multivariate data, and of the random variables from which such data
14 Janes, K. A., J. G. Albeck, S. Gaudet, P. K. Sorger, D. A. Lauffenburger, and M. B. Yaffe (2005). "A Systems Model of Signaling Identifies a Molecular Basis Set for Cytokine-Induced Apoptosis," Science, 310, 1646-1653.
15 Tendulkar, A. V., M. A. Sohoni, B. A. Ogunnaike, and P. P. Wangikar, (2005). "A geometric invariant-based framework for the analysis of protein conformational space," Bioinformatics, 21 (18) 3622-3628.
16 Ringnér, M. (2008). "What is principal component analysis?" Nature Biotech. 26(3), 303-304.
17 K. A. Janes and M. B. Yaffe, (2006). "Data-driven modelling of signal-transduction networks," Nature, 7, 820-828.

arose, and how to analyze such data appropriately. Unfortunately, only the brief overview presented here is possible in the amount of space available in such a textbook. But our objective was never comprehensive coverage; rather, it was to make the reader aware that multivariate analysis is more than merely a "multiplication of univariate ideas by n": probability model structures become significantly more complicated, and only the analysis techniques tailored to such complex structures can be effective for multivariate problem-solving.
In light of the limited space budget, the amount of time and space allocated to principal components analysis (PCA) might be surprising at first, but this is in recognition of how pervasive this technique is becoming (or, in some cases, has already become) in many modern manufacturing enterprises, including the traditionally conservative pharmaceutical industry. Still, a more comprehensive discussion of PCA was not possible. As such, how the approach itself was presented, and the special illustrative example employed, were jointly calibrated mostly to promote fundamental understanding of a technique whose results are prone to misuse and misinterpretation because the technique itself is not always easy to understand.
Finally, we believe that there is no better way to demonstrate PCA, and to develop an understanding of how to interpret its results, than by actually doing real multivariate data analysis on real data sets. The project assignment at the end of the chapter offers such an opportunity to apply PCA to a real molecular biology data set.

REVIEW QUESTIONS
1. What is a multivariate probability model? How is it related to the single variable
pdf?
2. The role of the Gaussian distribution in univariate analysis is played in multivariate analysis by what multivariate distribution?
3. The Wishart distribution is the multivariate generalization of what univariate
distribution?
4. Hotelling's T-squared distribution is the multivariate generalization of what univariate distribution?
5. What is the multivariate generalization of the F -distribution?
6. The Dirichlet distribution is the multivariate generalization of what univariate
distribution?
7. What is Principal Components Analysis (PCA) and what is it useful for?
8. In Principal Components Analysis (PCA), what is a loading vector and a
score vector?


9. How are the loading vectors in PCA related to the data matrix?
10. How are the score vectors in PCA related to the data matrix and the loading
vectors?
11. In PCA, what is a Scree plot?
12. How is PCA used in process monitoring?

PROJECT ASSIGNMENT
Principal Components Analysis of a Gene Expression Data Set
In the Ringnér (2008) reference provided earlier, the author used a set of microarray data on the expression of 27,648 genes in 105 breast tumor samples to illustrate how PCA can be used to represent samples with a smaller number of variables, visualize samples and genes, and detect dominant patterns of gene expression. Only a brief summary of the resulting analysis was presented. The data set, collected by the author and his colleagues, and published in Saal, et al., (2007)18, is available through the National Center for Biotechnology Information Gene Expression Omnibus database (accession GSE5325) and from http://icg.cpmc.columbia.edu/faculty_parsons.htm.
Consult the original research paper, Saal, et al., (2007) (which includes the application of other statistical analysis techniques, such as the Mann-Whitney-Wilcoxon test of Chapter 18, but not PCA), in order to understand the research objectives and the nature of the data set. Then download the data set and carry out your own PCA on it. Obtain the scores and loading plots for the first three principal components and generate 2-D plots similar to those in Fig 23.6 in the text. Interpret the results to the best of your ability. Write a report on your analysis and results, comparing them where possible with those in Ringnér (2008).
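For students unsure where to begin, the core computation can be sketched as follows (in Python with numpy and scikit-learn; the file name and data layout are assumptions to be adapted to however you arrange the downloaded GSE5325 matrix, including whatever treatment of missing values the original paper suggests):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical layout: rows = the 105 tumor samples, columns = gene expression values
data = np.loadtxt("gse5325_expression.txt")

pca = PCA(n_components=3)
scores = pca.fit_transform(data)        # 105 x 3 matrix of sample scores
loadings = pca.components_.T            # genes x 3 matrix of loadings
print(pca.explained_variance_ratio_)    # fraction of variance captured per PC
```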

18 Saal, L. H., P. Johansson, K. Holm, et al. (2007). "Poor prognosis in carcinoma is associated with a gene expression signature of aberrant PTEN tumor suppressor pathway activity," Proc. Natl. Acad. Sci. USA 104, 7564-7569.

Appendix

Knowledge is of two kinds:
we know a subject ourselves,
or we know where we can find information on it.
Samuel Johnson (1709–1784)

Computers and Statistical Computations


In traditional textbooks on probability and statistics, this part of the book is where one would typically find tables for generating random numbers, and for determining (tail area) probabilities for different sampling distributions. The collection usually includes tables of standard normal probabilities, or z-tables; t-tables for various degrees of freedom, ν; χ²(ν) tables; F-tables for a select combination of degrees of freedom; even binomial and Poisson probability tables. Generations of students learned to use these tables to determine probabilities, a procedure that often requires interpolation, since it is impossible to provide tables dense enough to include all possible variates and corresponding probabilities.
Things are different now. The universal availability of very fast and very powerful computers of all types, from personal desktop machines to servers and supercomputers, has transformed all aspects of scientific computing. So many software packages have now been developed specifically for carrying out every conceivable computation involved in probability and statistics that many of what used to be staples of traditional probability and statistics textbooks are no longer necessary, especially statistical tables. Random numbers can be generated for a wide variety of distributions; descriptive and graphical analysis, estimation, regression analysis and hypothesis tests, and much more, can all now be carried out with ease. The computer has thus rendered printed statistical tables essentially obsolete, both by eliminating the need for them and by making it possible to have them available electronically. In this book, therefore, we are departing from tradition and will not include any statistical tables. What we are supplying here instead is a compilation of useful information about some popular software packages and, for those who might still want them, on-line electronic versions of statistical tables; we also include a few other on-line resources that the reader might find useful.
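As a single concrete illustration of the point (here in Python with the free scipy package, one option among many; MINITAB and the packages listed below provide equivalent facilities), the tail-area probabilities and variates once read off printed z-, t-, χ²- and F-tables reduce to one-line computations:

```python
from scipy import stats

print(stats.norm.sf(1.96))           # P(Z > 1.96): an upper-tail z-"table" lookup
print(stats.t.ppf(0.975, df=10))     # t variate with upper-tail area 0.025, nu = 10
print(stats.chi2.sf(18.31, df=10))   # P(chi-square(10) > 18.31)
print(stats.f.ppf(0.95, 3, 12))      # F variate with alpha = 0.05 and (3, 12) d.o.f.
```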

Statistical Software Packages


We adopted the use of MINITAB in this text primarily because, for the uninitiated, its learning curve is not steep at all; its drop-down menus are intuitive, and anyone familiar with Excel spreadsheets can learn to use MINITAB with little or no instruction. Of course, several other packages are equally popular and can be used just as well. Below is a list of just a few of these, and where to find further information about them.
1. MINITAB: A commercial software package with a free 30-day trial
download available at:
http://www.minitab.com/downloads/
A student version and academic licensing are also available.
2. R: A free software package for statistical computing and graphics that
is based on a high level language:
http://www.r-project.org/ or
http://cran.r-project.org/.
3. S-Plus: A commercial package based on the S programming language:
http://www.insightful.com/products/splus/
A free 30-day trial CD is available upon request.
4. MATLAB Statistics Toolbox: A commercial toolbox for use with
MATLAB, the general purpose scientic computing software popular
among control engineers:
http://www.mathworks.com/products/statistics/
5. SAS: A commercial package:
http://www.sas.com/technologies/analytics/statistics/
6. JMP: An alternative commercial product from SAS with a free 30-day
trial download available at:
http://www.jmp.com/
Academic licensing is also available.
7. STATISTICA: A commercial package:
http://www.statsoft.com/products/products.htm.

On-line Calculators and Electronic Tables


As noted above, the advent of computer software packages has essentially
eliminated the need for traditional statistical tables. Even then, these tables

Appendix

1007

are now freely available electronically on-line. Some are deployed fully in electronic form, with precise probability or variate values computed on request
from pre-programmed probability distribution functions. Others are electronic only in the sense that the same numbers that used to be printed on
paper are now made available in an on-line table; they still require interpolation. Either way, if all one wants to do is simply compute tail area probabilities
for a wide variety of the usual probability distributions employed in statistical
inference, a dedicated software package is not required. Below is a listing of
three electronic statistical tables, their locations, and a brief summary of their
capabilities.
1. http://stattrek.com/Tables/StatTables.aspx
Conceived as a true on-line calculator, this site provides the capability
for computing, among other things, all manner of probabilities for a
wide variety of discrete and continuous distributions. There are clear
instructions and examples to assist the novice.
2. http://stat.utilities.googlepages.com/tables.htm
(SurfStat statistical tables by Keith Dear and Robert Brennan.)
Truly electronic versions of z-, t-, and χ² tables are available with a
convenient graphical user-interface that allows the user to specify either
the variate and compute the desired tail area probabilities, or to specify
the tail area probability and compute the corresponding variate. The
F -tables are available only as text.
3. http://www.statsoft.com/textbook/sttable.html
Provides actual tables of values computed using the commercial STATISTICA BASIC software; available tables include z-, t-, and χ²; the only F-tables available are for F(0.1, 0.05) and F(0.025, 0.01). (The site includes animated Java images of the various probability distributions showing various computed, and constantly changing, cumulative probabilities.)

Other On-line Resources


Below is a listing of other resources available on-line:
1. Glossary: A glossary of statistical terms arranged in alphabetical order
is available at:
http://www.animatedsoftware.com/elearning/Statistics%20Explained/glossary/se_glossary.html
2. Electronic Handbooks: Two of the most comprehensive and most
useful are listed below:
(a) The NIST/SEMATECH e-Handbook of Statistical Methods:
http://www.itl.nist.gov/div898/handbook/, and

1008

Appendix
(b) The StatSoft Electronic Statistics Textbook:
http://www.statsoft.com/textbook/stathome.html

3. Data sets: The following site contains links to a wide variety of statistical data, categorized by subject area and government agency. In addition, it provides links to other sites containing statistical data.
http://www.libraryspot.com/statistics.htm
Also, NIST, the National Institute of Standards and Technology, has a site dedicated to reference data sets with certified computational results. The original purpose was to enable the objective evaluation of statistical software, but instructors and students will find the data sets to be a good source of extra exercises (based on certified data).
http://www.itl.nist.gov/div898/strd/index.html

Index

acceptance sampling, 222


acceptance sampling plans, 937
acceptance/rejection criteria, 937
action potentials, 269, 772
aerial bombardment of London, 869
aggregate descriptions, 4, 409
aggregate ensemble, 68, 100
Albeck, J. G., 1002
alternative hypothesis, 554
Anderson, T. W., 771
Anderson-Darling
statistic, 736
test, 736
Anderson-Darling test, 771, 775
more sensitive than K-S test, 772
test statistic, 771
ANOVA
fixed, random, mixed effects, 803
ANOVA (Analysis of Variance), 311, 605,
796
central assumptions, 796
identity, 672, 800
ANOVA tables, 675
one-way classification, 801
Antoine's equation, 728
arithmetic average, 14
Asprey, S., 839
Assisted Reproductive Technology (ART),
364
Atkinson, A., 839
Atwood, C.L., 251
automatic process control, see engineering process control
Avandia®, 138, 141
average deviation, 440
Balakrishnan, N., 979
bar chart, 419
bar graph, see bar chart
baroreceptors, 929

Bayes estimator, 520


Bayes' rule, 76
Bayes' Theorem, 519
Bayes, Revd. Thomas, 77, 519
Bayesian estimation, 518–527, 862
recursive, 525
application to Prussian army
data, 860
Bequette, B. W., 976
Bernoulli random variable, 221, 761
mathematical characteristics, 222
probability model, 221
Bernoulli trial, 221, 225, 372
Bernoulli, Jakob, 198
beta distribution, 522, 983
many shapes of, 303
beta random variable, 301–308
application, 305
in quality assurance, 305
generalized, 307
inverted, 308
mathematical characteristics, 302
probability model, 302
relation to gamma random variable,
301
Bibby, J. M., 982
binary classification tests, 558
sensitivity, 559
specificity, 559
binomial distribution, 761
binomial random variable, 225–230, 280,
372
applications, 227
inference, 228
mathematical characteristics, 226
probability model, 226
relation to other random variables,
227
Birtwistle, M. R., 269, 330, 453, 839
Bisgaard, S., 851

bivariate Gaussian distribution, 980
bivariate random variable, 139
continuous, 149
definition, 139
discrete, 150
informal, 140
blocking, 805
blood pressure control system, mammalian, 929
Boltzmann, L. E., 353
Borel fields, 68
box plot, 427, 429
Box, G. E. P., 731, 811, 834, 851
Brauer, F., 890
Burman, J. P., 832
C-chart, 957
calculus of variations, 345
calibration curves, 659
Cardano, 198
Castillo-Chavez, C., 890
catalyst, 12
Cauchy distribution, 786
application in crystallography, 789
model for price fluctuations, 786
Cauchy random variable, 314
application
high-resolution price fluctuations, 316
mathematical characteristics, 315
probability model, 314
relation to other random variables,
316
cell culture, 173
Central Limit Theorem, 288, 468, 471,
608, 947
chain length distributions, 235
most probable, 235
Chakraborti, S, 778
characteristic function, 115–116, 177
inversion formula, 116
characteristic parameters, 409
Chebyshev's inequality, 121, 229, 492
chemical engineering
illustration, 35
principles, 38
chemical process control, 964–969
chemical reactors, 35
chi-square random variable

application, 272
special case of gamma random variable, 271
Chi-squared goodness-of-fit test, 739–745
relation to z-test, 745
chi-squared test, 601
Chinese Hamster Ovary (CHO) cells,
330, 361, 453
Clarke, R. D., 868
co-polymer composition, 140
coefficient of determination
adjusted, R2adj, 673
coefficient of determination, R2, 672, 673
coefficient of variation, 109
commercial coating process, 879
completely randomized design, 798
balanced or unbalanced, 799
complex system
crosslinks, 909
series-parallel configuration, 906
complex variable, 115
computer programs, 426
conditional distribution
general multivariate, 153
conditional expectation, 156
bivariate, 156
conditional mean, 157
conditional probability
empirical, 209
conditional probability distribution
definition, 147
conditional variance, 157
confidence interval, 509
around regression line, 668
relation to hypothesis tests, 575
confidence interval, 95%, 507
mean of normal population, 507
on the standard deviation, 510
confidence intervals
in regression, 661
conjugate prior distribution, 862
consistent estimator, 492
constrained least squares estimate, 696
continuous stirred tank reactor (CSTR),
35, 193, 261, 325, 361
control charts, 946
C-chart, 957
CUSUM charts, 961
EWMA charts, 963


graphical means of testing hypotheses, 946


I and MR chart, 952
P-chart, 954
S-chart, 950
Shewhart chart, 947
Western Electric Rules, 960
Xbar-R chart, 951
Xbar-S chart, 950
control hardware electronics, 143
convolution integrals, 179
Corbet, S., 255
correlation coefficient, 158
Coughlin, R., 210
covariance, 157
covariance matrix, 689
Cramer-Rao inequality, 492
Cramer-Rao lower bound, 492
critical region, 556
cumulative distribution function, 96, 99,
100
bivariate, 142
cumulative hazard function, 124, 272
CUSUM charts, 961

application, 220
model, 220
distribution
conditional, 147
joint, 141
joint-marginal, 156
marginal, 144, 145, 156
multimodal, 117
posterior, 520
prior, 520
symmetric, 109
unimodal, 117
distributions, 95, 107
leptokurtic , 111
moments of, 107
of several random variables, 141
platykurtic, 111
relationship between joint, conditional, marginal, 519
DNA replication origins
distances between, 269
Donev, A., 839
Doyle, III F. J., 976
Draper, N. R., 834

Dantzig, T., 363


Darling, D. A., 771
data
nominal, 418
ordinal, 418
data characteristics, 436–442
central location, 436
variability, 440
de Laplace, Pierre-Simon, 279
de Moivre, Abraham, 198, 279
death-by-horse kicks, 858
deaths in London, 41
DeMorgan's Laws, 62
Desemone, J., 976
design of experiments, 415
deterministic, 14
expression, 34
idealization, 2
phenomena, 3
differential entropy, 342
Dirac delta function, 38
Dirichlet distribution, 983
Dirichlet,J. P. G. L., 983
discrete uniform random variable, 219

economic considerations, 12
efficiency, 491
efficient estimator, 491
empirical frequencies, 205
engineering process control, 966
feedback error, 966
Proportional-Integral-Derivative (PID)
controllers, 966
engineering statistics, 2
ensemble, 41
entropy, 119, 338–344
of Bernoulli random variable, 340
of continuous random variable, 340
differential entropy, 342
of deterministic variable, 339
of discrete uniform random variable, 339
Erlang distribution, see gamma distribution
Erlang, A. K., 266
estimation, 489
estimator, 489
estimator characteristics
consistent, 492

efficient, 491
unbiased, 490
estimators
criteria for choosing, 490–493
method of moments, 493
sequence of, 492
Euler equations, 345, 346, 348–350
event, 59
certain, 61, 63
complex, compound, 60
impossible, 61, 63, 67
simple, elementary, 60
events
compound, 64
elementary, 64
mutually exclusive, 63, 69, 78
EWMA charts, 961
related to Shewhart and CUSUM
charts, 963
expected value, 102
definition, 105
properties
absolute convergence, 105, 106
absolute integrability, 105, 106
experiment, 58
conceptual, 58, 64
experimental studies
phases of, 794
exponential distribution
memoryless distribution, 263
discrete analog of geometric distribution, 261
exponential pdf
application in failure time modeling, 913
exponential random variable, 260–264
applications, 263
mathematical characteristics, 262
probability model, 261
special case of gamma random variable, 266
extra cellular matrix (ECM), 361
F distribution, 311, 474
F random variable, 309–311
application
ratio of variances, 311
definition, 309
mathematical characteristics, 310

probability model, 310
F-test, 604
in regression, 674
sensitivity to normality assumption,
605
factor levels, 795
factorial designs, 2k , 814
characteristics
balanced, orthogonal, 816
model, 816
sample size considerations, 818
factorial experiments, 814
factors, 795
failure rate, 911
failure times
distribution of, 913
failure-rate
decreasing, model of, 914
increasing, model of, 914
Fermat, 198
first order ODE, 38
first-principles approach, 2
Fisher information matrix, 837, 848
Fisher, R. A., 209, 255, 309, 541, 858
fluidized bed reactor, 325
Fourier transforms, 116
fractional factorial designs
alias structure, 824
defining relation, 824
design resolution, 824
folding, 826
projection, 825
frequency distribution, 18, 20, 424, 426
frequency polygon, 426
frequentist approach, 426
functional genomics, 305
Gallagher, N. B., 1000
gamma distribution, 266
generalization of Erlang distribution, 266
model for distribution of DNA replication origins, 269
gamma pdf
application in failure time modeling, 917
gamma random variable, 180, 181, 264–271, 462
applications, 269

generalization of exponential random variable, 264
mathematical characteristics, 266
probability model, 265
reproductive property of, 181
Garge, S., 853
Gaudet, S., 1002
Gauss, J.C.F., 279
Gaussian distribution, 654
bivariate, 980
Gaussian probability distribution, 288
Gaussian random variable, 279–292
applications, 290
basic characteristics, 288
Herschel/Maxwell model, 285–287
limiting case of binomial random
variable, 280
mathematical characteristics, 288
misconception of, 288
probability model, 288
random motion in a line, 282–285
Gelmi, C. A., 306
gene expression data set, 1004
genetics, 199
geometric random variable, 234
application, 235
mathematical characteristics, 234
probability model, 234
geometric space, 44
Gibbons, J. D., 778
Gossett, W. S., 312
Gram polynomials, 706
granulation process, 296, 297
graphical techniques, 442
Graunt, John, 41
gravitational field, 44
Greenwood, M., 252
group classification, 18
hazard function, 123, 272, 911
bathtub curve, 912
equivalently, failure rate, 912
heat transfer coefficient, 34
Hendershot, R. J, 837
hereditary factors, 203
genes, 203
heredity
dominance/recessiveness, 203
genotype, 204

phenotype, 204
Heusner, A. A., 729
Hirschfelder-Curtiss-Bird, 638
histogram, 18, 424
for residual errors, 678
Hoerl, A.E., 697
Hotelling T 2 statistic, 1000
Hotelling's T-squared distribution, 982
Hotelling, H., 982
housekeeping genes, 764
Hunter, J. S., 811
Hunter, W. G., 811
Huygens, 198
hydrocarbons, 455
hypergeometric random variable, 222
application, 224
mathematical characteristics, 224
probability model, 223
hypothesis
alternative, Ha , 560
null, H0 , 560
one-sided, 555
two-sided, 554
hypothesis test, 555
p-value, 560
error
Type I, 557
Type II, 558
general procedure, 560
non-Gaussian populations, 613
power and sample size determination, 591–600
power of, 558
risks
α-risk, 558
β-risk, 558
two proportions, 611
using MINITAB, 573
hypothesis test, significance level of, 557
hypothesis testing
application to US census data, 876
in regression, 664
ideal probability models, 4
in-vitro fertilization (IVF), 42, 101, 225,
413
binomial model
sensitivity analysis of, 393
binomial model for, 372, 377

binomial model validation, 375
Canadian guidelines, 370
central characteristics, 371
clinical data, 367
clinical studies, 367
Elsner clinical data, 375
Elsner clinical study, 369
Elsner study
study characteristics, 375
guidelines and policy, 370
implantation potential, 369, 372
mixture distribution model, 382
model-based optimization, 384
multiple births, 365
risk of, 366
oocyte donation, 364
optimization problem, 385
optimum number of embryos, n*,
385, 387
patient categorization, 390
SEPS parameter
non-uniformity, 380
single embryo probability of success
parameter (SEPS), 372
treatment cycle, 372
treatment outcomes
theoretical analysis of, 390
in-vitro fertilization (IVF) treatment
binomial model of, 397
model-based analysis of, 393
inclusions, 16
independence
pairwise, 78
stochastic, 158
information
quantifying content of, 337
relation to uncertainty, 336
information content, 119
interspike intervals
distribution of, 772
interval estimate, 490
interval estimates, 506–518
difference between the two population means, 512
for regression parameters, 661
mean, unknown, 508
non-Gaussian populations, 514
variance of normal population, 510
interval estimation, 489

interval estimator, 490
inverse bivariate transformation, 182
inverse gamma random variable, 325, 550
inverse transformation, 172
Jacobian
of bivariate inverse transformation,
182
of inverse transformation, 175, 176
Janes, K. A., 1002
Johnson, N. L., 979
joint probability distribution
definition, 142
joint probability distribution function,
144
Jovanovic, L., 976
Kalman filter, 700
kamikaze pilots, 209
Kent, J. T., 982
keystone component, 910
Kholodenko, B. N., 839
Kimball, G. E., 210
Kingman, J.F.C, 68
Kleiber's law, 194, 729
Kolmogorov, 67, 98, 198
Kolmogorov-Smirnov (K-S) test, 770
test statistic, 771
Konishi, S., 970
Kotz, S., 979
Kruskal-Wallis test, 805
nonparametric one-way ANOVA,
805
kurtosis, 111
coefficient of, 111
Lagrange multiplier, 345, 346, 353
Lagrangian functional, 345
Laplace, 198, 519
Laplace transforms, 116
Lauenburger, D. A., 1002
Lauterbach, J., 837
law of large numbers, 229
least squares
estimator
properties, 660
unbiased, 660
least squares
constrained, 696

estimator, 652
method of, 653
ordinary, 654
principles of, 651, 652
recursive, 698, 699
weighted, 694
Lenth, R. V., 829
life tests, 919
accelerated tests, 919
nonreplacement tests, 919
replacement tests, 919
test statistic, 920
truncated tests, 919
life-testing, 275
likelihood function, 497
likelihood ratio tests, 616–623
linear operator, 107
log-likelihood function, 498
logarithmic series distribution, 255
logarithmic series random variable, 248
logistic distribution, 329
(standard), 326
lognormal distribution, 293
multiplicative characteristics, 294
lognormal random variable, 292297
applications, 296
central location of
median more natural, 296
mathematical characteristics, 293
probability model, 293
relationship to Gaussian random
variable, 293
loss function
quadratic, 971
Lucas, J. M., 482, 750, 963
Macchietto, S., 839
macromolecules, 113
Malaya, butterflies of, 255, 541
Mann-Whitney U test statistic, 767
Mann-Whitney-Wilcoxon (MWW) test, 766–769
nonparametric alternative to 2-sample t-test, 769
manufactured batch, 66
Mardia, K. V., 982
marginal expectation, 156
marginal probability distribution
definition, 145
marginal variance, 156, 157
Markov Chain Monte Carlo (MCMC),
527
Markov's inequality, 121
Marquardt, D.W., 698
material balance, 38
mathematical biology, 209
mathematical expectation, 102, 154
marginal, 156
of jointly distributed random variables, 154
maximum à-posteriori (MAP) estimate, 863
maximum à-posteriori (MAP) estimator, 520
maximum entropy distribution, 346
beta pdf, 351, 354
continuous uniform pdf, 348
discrete uniform pdf, 352
exponential pdf, 349, 352
gamma pdf, 354
Gaussian pdf, 350, 352
geometric pdf, 347, 352
maximum entropy models, 344–354
from general expectations, 351
maximum entropy principle, 344
maximum likelihood, 496–503
maximum likelihood estimate, 498
characteristics, 501
Gaussian population parameters,
500
in regression, 657
maximum likelihood principle, 616
mean, 437
limiting distribution of, 467
sample, 438
sampling distribution of, 465
normal approximation, 468
standard error of, 467
mean absolute deviation from the median (MADM), 440
mean-time-between-failures (MTBF), 913
median, 117, 118, 437
sample, 439
melt index, 140
Mendel's experiments
multiple traits
pairwise, 205
pea plants

characteristics, 199
genetic traits, 200
results, 207
single traits, 201
Mendel, Gregor, 199
method of moments, 493–496
method of moments estimators
properties
consistent, 496
not unique, 496
microarray
reference spot, 194
test spot, 193
microarray data, 306
fold change ratio, 194
mixture distributions
Beta-Binomial, 328, 382
application to Elsner data, 400
Poisson-Gamma, 278
mode, 117, 438
sample, 439
molecular weight
z average, Mz , 113
distributions (MWD), 113
non-uniform, 113
number average, Mn , 113
weight average, Mw , 113
molecular weight distribution, 140
moment generating function, 113
independent sums, 115
linear transformations, 115
marginal, 156
uniqueness, 114
moments, 107
kth ordinary, 107
central, 108
first, ordinary, 107
second central, 108
monomer molecules, 41
Mooney viscosity, 952
Morse, P. M., 210
multinomial random variable, 231
multivariate normal distribution, 980
multivariate probability model, 978
multivariate process monitoring, 999
multivariate random variable
definition, 141
multivariate transformations, 184
non-square, 185

overdefined, 185
underdefined, 185
square, 184
Mylar®, 192
negative binomial distribution, 233
as the Poisson-Gamma mixture distribution, 278
negative binomial random variable, 232–234
mathematical characteristics, 233
probability model, 232
alternative form, 233
Nelson, W. B., 919
nonparametric methods, 759
robust, 759
normal approximation, 468
normal distribution, see Gaussian probability distribution
normal equations, 655
matrix form of, 689
normal population, 471
normal probability plot, 735
application in factorial designs, 829
normality test
for residual errors, 678
null hypothesis, 554
Ogunnaike, B. A., 618, 721, 728, 837, 839, 929, 952, 966, 1002
one-sample sign test, 760
nonparametric alternative to one-sample t-test, 763
nonparametric alternative to one-sample z-test, 763
test statistic, 761
sampling distribution, 761
operating characteristic curve, 939
relation to power and sample size, 942
opinion pollster, 488
optimal experimental designs, 837
A-, D-, E-, G-, and V-, 838
Ospina, P., 976
outcome, 58
outcomes
equiprobable, 76
outlier, 428
overdispersion, 242
Owens, C., 976
P-chart, 954
p-value, 560
borderline, 623
observed significance level, 560
Pólya, George, 233
paper helicopter, 891
Pareto chart, 421
Pareto random variable, 328
Pareto, V. F., 421
particle size distribution, 113, 296
particulate products, 113
parts, defective, 66
Pascal distribution, 233
Pascal, Blaise, 198, 233
pdf, 100
Pearson, 18
Pearson, K., 985
percentiles, 119
phenomenological mechanisms, 3
Philadelphia Eagles, 785
point differential, 2008/2009 season, 785
points scored by, 2008/2009 season,
785
pie chart, 421
Plackett, R.L., 832
plug flow reactor (PFR), 35
point estimate, 489
point estimates
precision of, 503–506
binomial proportion, 505
mean, known, 504
mean, unknown, 505
variance, 505
point estimation, 489
Poisson distribution, 496
Poisson events, 260
Poisson model, 859
Poisson random variable, 173, 174, 236–243, 463
applications, 240
limiting form of binomial random
variable, 236
mathematical characteristics, 239
overdispersed, 242
negative binomial model, 242
probability model, 237, 239

probability model
from first principles, 237
Poisson-Gamma mixture distribution,
276–278
Pólya distribution, see negative binomial
distribution
polydispersity index (PDI), 113
polymer reactor, 143
polymer resin, 63
polymeric material, 113
polymerization
free radical, 235
polynomial regression, 701
orthogonal, 704
population, 411, 488
dichotomous, 222
posterior distribution, 520, 522, 863
Pottmann, M., 618, 929
power and sample size
computing with MINITAB, 599
pre-image, 91
prediction error, 668
prediction intervals, 668
principal component
loading vectors, 986
score vectors, 987
principal components, 985
principal components analysis (PCA),
985
application in systems biology, 1000
scale dependent, 986
Scree plot, 990
prior distribution, 520
uniform, 523
probabilistic framework, 3, 47
probability, 43, 69
à-posteriori, 77
à-priori, 77
application by Mendel, 204
bounds, 119
general lemma, 120
calculus of, 68, 69
classical à-priori, 45
conditional, 72–74, 147
equiprobable assignment of, 70
mathematical theory of, 67
relative frequency à-posteriori, 46
set function, 67, 90
bivariate, 139

induced, 90, 95
subjective, 46
theory, 414
total, 74
Theorem of, 76
probability density function
definition, 98
joint, 142
probability distribution function, 23, 43,
47, 78, 96, 100
conditional, 147
convolution of, 179
definition, 98
joint, 142
marginal, 145
probability distributions
chart of connections, 319
probability model validation, 732
probability paper, 734
probability plots, 733–739
modern, 736
probability theory, 58
application to in-vitro fertilization,
395
probability
total
Theorem of, 910
process
chemical, 12
chemical manufacturing, 2
manufacturing, 16, 43, 71
yield, 12
process control, 410, 944
process dynamics, 275, 410
process identification, 410
product law of reliabilities, 903
product law of unreliabilities, 904
product quality, 16
propagation-of-errors
application, 194
Q statistic, 1000
quality assurance, 16
quantiles, 119
quantization error, 342
quartiles, 118
random components, 11
random fluctuations, 205

random mass phenomena, 41
random phenomena, 3
random sample, 460
random variability, 14
random variable, 90, 103, 412
n-dimensional, 141, 172
bivariate, 139
continuous, 146
continuous, 94
definition, 90
discrete, 94
entropy
definition, 119
entropy of, 338
informal, 94
kurtosis, 111
mechanistic underpinnings of, 218
moments
practical applications, 113
multivariate, 164, 978
ordinary moment of, 107
realizations, 409
skewness, 109
two-dimensional, 95, 143
random variable families
Gamma family, 259
generalized model, 276
Gaussian family, 278
generalized model, 300
Ratio family, 300
random variable space, 96, 103
bivariate, 139
random variable sums
pdfs of, 177
cdf approach, 177
characteristic function approach,
177, 180
random variable transformations
bivariate, 182
continuous case, 175
discrete case, 173
general continuous case, 176
non-monotone, 188
single variable, 172
random variables
mutually stochastically independent,
161
continuous, 98
discrete, 95

inter-related, 139
negatively correlated, 158
positively correlated, 158
randomized complete block design, 805
Ray, W. H., 728, 952, 966
Rayleigh distribution, 298
relationship to Weibull distribution,
299
Rayleigh random variable, 297
application, 300
probability model, 298
regression
multiple linear, 686
matrix methods, 688
regression line, 663
regression model
mean-centered, 677
one-parameter, 653
two-parameter, 653
regression parameters, 661
rejection region, 556
relative frequency, 18, 424
relative sensitivity function, 394
reliability, 225, 900
component, 901
definition, 900
system, 901
reliability function, 911
residence time, 35
residence time distribution, 193
exponential, 38
residual
standardized, 679
residual analysis, 678–682
residual error, 658, 678
residual sum-of-squares, 662
response, 795
response surface designs, 834
application to process optimization,
879
Box-Behnken, 835
face centered cube, 835
ridge regression, 697
Riemann integral, 98
Ringner, M., 1002
Rogers, W. B., 837
Ross, P.J., 970
Ryan, T.P., 837

S-chart, 948
sample, 411
sample range, 440
sample space, 59, 61, 68, 103, 412
discrete, 59, 69
sample variance, 441
sampling distribution, 462
of single variance, 473
of two variances, 474
scatter plot, 431, 648
Schwaber, J. S., 618, 929
screening designs
fractional factorial, 822
Plackett-Burman, 833
sensitivity function, 684
set
empty, null, 61
set function
additive, 66, 67
probability, 67
set functions, 65
set operations, 61
sets, 61
algebra, 61
complement, 61
disjoint, 63, 66
intersection, 61
partitioned, 75
union, 61
Sherwood, T.K., 724
Shewhart, W. A., 947
sickle-cell anemia, 252
signal-to-noise ratio, 597
signed ranks, 763
simple system, 901
parallel configuration, 904
series configuration, 903
single factor experiments, 797–811
completely randomized design, 798
data layout, 798
model and hypothesis, 798
two-way classification, 805
randomized complete block design, 805
single proportions, exact test, 610
skewness, 109
coefficient of, 109
Snively, C. M., 837
Sohoni, M. A., 1002

Sorger, P. K., 1002
specification limits, 946
different from control limits, 946
standard Cauchy distribution, 315
standard deviation, 109
pooled, 579
sample, 441
standard normal distribution, 290, 468,
471
standard normal random variable, 290
mathematical characteristics, 292
relationship to the Chi-square random variable, 292
standard uniform random variable, 308
statistic, 461
statistical hypothesis, 554
statistical inference, 412, 470
in life testing, 919
statistical process control, 944
statistics, 415
descriptive, 415
graphical, 416
numerical, 416
inductive, 415
inferential, 415
Stirzaker, D., 130
stochastic independence, 162
definition, 160
mutual, 163
pairwise, 163
Student's t random variable, 311–314
application, 314
definition, 312
mathematical characteristics, 312
probability model, 312
Student's t-distribution, 471
sum-of-squares function, 653
survival function, 122, 911
system reliability function, 918
t-distribution, 312, 314, 471, 508
t-test, 570
one-sample, 571
paired, 586
two-sample, 579, 581
Taguchi, G., 969
Taylor series approximation, 195
Taylor series expansion, 114
Taylor, S.J., 68

temperature control system
reliability, 143
Tendulkar, A. V., 1002
test statistic, 556
theoretical distribution, 22
thermal conductivity, 455
thermocouple calibration, 194
Thermodynamics, 44
time-to-failure, 269, 275
Tobias, R., 839
total quality management, 935
transformations
monotonic, 175
non-monotonic, 176
nonlinear, 174
single variable, 172
treatment, 795
trial, 59
trinomial random variable, 230
probability model, 230
Tukey, J., 427
two-factor experiments, 811
model and hypothesis, 812
randomized complete block design,
two-way crossed, 812
two-sample tests, 576
unbiased estimator, 490
uncertainty, 11
uniform random variable, 176
uniform random variable (continuous)
application
random number generation, 309
mathematical characteristics, 308
probability model, 308
relation to other random variables,
309
special case of beta random variable, 308
universal set, 62
US legal system
like hypothesis test, 556
US population, 435
age distribution, 456
US Population data, 870
variability, 2
common cause, special cause, 945
variable

dependent, 650
discrete, 17
qualitative, 417
quantitative, 417
variable transformation
in regression, 685
variables
dependent, 795
independent, 795
nuisance, 811
variance
definition, 108
sample, 441
sampling distribution of, 473
Venn diagram, 66
von Bortkiewicz, L., 857
Wall Street Journal (WSJ), 364
Wangikar, P. P., 1002
Weibull pdf
application in failure time modeling, 914
Weibull random variable, 272–275
applications, 275
mathematical characteristics, 274
probability model, 273
relation to exponential random variable, 272
Weibull, Waloddi, 273
weighted average, 75
Welf, E. S., 361
Westphal, S. P., 364
Wilcoxon signed rank test, 763
normal approximation not recommended, 764
restricted to symmetric distributions, 763
Wilks Lambda distribution, 982
Wilks, S. S., 982
Wise, B.M., 1000
Wishart distribution, 981
multivariate generalization of χ² distribution, 981
Wishart, J., 981
World War II, 209
Xbar-R chart, 951
Yaffe, M. B., 1002

yield improvement, 12
Yule, G. U., 252
z-score, 290, 564
z-shift, 592
z-test, 563
one-sample, 566
single proportion, large sample, 608
two-sample, 577
Zisser, H., 976
Zitarelli, D. E., 210
