Random Phenomena
Fundamentals and Engineering Applications of Probability & Statistics
Ogunnaike
"I frame no hypotheses; for whatever is not deduced from the phenomena is to be called a hypothesis; and hypotheses, whether metaphysical or physical, whether of occult qualities or mechanical, have no place in experimental philosophy."

Sir Isaac Newton (1642-1727)
In Memoriam

1931-2005

Some who only search for silver and gold
Soon find what they cannot hold;
You searched after God's own heart,
and left behind, too soon, your pilgrim's chart
Preface
In writing this book, I have been particularly cognizant of these basic facts
of 21st century science and engineering. And yet while most scientists and engineers are well-trained in problem formulation and problem solving when all
the entities involved are considered deterministic in character, many remain
uncomfortable with problems involving random variations, if such problems
cannot be idealized and reduced to the more familiar deterministic types.
Even after going through the usual one-semester course in Engineering Statistics, the discomfort persists. Of all the reasons for this circumstance, the most
compelling is this: most of these students tend to perceive their training in
statistics more as a set of instructions on what to do and how to do it, than as
a training in fundamental principles of random phenomena. Such students are
then uncomfortable when they encounter problems that are not quite similar
to those covered in class; they lack the fundamentals to attack new and unfamiliar problems. The purpose of this book is to address this issue directly by
presenting basic fundamental principles, methods, and tools for formulating
and solving engineering problems that involve randomly varying phenomena.
The premise is that by emphasizing fundamentals and basic principles, and
then illustrating these with examples, the reader will be better equipped to
deal with a range of problems wider than that explicitly covered in the book.
This important point is expanded further in Chapter 0.
The core of statistics is presented in Part IV (Chapters 12-20). Chapter 12 lays the foundation with an introduction to the concepts and ideas behind statistics, before the coverage begins in earnest in Chapter 13 with sampling theory, continuing with statistical inference, estimation and hypothesis testing, in Chapters 14 and 15, and regression analysis in Chapter 16. Chapter 17 introduces the important but oft-neglected issue of probability model validation, while Chapter 18 on nonparametric methods extends the ideas of Chapters 14 and 15 to those cases where the usual probability model assumptions (mostly the normality assumption) are invalid. Chapter 19 presents an overview treatment of design of experiments. The third and final set of case studies is presented in Chapter 20 to illustrate the application of various aspects of statistics to real-life problems.
Part V (Chapters 21-23) showcases the application of probability and statistics with a hand-selected set of special topics: reliability and life testing in Chapter 21, quality assurance and control in Chapter 22, and multivariate analysis in Chapter 23. Each has roots in probability and statistics, but all have evolved into bona fide subjects in their own right.
Key Features
Before presenting suggestions of how to cover the material for various audiences, I think it is important to point out some of the key features of the
textbook.
1. Approach. This book takes a more fundamental, first-principles approach to the issue of dealing with random variability and uncertainty in engineering problems. As a result, for example, the treatment of probability distributions for random variables (Chapters 8-10) is based on a derivation of each model from phenomenological mechanisms, allowing the reader to see the subterraneous roots from which these probability models sprang. The reader is then able to see, for instance, how the Poisson model arises either as a limiting case of the binomial random variable, or from the phenomenon of observing, in finite-sized intervals of time or space, rare events with low probabilities of occurrence; or how the Gaussian model arises from an accumulation of small random perturbations.
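Neither mechanism need be taken on faith; both are easy to check numerically. The short sketch below is not from the text: it is a minimal Python/NumPy illustration, with sample sizes and parameter values chosen arbitrarily, of the two limiting behaviors just described.

    import numpy as np

    rng = np.random.default_rng(1)

    # Poisson as a limiting case of the binomial: n large, p small, np moderate.
    n, p = 10_000, 0.0005                      # lambda = n*p = 5
    binom = rng.binomial(n, p, size=100_000)
    poisson = rng.poisson(n * p, size=100_000)
    for k in range(8):                         # the empirical pmfs nearly coincide
        print(k, np.mean(binom == k), np.mean(poisson == k))

    # Gaussian from an accumulation of small random perturbations (the CLT):
    # each observation is the sum of 1,000 tiny, uniformly distributed shocks.
    sums = rng.uniform(-0.01, 0.01, size=(10_000, 1_000)).sum(axis=1)
    print(np.mean(sums), np.std(sums))         # approx. 0 and 0.01*sqrt(1000/3)

A histogram of sums is visibly bell-shaped even though each individual shock is uniform, which is precisely the mechanism by which the Gaussian model arises.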
2. Examples and Case Studies. The fundamental approach noted above is integrated with practical applications in the form of a generous number of examples, and also with the inclusion of three chapter-length application case studies, one each for probability, probability distributions, and statistics. In addition to the usual traditional staples, many of the in-chapter examples have been drawn from non-traditional applications in molecular biology (e.g., DNA replication origin distributions; gene expression data, etc.), from finance and business, and from population demographics.
3. Computers, Computer Software, On-line Resources. As expanded further in the Appendix, the availability of computers has transformed the teaching and learning of probability and statistics. Statistical software packages are now so widely available that many of what used to be staples of traditional probability and statistics textbooks (tricks for carrying out various computations, approximation techniques, and especially printed statistical tables) are now essentially obsolete. All the examples in this book were carried out with MINITAB, and I fully expect each student and instructor to have access to one such statistical package. In this book, therefore, we depart from tradition and do not include any statistical tables. Instead, we have included in the Appendix a compilation of useful information about some popular software packages, on-line electronic versions of statistical tables, and a few other on-line resources such as on-line electronic statistics handbooks, and websites with data sets.
4. Questions, Exercises, Application Problems, Projects. No one feels truly confident about a subject matter without having tackled (and solved!) some problems; and a useful textbook ought to provide a good selection that offers a broad range of challenges. Here is what is available in this book:

Review Questions: Found at the end of each chapter (with the exception of the chapters on case studies), these are short, specific questions designed to test the reader's basic comprehension. If you can answer all the review questions at the end of each chapter, you know and understand the material; if not, revisit the relevant portion and rectify the revealed deficiency.
Exercises: are designed to provide the opportunity to master the mechanics behind a single concept. Some may therefore be purely mechanical, in the sense of requiring basic computations; some may require filling in the steps deliberately left as an exercise to the reader; some may have the flavor of an application; but the focus is usually a single aspect of a topic covered in the text, or a straightforward extension thereof.
Application Problems: are more substantial practical problems whose
solutions usually require integrating various concepts (some obvious,
some not) and deploying the appropriate set of tools. Many of these are
drawn from the literature and involve real applications and actual data
sets. In such cases, the references are provided, and the reader may wish
to consult some of them for additional background and perspective, if
necessary.
Project assignments: allow deeper exploration of a few selected issues covered in a chapter, mostly as a way of extending the coverage and also to provide opportunities for creativity. By definition, these involve a significant amount of work and also require report-writing. This book offers a total of nine such projects. They are a good way for students to learn how to plan, design, and execute projects and to develop writing and reporting skills. (Each graduate student who has taken the CHEG 604 and CHEG 867 courses at the University of Delaware has had to do a term project of this type.)
5. Data Sets. All the data sets used in each chapter, whether in the chapter
itself, in an example, or in the exercises or application problems, are made
available on-line and on CD.
Suggested Coverage
Of the three categories mentioned earlier, a methodical coverage of the entire textbook is only possible for Category I, in a two-semester undergraduate
sequence. For this group, the following is one possible approach to dividing
the material up into instruction modules for each semester:
First Semester

Module 1 (Foundations): Chapters 0-2.
Module 2 (Probability): Chapters 3, 4, 5 and 7.
Module 3 (Probability Models): Chapter 8¹ (omit detailed derivations and Section 8.7.2), Chapter 9¹ (omit detailed derivations), and Chapter 11¹ (cover Sections 11.4 and 11.5 selectively; omit Section 11.6).
Module 4 (Introduction to Statistics/Sampling): Chapters 12 and 13.
Module 5 (Statistical Inference): Chapter 14¹ (omit Section 14.6), Chapter 15¹ (omit Sections 15.8 and 15.9), Chapter 16¹ (omit Sections 16.4.3, 16.4.4, and 16.5.2), and Chapter 17.
Module 6 (Design of Experiments): Chapter 19¹ (cover Sections 19.3-19.4 lightly; omit Section 19.10) and Chapter 20.
Second Semester

Module 7 (Probability and Models): Chapter 6 (with ad hoc reference to Chapters 4 and 5); Chapters 8² and 9² (include details omitted in the first semester), and Chapter 10.
Module 8 (Statistical Inference): Chapter 14² (Bayesian estimation, Section 14.6), Chapter 15² (Sections 15.8 and 15.9), Chapter 16² (Sections 16.4.3, 16.4.4, and 16.5.2), and Chapter 18.
Module 9 (Applications): Select one of Chapters 21, 22 or 23. (For chemical engineers, and anyone planning to work in the manufacturing industry, I recommend Chapter 22.)
With this as a basic template, other variations can be designed as appropriate.
For example, those who can only afford one semester (Category II) may adopt the first-semester suggestion above, to which I recommend adding Chapter 22 at the end.
The beginning graduate one-semester course (Category III) may also be based on the first-semester suggestion above, but with the following additional recommendations: (i) cover all of the recommended chapters fully; (ii) add Chapter 23 on multivariate analysis; and (iii) in lieu of a final examination, assign at least one, possibly two, of the nine projects. This will make for a hectic semester, but graduate students should be able to handle the workload.
A second, perhaps more straightforward, recommendation for a two-semester sequence is to devote the first semester to Probability (Chapters 0-11), and the second to Statistics (Chapters 12-20) along with one of the three application chapters.
Acknowledgments
Pulling off a project of this magnitude requires the support and generous assistance of many colleagues, students, and family. Their genuine words of encouragement and the occasional (innocent and not-so-innocent) inquiry about the status of the book all contributed to making sure that this potentially endless project was actually finished. At the risk of leaving someone out, I feel some deserve particular mention. I begin with, in alphabetical order, Marc Birtwistle, Ketan Detroja, Claudio Gelmi (Chile), Mary McDonald, Vinay Prasad (Alberta, Canada), Paul Taylor (AIMS, Muizenberg, South Africa), and Carissa Young. These are colleagues, former and current students, and postdocs, who patiently waded through many versions of various chapters, offered invaluable comments and caught many of the manuscript errors, typographical and otherwise. It is a safe bet that the manuscript still contains a random number of these errors (few and Poisson distributed, I hope!) but whatever errors remain are my responsibility. I encourage readers to let me know of the ones they find.
I wish to thank my University of Delaware colleagues, Antony Beris and especially Dion Vlachos, with whom I often shared the responsibility of teaching CHEG 867 to beginning graduate students. Their insight into what the statistics component of the course should contain was invaluable (as were the occasional Greek lessons!). Of my other colleagues, I want to thank Dennis Williams of Basel, for his interest and comments, and then single out former fellow "DuPonters" Mike Piovoso, whose fingerprint is recognizable on the illustrative example of Chapter 23, Rafi Sela, now a Six-Sigma Master Black Belt, Mike Deaton of James Madison University, and Ron Pearson, whose near-encyclopedic knowledge never ceases to amaze me. Many of the ideas, problems and approaches evident in this book arose from those discussions and collaborations from many years ago. Of my other academic colleagues, I wish to thank Carl Laird of Texas A&M for reading some of the chapters, Joe Qin of USC for various suggestions, and Jim Rawlings of Wisconsin, with whom I have carried on a long-running discussion about probability and estimation because of his own interests and expertise in this area. David Bacon and John MacGregor, pioneers in the application of statistics and probability in chemical engineering, deserve my thanks for their early encouragement about the project and for providing the occasional commentary. I also wish to take this opportunity to acknowledge the influence and encouragement of my chemical engineering mentor, Harmon Ray. I learned more from Harmon than he probably knew he was teaching me. Much of what is in this book carries an echo of his voice and reflects the Wisconsin tradition.
List of Figures
9.15 Two uniform distributions over different ranges (0,1) and (2,10). Since the total area under the pdf must be 1, the narrower pdf is proportionately taller than the wider one.
9.16 Two F distribution plots for different values for ν₁, the first degree of freedom, but the same value for ν₂. Note how the mode shifts to the right as ν₁ increases.
9.17 Three t-distribution plots for degrees of freedom values ν = 5, 10, 100. Note the symmetrical shape and the heavier tail for smaller values of ν.
9.18 A comparison of the t-distribution with ν = 5 with the standard normal N(0,1) distribution. Note the similarity as well as the t-distribution's comparatively heavier tail.
9.19 A comparison of the t-distribution with ν = 50 with the standard normal N(0,1) distribution. The two distributions are practically indistinguishable.
9.20 A comparison of the standard Cauchy distribution with the standard normal N(0,1) distribution. Note the general similarities as well as the Cauchy distribution's substantially heavier tail.
9.21 Common probability distributions and connections among them.
11.15 Relative sensitivity of the binomial model-derived n to errors in estimates of p, as a function of p.
12.1 Relating the tools of Probability, Statistics and Design of Experiments to the concepts of Population and Sample.
12.2 Bar chart of welding injuries from Table 12.1.
12.3 Bar chart of welding injuries arranged in decreasing order of number of injuries.
12.4 Pareto chart of welding injuries.
12.5 Pie chart of welding injuries.
12.6 Bar chart of frozen ready meals sold in France in 2002.
12.7 Pie chart of frozen ready meals sold in France in 2002.
12.8 Histogram for YA data of Chapter 1.
12.9 Frequency polygon of YA data of Chapter 1.
12.10 Frequency polygon of YB data of Chapter 1.
12.11 Box plot of the chemical process yield data YA, YB of Chapter 1.
12.12 Box plot of random N(0,1) data: original set, and with added "outlier".
12.13 Box plot of raisins dispensed by five different machines.
12.14 Scatter plot of cranial circumference versus finger length: The plot shows no real relationship between these variables.
12.15 Scatter plot of city gas mileage versus highway gas mileage for various two-seater automobiles: The plot shows a strong positive linear relationship, but no causality is implied.
12.16 Scatter plot of highway gas mileage versus engine capacity for various two-seater automobiles: The plot shows a negative linear relationship. Note the two unusually high mileage values associated with engine capacities 7.0 and 8.4 liters, identified as belonging to the Chevrolet Corvette and the Dodge Viper, respectively.
12.17 Scatter plot of highway gas mileage versus number of cylinders for various two-seater automobiles: The plot shows a negative linear relationship.
12.18 Scatter plot of US population every ten years since the 1790 census versus census year: The plot shows a strong non-linear trend, with very little scatter, indicative of systematic, approximately exponential growth.
12.19 Scatter plot of Y1 and X1 from Anscombe data set 1.
12.20 Scatter plot of Y2 and X2 from Anscombe data set 2.
12.21 Scatter plot of Y3 and X3 from Anscombe data set 3.
12.22 Scatter plot of Y4 and X4 from Anscombe data set 4.
13.3 Sampling distribution of the mean diameter of ball bearings in Example 13.4 used to compute P(|X̄ − 10| ≥ 0.14) = P(|T| ≥ 0.62).
13.5 Sampling distribution for the two variances of ball bearing diameters in Example 13.6 used to compute P(F ≥ 1.41) + P(F ≤ 0.709).
Sampling distribution for X̄/β, based on a sample of size n = 10 from an exponential population.
Sampling distribution for X̄/β, based on a larger sample of size n = 100 from an exponential population.
15.1 A distribution for the null hypothesis, H0, in terms of the test statistic, QT, where the shaded rejection region, QT > q, indicates a significance level, α.
15.6 Box plot for "Method A" scores including the null hypothesis mean, H0: μ = 75, shown along with the sample average, x̄, and the 95% confidence interval based on the t-distribution with 9 degrees of freedom. Note how the upper bound of the 95% confidence interval lies to the left of, and does not touch, the postulated H0 value.
15.7 Box plot for "Method B" scores including the null hypothesis mean, H0: μ = 75, shown along with the sample average, x̄, and the 95% confidence interval based on the t-distribution with 9 degrees of freedom. Note how the 95% confidence interval includes the postulated H0 value.
15.8 Box plot of differences between the "Before" and "After" weights, including a 95% confidence interval for the mean difference, and the hypothesized H0 point, δ0 = 0.
15.9 Box plot of the "Before" and "After" weights including individual data means. Notice the wide range of each data set.
15.10 A plot of the "Before" and "After" weights for each patient. Note how one data sequence is almost perfectly correlated with the other; in addition, note the relatively large variability intrinsic in each data set compared to the difference between each point.
15.12 β and power values for the hypothesis test of Fig 15.11 with Ha: N(2.5, 1). Top: β; Bottom: Power = (1 − β).
15.13 Rejection regions for one-sided tests of a single variance of a normal population, at a significance level of α = 0.05, based on n = 10 samples. The distribution is χ²(9); Top: for Ha: σ² > σ0², indicating rejection of H0 if c² > χ²_α(9) = 16.9; Bottom: for Ha: σ² < σ0², indicating rejection of H0 if c² < χ²_{1−α}(9) = 3.33.
15.15 Rejection regions for the two-sided tests of the equality of the variances of the process A and process B yield data, i.e., H0: σ²_A = σ²_B, at a significance level of α = 0.05, based on n = 50 samples each. The distribution is F(49, 49), with the rejection region shaded; since the test statistic, f = 0.27, falls within the rejection region to the left, we reject H0 in favor of Ha.
16.3 The Gaussian assumption regarding variability around the true regression line, giving rise to ε ~ N(0, σ²): The 6 points represent the data at x1, x2, ..., x6; the solid straight line is the true regression line, which passes through the mean of the sequence of the indicated Gaussian distributions.
16.4 The fitted straight line to the Density versus Ethanol Weight % data: The additional terms included in the graph, S, R-Sq and R-Sq(adj), are discussed later.
16.5 The fitted regression line to the Density versus Ethanol Weight % data (solid line) along with the 95% confidence interval (dashed line). The confidence interval is narrowest at x = x̄ and widens for values further away from x̄.
16.6 The fitted straight line to the Cranial circumference versus Finger length data. Note how the data points are widely scattered around the fitted regression line. (The additional terms included in the graph, S, R-Sq and R-Sq(adj), are discussed later.)
16.7 The fitted straight line to the Highway MPG versus Engine Capacity data of Table 12.5 (leaving out the two "inconsistent" data points) along with the 95% confidence interval (long dashed line) and the 95% prediction interval (short dashed line). (Again, the additional terms included in the graph, S, R-Sq and R-Sq(adj), are discussed later.)
16.13 Modeling the dependence of the boiling points (BP) of hydrocarbon compounds in Table 16.1 on the number of carbon atoms in the compound: Top: Fitted cubic curve of BP versus n, the number of carbon atoms; Bottom: standardized residuals versus fitted value ŷi. There appears to be little or no systematic structure left in the residuals, suggesting that the cubic model provides an adequate description of the data.
17.1 Probability plots for safety data postulated to be exponentially distributed, each showing (a) rank-ordered data; (b) theoretical fitted cumulative probability distribution line along with associated 95% confidence intervals; (c) a list of summary statistics, including the p-value associated with a formal goodness-of-fit test. The indication from the p-values is that there is no evidence to reject H0; therefore the model appears to be adequate.
17.4 Normal probability plot for the residuals of the regression analysis of the dependence of thermal conductivity, k, on Temperature in Example 16.5. The postulated model, a two-parameter regression model with Gaussian distributed zero-mean errors, appears valid.
17.5 Chi-squared test results for inclusions data and a postulated Poisson model. Top panel: Bar chart of "Expected" and "Observed" frequencies, which shows how well the model prediction matches observed data; Bottom panel: Bar chart of contributions to the Chi-squared statistic, showing that the group of 3 or more inclusions is responsible for the largest model-observation discrepancy, by a wide margin.
18.2 Probability plot of interspike intervals data with postulated Gamma model and Anderson-Darling test for the pyramidal tract cell of a monkey. Top panel: when awake (PT-W); Bottom panel: when asleep (PT-S). The p-values for the A-D tests indicate no evidence to reject the null hypothesis.
19.3 Normal probability plots of the residuals from the one-way classification ANOVA model in Example 19.1. Top panel: Plot obtained directly from the ANOVA analysis, which does not provide any test statistic or significance level; Bottom panel: Subsequent goodness-of-fit test carried out on saved residuals; note the high p-value associated with the A-D test.
19.6 2² factorial design for factors A and B showing the four experimental points; − represents "low" values, + represents "high" values for each factor.
19.8 Normal probability plot for the effects, using Lenth's method to identify A, D and AD as significant.
19.9 Normal probability plot for the residuals of the Etch rate model in Eq (19.46) obtained upon projection of the experimental data to retain only the significant terms A, Gap (x1), D, Power (x2), and the interaction AD, Gap*Power (x1x2).
19.11 The 3-factor Box-Behnken response surface design and its constituent parts: X1, X2: 2² factorial points moved to the center of X3 to give the darker shaded circles at the edge-centers of the X3 axes; X2, X3: 2² factorial points moved to the center of X1 to give the lighter shaded circles at the edge-centers of the X1 axes; X1, X3: 2² factorial points moved to the center of X2 to give the solid circles at the edge-centers of the X2 axes; the center point, open circle.
20.1 Chi-squared test results for Prussian army death-by-horse-kicks data and a postulated Poisson model. Top panel: Bar chart of "Expected" and "Observed" frequencies; Bottom panel: Bar chart of contributions to the Chi-squared statistic.
20.2 Initial prior distribution, a Gamma(2, 0.5), used to obtain a Bayesian estimate for the Poisson mean number of deaths per unit-year parameter.
20.3 Recursive Bayesian estimates using yearly data sequentially, compared with the standard maximum likelihood estimate, 0.61 (dashed line).
20.4 Final posterior distribution (dashed line) along with initial prior distribution (solid line).
20.5 Quadratic regression model fit to US Population data along with both the 95% confidence interval and the 95% prediction interval.
20.6 Standardized residuals from the regression model fit to US Population data. Top panel: Residuals versus observation order; Bottom panel: Normal probability plot. Note the left-over pattern indicative of serial correlation, and the unusual observations identified for the 1940 and 1950 census years in the top panel; note also the general deviation of the residuals from the theoretical normal probability distribution line in the bottom panel.
20.7 Percent average relative population growth rate in the US for each census year from 1800-2000, divided into three equal 70-year periods. Period 1: 1800-1860; Period 2: 1870-1930; Period 3: 1940-2000.
20.8 Normal probability plot for the residuals from the ANOVA model for Percent average relative population growth rate versus Period, with Period 1: 1800-1860; Period 2: 1870-1930; Period 3: 1940-2000.
20.9 Standardized residual plots for the Yield response surface model: versus fitted value, and normal probability plot.
20.10 Standardized residual plots for the Adhesion response surface model: versus fitted value, and normal probability plot.
20.11 Response surface and contour plots for Yield as a function of Additive and Temperature (with Time held at 60.00).
20.12 Response surface and contour plots for Adhesion as a function of Additive and Temperature (with Time held at 60.00).
20.13 Overlaid contours for Yield and Adhesion showing feasible region for desired optimum. The planted "flag" indicates the optimum values of the responses along with the corresponding setting of the factors Additive and Temperature (with Time held at 60.00) that achieve this optimum.
20.14 Schematic diagram of folded helicopter prototype.
20.15 Paper helicopter prototype.
21.3 Sampling-analyzer system: basic configuration.
21.4 Sampling-analyzer system: configuration with redundant solenoid valve.
21.5 Fluid flow system with a cross link.
21.6 Typical failure rate (hazard function) curve showing the classic three distinct characteristic periods in the lifetime distributions of a population of items.
22.1 OC curve for a lot size of 1000, sample size of 32 and acceptance number of 3: AQL is the acceptance quality level; RQL is the rejection quality level.
22.2 OC curve for a lot size of 1000, generated for a sampling plan for an AQL = 0.004 and an RQL = 0.02, leading to a required sample size of 333 and acceptance number of 3. Compare with the OC curve in Fig 22.1.
22.3 A generic SPC chart for the generic process variable Y indicating a sixth data point that is out of limits.
22.4 The X-bar chart for the average length measurements for 6-inch nails determined from samples of three measurements obtained every 5 mins.
22.5 The S-chart for the 6-inch nails process data of Example 22.2.
22.6 The combination Xbar-R chart for the 6-inch nails process data of Example 22.2.
22.7 The combination I-MR chart for the Mooney viscosity data.
22.8 P-chart for the data on defective mechanical pencils: note the 9th observation that is outside the UCL.
22.9 C-chart for the inclusions data presented in Chapter 1, Table 1.2, and discussed in subsequent chapters: note the 33rd observation that is outside the UCL; otherwise, the process appears to be operating in statistical control.
22.10 Time series plot of the original Mooney viscosity data of Fig 22.7 and Table 22.2, and of the shifted version showing a step increase of 0.7 after sample 15.
22.11 I-chart for the shifted Mooney viscosity data. Even with σ = 0.5, it is not sensitive enough to detect the step change of 0.7 introduced after sample 15.
22.12 Two one-sided CUSUM charts for the shifted Mooney viscosity data. The upper chart uses dots; the lower chart uses diamonds; the non-conforming points are represented with the squares. With the same σ = 0.5, the step change of 0.7 introduced after sample 15 is identified after sample 18. Compare with the I-chart in Fig 22.11.
22.13 Two one-sided CUSUM charts for the original Mooney viscosity data using the same characteristics as those in Fig 22.12. The upper chart uses dots; the lower chart uses diamonds; there are no non-conforming points.
22.14 EWMA chart for the shifted Mooney viscosity data, with w = 0.2. Note the "staircase" shape of the control limits for the earlier data points. With the same σ = 0.5, the step change of 0.7 introduced after sample 15 is detected after sample 18. The non-conforming points are represented with the squares. Compare with the I-chart in Fig 22.11 and the CUSUM charts in Fig 22.12.
22.15 The EWMA chart for the original Mooney viscosity data using the same characteristics as in Fig 22.14. There are no non-conforming points.
23.1 Examples of the bivariate Gaussian distribution where the two random variables are uncorrelated (ρ = 0) and strongly positively correlated (ρ = 0.9).
23.5 Plot of the scores and loading for the second principal component. The distinct trend indicated in the scores should be interpreted along with the loadings, by comparison to the full original data set in Fig 23.2.
23.6 Scores and loading plots for the first two components. Top panel: Scores plot indicates a quadratic relationship between the two scores t1 and t2; Bottom panel: Loading vector plot indicates that in the new set of coordinates, the original variables contain mostly "pure" components PC1 and PC2, indicated by a distinctive North/South and West/East alignment of the data vectors, with like variables clustered together according to the nature of the component contributions. Compare to the full original data set in Fig 23.2.
23.8 Control limits for Q and T² for process data represented with two principal components.
List of Tables
10.1 Summary of maximum entropy probability models.
11.1 Theoretical distribution of probabilities of possible outcomes of an IVF treatment.
11.2 Elsner, et al., data of outcomes of a 42-month IVF treatment study.
11.3 Binomial model prediction of Elsner, et al., data in Table 11.2.
11.4 Elsner data stratified by age indicating variability in the probability of "success" estimates.
11.5 Stratified binomial model prediction of Elsner, et al., data.
12.1 Number and type of injuries incurred by welders in the USA from 1980-1989.
12.2 Frozen ready meals in France, in 2002.
12.3 Group classification and frequencies for YA data.
12.4 Number of raisins dispensed into trial-sized Raisin Bran cereal boxes.
12.5 Gasoline mileage ratings for a collection of two-seater automobiles.
12.6 Descriptive statistics for yield data sets YA and YB.
12.7 The Anscombe data set 1.
12.8 The Anscombe data sets 2, 3, and 4.
16.3 Density and weight percent of ethanol in ethanol-water mixture: model fit and residual errors.
16.4 Cranial circumference and finger lengths.
16.5 ANOVA Table for Testing Significance of Regression.
16.6 Thermal conductivity measurements at various temperatures for a metal.
16.7 Laboratory experimental data on Yield.
17.1 Table of values for safety data probability plot.
18.1 A professor's teaching evaluation scores organized by student type.
18.2 Interspike intervals data.
18.3 Summary of Selected Nonparametric Tests and their Characteristics.
19.1 Data table for typical single-factor experiment.
19.2 One-Way Classification ANOVA Table.
19.3 Data table for typical single-factor, two-way classification, experiment.
19.4 Two-Way Classification ANOVA Table.
19.5 Data table for typical two-factor experiment.
19.6 Two-factor ANOVA Table.
Contents
0 Prelude
0.1 Approach Philosophy
0.2 Four basic principles
0.3 Summary and Conclusions

I Foundations

II Probability
3 Fundamentals of Probability Theory
3.1 Building Blocks
3.2 Operations
3.2.1 Events, Sets and Set Operations
3.2.2 Set Functions
3.2.3 Probability Set Function
3.2.4 Final considerations
3.3 Probability
3.3.1 The Calculus of Probability
3.3.2 Implications
3.4 Conditional Probability
3.4.1 Illustrating the Concept
3.4.2 Formalizing the Concept
3.4.3 Total Probability
3.4.4 Bayes' Rule
3.5 Independence
3.6 Summary and Conclusions
5 Multidimensional Random Variables
5.1 Introduction and Definitions
5.1.1 Perspectives
5.1.2 2-Dimensional (Bivariate) Random Variables
5.1.3 Higher-Dimensional (Multivariate) Random Variables
5.2 Distributions of Several Random Variables
5.2.1 Joint Distributions
5.2.2 Marginal Distributions
5.2.3 Conditional Distributions
5.2.4 General Extensions
5.3 Distributional Characteristics of Jointly Distributed Random Variables
5.3.1 Expectations
5.3.2 Covariance and Correlation
5.3.3 Independence
5.4 Summary and Conclusions
6 Random Variable Transformations
6.1 Introduction and Problem Definition
6.2 Single Variable Transformations
6.2.1 Discrete Case
6.2.2 Continuous Case
6.2.3 General Continuous Case
6.2.4 Random Variable Sums
6.3 Bivariate Transformations
6.4 General Multivariate Transformations
6.4.1 Square Transformations
6.4.2 Non-Square Transformations
6.4.3 Non-Monotone Transformations
6.5 Summary and Conclusions
III Distributions
11.4.2 Binomial Model versus Clinical Data
11.5 Problem Solution: Model-based IVF Optimization and Analysis
11.5.1 Optimization
11.5.2 Model-based Analysis
11.5.3 Patient Categorization and Theoretical Analysis of Treatment Outcomes
11.6 Sensitivity Analysis
11.6.1 General Discussion
11.6.2 Theoretical Sensitivity Analysis
11.7 Summary and Conclusions
11.7.1 Final Wrap-up
11.7.2 Conclusions and Perspectives on Previous Studies and Guidelines
IV Statistics
12 Introduction to Statistics
12.1 From Probability to Statistics
12.1.1 Random Phenomena and Finite Data Sets
12.1.2 Finite Data Sets and Statistical Analysis
12.1.3 Probability, Statistics and Design of Experiments
12.1.4 Statistical Analysis
12.2 Variable and Data Types
12.3 Graphical Methods of Descriptive Statistics
12.3.1 Bar Charts and Pie Charts
12.3.2 Frequency Distributions
12.3.3 Box Plots
12.3.4 Scatter Plots
12.4 Numerical Descriptions
12.4.1 Theoretical Measures of Central Tendency
12.4.2 Measures of Central Tendency: Sample Equivalents
12.4.3 Measures of Variability
12.4.4 Supplementing Numerics with Graphics
12.5 Summary and Conclusions
13 Sampling
13.1 Introductory Concepts
13.1.1 The Random Sample
13.1.2 The "Statistic" and its Distribution
13.2 The Distribution of Functions of Random Variables
13.2.1 General Overview
13.2.2 Some Important Sampling Distribution Results
13.3 Sampling Distribution of The Mean
13.3.1 Underlying Probability Distribution Known
13.3.2 Underlying Probability Distribution Unknown
13.3.3 Limiting Distribution of the Mean
13.3.4 σ Unknown
13.4 Sampling Distribution of the Variance
13.5 Summary and Conclusions
14 Estimation
14.1 Introductory Concepts
14.1.1 An Illustration
14.1.2 Problem Definition and Key Concepts
14.2 Criteria for Selecting Estimators
14.2.1 Unbiasedness
14.2.2 Efficiency
14.2.3 Consistency
14.3 Point Estimation Methods
14.3.1 Method of Moments
14.3.2 Maximum Likelihood
14.4 Precision of Point Estimates
14.5 Interval Estimates
14.5.1 General Principles
14.5.2 Mean of a Normal Population; σ Known
14.5.3 Mean of a Normal Population; σ Unknown
14.5.4 Variance of a Normal Population
14.5.5 Difference of Two Normal Populations' Means
14.5.6 Interval Estimates for Parameters from other Populations
14.6 Bayesian Estimation
14.6.1 Background
14.6.2 Basic Concept
14.6.3 Bayesian Estimation Results
14.6.4 A Simple Illustration
14.6.5 Discussion
14.7 Summary and Conclusions
15 Hypothesis Testing
15.1 Introduction
15.2 Basic Concepts
15.2.1 Terminology and Definitions
15.2.2 General Procedure
15.3 Concerning Single Mean of a Normal Population
15.3.1 σ Known; the "z-test"
15.3.2 σ Unknown; the "t-test"
15.3.3 Confidence Intervals and Hypothesis Tests
15.4 Concerning Two Normal Population Means
15.4.1 Population Standard Deviations Known
15.4.2 Population Standard Deviations Unknown
15.4.3 Paired Differences
15.5 Determining β, Power, and Sample Size
15.5.1 β and Power
15.5.2 Sample Size
15.5.3 β and Power for Lower-Tailed and Two-Sided Tests
15.5.4 General Power and Sample Size Considerations
15.6 Concerning Variances of Normal Populations
15.6.1 Single Variance
15.6.2 Two Variances
15.7 Concerning Proportions
15.7.1 Single Population Proportion
15.7.2 Two Population Proportions
15.8 Concerning Non-Gaussian Populations
15.8.1 Large Sample Test for Means
15.8.2 Small Sample Tests
15.9 Likelihood Ratio Tests
15.9.1 General Principles
15.9.2 Special Cases
15.9.3 Asymptotic Distribution for λ
15.10 Discussion
15.11 Summary and Conclusions
16 Regression Analysis
16.1 Introductory Concepts
16.1.1 Dependent and Independent Variables
16.1.2 The Principle of Least Squares
16.2 Simple Linear Regression
16.2.1 One-Parameter Model
16.2.2 Two-Parameter Model
16.2.3 Properties of OLS Estimators
16.2.4 Confidence Intervals
16.2.5 Hypothesis Testing
16.2.6 Prediction and Prediction Intervals
16.2.7 Coefficient of Determination and the F-Test
16.2.8 Relation to the Correlation Coefficient
16.2.9 Mean-Centered Model
16.2.10 Residual Analysis
16.3 "Intrinsically" Linear Regression
16.3.1 Linearity in Regression Models
16.3.2 Variable Transformations
16.4 Multiple Linear Regression
16.4.1 General Least Squares
16.4.2 Matrix Methods
16.4.3 Some Important Special Cases
16.4.4 Recursive Least Squares
16.5 Polynomial Regression
16.5.1 General Considerations
16.5.2 Orthogonal Polynomial Regression
16.6 Summary and Conclusions
18 Nonparametric Methods
18.1 Introduction
18.2 Single Population
18.2.1 One-Sample Sign Test
18.2.2 One-Sample Wilcoxon Signed Rank Test
18.3 Two Populations
18.3.1 Two-Sample Paired Test
18.3.2 Mann-Whitney-Wilcoxon Test
18.4 Probability Model Validation
18.4.1 The Kolmogorov-Smirnov Test
18.4.2 The Anderson-Darling Test
18.5 A Comprehensive Illustrative Example
18.5.1 Probability Model Postulate and Validation
18.5.2 Mann-Whitney-Wilcoxon Test
18.6 Summary and Conclusions
19 Design of Experiments
19.1 Introductory Concepts
19.1.1 Experimental Studies and Design
19.1.2 Phases of Efficient Experimental Studies
19.1.3 Problem Definition and Terminology
19.2 Analysis of Variance
19.3 Single Factor Experiments
19.3.1 One-Way Classification
19.3.2 Kruskal-Wallis Nonparametric Test
19.3.3 Two-Way Classification
19.3.4 Other Extensions
19.4 Two-Factor Experiments
19.5 General Multi-factor Experiments
19.6 2ᵏ Factorial Experiments and Design
19.6.1 Overview
19.6.2 Design and Analysis
19.6.3 Procedure
19.6.4 Closing Remarks
19.7 Screening Designs: Fractional Factorial
19.7.1 Rationale
19.7.2 Illustrating the Mechanics
19.7.3 General characteristics
19.7.4 Design and Analysis
19.7.5 A Practical Illustrative Example
19.8 Screening Designs: Plackett-Burman
19.8.1 Primary Characteristics
19.8.2 Design and Analysis
19.9 Response Surface Designs
19.9.1 Characteristics
19.9.2 Response Surface Designs
19.9.3 Design and Analysis
19.10 Introduction to Optimal Designs
19.10.1 Background
19.10.2 "Alphabetic" Optimal Designs
19.11 Summary and Conclusions
V Applications
21 Reliability and Life Testing
21.1 Introduction
21.2 System Reliability
21.2.1 Simple Systems
21.2.2 Complex Systems
21.3 System Lifetime and Failure-Time Distributions
21.3.1 Characterizing Time-to-Failure
21.3.2 Probability Models for Distribution of Failure Times
21.4 The Exponential Reliability Model
21.4.1 Component Characteristics
21.4.2 Series Configuration
21.4.3 Parallel Configuration
21.4.4 m-of-n Parallel Systems
21.5 The Weibull Reliability Model
21.6 Life Testing
21.6.1 The Exponential Model
21.6.2 The Weibull Model
21.7 Summary and Conclusions
23.1.4 Hotelling's T-Squared Distribution
23.1.5 The Wilks Lambda Distribution
23.1.6 The Dirichlet Distribution
23.2 Multivariate Data Analysis
23.3 Principal Components Analysis
23.3.1 Basic Principles of PCA
23.3.2 Main Characteristics of PCA
23.3.3 Illustrative example
23.3.4 Other Applications of PCA
23.4 Summary and Conclusions
Appendix

Index
Chapter 0

Prelude

0.1 Approach Philosophy
0.2 Four basic principles
0.3 Summary and Conclusions
From weather forecasts and life insurance premiums for non-smokers to clinical
tests of experimental drugs and defect rates in manufacturing facilities, and
in numerous other ways, randomly varying phenomena exert a subtle but pervasive influence on everyday life. In most cases, one can be blissfully ignorant of the true implications of the presence of such phenomena without consequence. In science and engineering, however, the influence of randomly varying phenomena can be such that even apparently simple problems can become dramatically complicated by the presence of random variability, demanding special methods and analysis tools for obtaining valid and useful solutions.
The primary aim of this book is to provide the reader with the basic fundamental principles, methods, and tools for formulating and
solving engineering problems involving randomly varying phenomena.
Since this aim can be achieved in several different ways, this chapter is devoted to presenting this book's approach philosophy.
0.1 Approach Philosophy
Most problems encountered in science and engineering involve randomly varying phenomena of one sort or another; and the vast majority of such problems cannot always be idealized and reduced to the more familiar deterministic types without destroying the very essence of the problem. For
example, in determining which of two catalysts A or B provides the greater
yield in a chemical manufacturing process, it is well-known that the respective yields YA and YB, as observed experimentally, are randomly varying quantities. Chapter 1 presents a full-scale discussion of this problem. For now, we simply note that with catalyst A, fifty different experiments performed under essentially identical conditions will result in fifty different values (realizations) for YA. Similarly for catalyst B, one obtains fifty distinct values for YB from fifty different experiments replicated under identical conditions. The first 10
experimental data points for this example are shown in the table below.
YA % YB %
74.04 75.75
75.29 68.41
75.62 74.19
75.91 68.10
77.21 68.10
75.07 69.23
74.23 70.14
74.92 69.22
76.57 74.17
77.77 70.23
Observe that because of the variability inherent in the data, some of the YA values are greater than some of the YB values; but the converse is also true: some YB values are greater than some YA values. So how does one determine, reliably and confidently, which catalyst (if any) really provides the greater yield? Clearly, special methods and analysis tools are required for handling this apparently simple problem: the deterministic idealization of comparing a single observed value of YA (say the first entry, 74.04) with a corresponding single observed value of YB (in this case 75.75) is incapable of producing a valid answer. The primary essence of this problem is the variability inherent in the data, which masks the fact that one catalyst does in fact provide the greater yield.
This book takes a more fundamental, first-principles approach to the issue of dealing with random variability and uncertainty in engineering problems. This is in contrast to the typical engineering statistics approach on the one hand, or the problem-specific approach on the other. With the former approach, most of the emphasis is on how to use certain popular statistical techniques to solve some of the most commonly encountered engineering problems, with little or no discussion of why the techniques are effective. With the latter approach, a particular topic (say Design of Experiments) is selected and dealt with in depth, and the appropriate statistical tools are presented and discussed within the context of the specific problem at the core of the selected topic. By definition, such an approach excludes all other topics that may be of practical interest, opting to make up in depth what it gives up in breadth.
The approach taken in this book is based on the premise that emphasizing fundamentals and basic principles, and then illustrating
these with examples, equips the reader with the means of dealing
with a range of problems wider than that explicitly covered in the
book.
0.2 Four basic principles
1. If characterized properly, random phenomena are subject to rigorous mathematical analysis in much the same manner as deterministic phenomena.
Random phenomena are so-called because they show no apparent regularity, appearing to occur haphazardly, totally at random; the observed variations do not seem to obey any discernible rational laws and therefore appear to be entirely unpredictable. However, the unpredictable irregularities of the individual observations (or, more generally, the detail) of random phenomena in fact co-exist with surprisingly predictable ensemble, or aggregate, behavior. This fact makes rigorous analysis possible; it also provides the basis for employing the concept and calculus of probability to develop a systematic framework for characterizing random phenomena in terms of probability distribution functions.
The first order of business is therefore to seek to understand random phenomena and to develop techniques for characterizing them appropriately. Part I, titled FOUNDATIONS: Understanding Random Variability, and Part II, titled PROBABILITY: Characterizing Random Variability, are devoted to these respective tasks. Ultimately, probability, and the probability distribution function, are introduced as the theoretical constructs for efficiently describing our knowledge of the real-world phenomena in question.
2. By focusing on the underlying phenomenological mechanisms, it is possible
to develop appropriate theoretical characterizations of random phenomena in
terms of ideal models of the observed variability.
Within the probabilistic framework, the ensemble, or aggregate behavior
them. Clearly then, the sheer vastness of the subject matter of engineering
applications of probability and statistics renders completely unreasonable any
hope of comprehensive coverage in a single introductory text.
Nevertheless, how probability and statistics are employed in practice to
deal successfully with various problems created by random variability and
uncertainty can be discussed in such a way as to equip the student with
the tools needed to approach, with confidence, other problems that are not addressed explicitly in this book.
Part V, titled APPLICATIONS: Dealing with Random Variability in Practice, consists of three chapters each devoted to a specific application topic of importance in engineering practice. Entire books have been written, and entire courses taught, on each of the topics to which we will devote only one chapter; the coverage is therefore designed to be more illustrative than comprehensive, providing the basis for absorbing and employing more efficiently, the more extensive material presented in these other books or courses.
0.3 Summary and Conclusions
This chapter has been primarily concerned with setting forth this book's approach to presenting the fundamentals and engineering applications of probability and statistics. The four basic principles on which the more fundamental, first-principles approach is based were presented, providing the rationale for the scope and organization of the material to be presented in the rest of the book.
The approach is designed to produce the following result:
A course of study based on this book should provide the reader with
a reasonable fundamental understanding of random phenomena, a
working knowledge of how to model and analyze such phenomena,
and facility with using probability and statistics to cope with random variability and uncertainty in some key engineering problems.
The book should also prepare the student to absorb and employ the material presented in more problem-specific courses such as Design of Experiments, Time Series Analysis, Regression Analysis, Statistical Process Control, etc., a bit more efficiently.
Part I
Foundations
Understanding Random Variability
Chapter 1
Two Motivating Examples
1.1 Yield Improvement in a Chemical Process
1.1.1 The Problem
TABLE 1.1: Yield Data for Process A versus Process B (the first two columns are YA % values; the last two are YB % values)
74.04 75.29 75.75 68.41
75.63 75.92 74.19 68.10
77.21 75.07 68.10 69.23
74.23 74.92 70.14 69.23
76.58 77.77 74.17 70.24
75.05 74.90 70.09 71.91
75.69 75.31 72.63 78.41
75.19 77.93 71.16 73.37
75.37 74.78 70.27 73.64
74.47 72.99 75.82 74.42
73.99 73.32 72.14 78.49
74.90 74.88 74.88 76.33
75.78 79.07 70.89 71.07
75.09 73.87 72.39 72.04
73.88 74.23 74.94 70.02
76.98 74.85 75.64 74.62
75.80 75.22 75.70 67.33
77.53 73.99 72.49 71.71
72.30 76.56 69.98 72.90
77.25 78.31 70.15 70.14
75.06 76.06 74.09 68.78
74.82 75.28 72.91 72.49
76.67 74.39 75.40 76.47
76.79 77.57 69.38 75.47
75.85 77.31 71.37 74.12
3. If yes, is YA − YB > 2?
Clearly, making the proper decision hinges on our ability to answer these
questions with confidence.
1.1.2
Observe that the real essence of the problem is random variability: if each
experiment had resulted in the same single, constant number for YA and another for YB , the problem would be deterministic in character, and each of
the 3 associated questions would be trivial to answer. Instead, the random
phenomena inherent in the experimental determination of the true process
yields have been manifested in the observed variability, so that we are uncertain about the true values of YA and YB , making it not quite as trivial to
solve the problem.
The sources of variability in this case can be shown to include the measurement procedure, the measurement device itself, raw materials, and process
conditions. The observed variability is therefore intrinsic to the problem and
cannot be idealized away. There is no other way to solve this problem rationally without dealing directly with the random variability.
Next, note that YA and YB data (observations) take on values on a continuous scale, i.e., yield values are real and can be located anywhere on the real line, as opposed to quantities that can take on integer values only (as is the case with the second example discussed later). The variables YA and YB are therefore said to be continuous, and this example illustrates decision-making under uncertainty when the random phenomena in question involve continuous variables.
The main issues with this problem are as follows:
1. Characterization: How should the quantities YA and YB be characterized
so that the questions raised above can be answered properly?
2. Quantification: Are there such things as true values of the quantities YA and YB? If so, how should these true values be best quantified?
3. Application: How should the characterization and quantification of YA and YB be used to answer the 3 questions raised above?
1.1.3
Before outlining procedures for solving this problem, it is helpful to entertain some notions that the intuition of a good scientist or engineer will
suggest. For instance, the concept of the arithmetic average of a collection
of n data points, x1, x2, x3, . . . , xn, defined by:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (1.1)
is well-known to all scientists and engineers, and the intuitive notion of employing this single computed value to represent the data set is almost instinctive. It seems reasonable therefore to consider representing YA with the computed average obtained from the data, i.e. yA = 75.52, and similarly, representing YB with yB = 72.47. We may now observe right away that yA > yB, which now seems to suggest not only that YA > YB, but since yA − yB = 3.05, that the difference in fact exceeds the threshold of 2%.
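The computation itself is elementary; a minimal Python sketch, using only the ten-observation preview of the yield data shown earlier (so the resulting averages differ slightly from the full 50-observation values of 75.52 and 72.47), illustrates the idea:

# Sample averages, Eq (1.1), from the 10-point preview of the yield data
yA = [74.04, 75.29, 75.62, 75.91, 77.21, 75.07, 74.23, 74.92, 76.57, 77.77]
yB = [75.75, 68.41, 74.19, 68.10, 68.10, 69.23, 70.14, 69.22, 74.17, 70.23]

ybar_A = sum(yA) / len(yA)
ybar_B = sum(yB) / len(yB)
print(round(ybar_A, 2), round(ybar_B, 2), round(ybar_A - ybar_B, 2))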
As intuitively appealing as these arguments might be, they raise some
important additional questions:
1. The variability of individual values of the data yAi around the average value yA = 75.52 is noticeable; that of yBi around the average value yB = 72.47 even more so. How confident then are we about the arguments presented above, and in the implied recommendation to prefer process A to B, based as they are on the computed averages? (For example, there are some 8 values of yBi > yA; what should we make of this fact?)
2. Will it (or should it) matter that

72.30 < yAi < 79.07; \quad 67.33 < yBi < 78.41 \qquad (1.2)

so that the observed data are seen to vary over a range of yield values that is 11.08 units wide for process B as opposed to 6.77 for A? The averages give no indication of these extents of variability.
3. More fundamentally, is it always a good idea to work with averages? How
reasonable is it to characterize the entire data set with the average?
4. If new sets of data are gathered, the new averages computed from them
will almost surely dier from the corresponding values computed from
the current set of data shown here. Observe therefore that the computed
averages yA and yB are themselves clearly subject to random variability.
How can we then be sure that using averages offers any advantages, since, like the original data, these averages are also not free from random variability?
5. How were the data themselves collected? What does it mean concretely
that the 50 experiments were carefully performed? Is it possible that
the experimental protocols used may have impaired our ability to answer
the questions posed above adequately? Conversely, are there protocols
that are particularly calibrated to improve our ability to answer these
questions adequately?
Obviously therefore there is a lot more to dealing with this example problem
than merely using the intuitively appealing notion of averages.
Let us now consider a second, different but somewhat complementary,
example.
1.2 Quality Assurance in a Glass Sheet Manufacturing Process

TABLE 1.2: Number of inclusions on sixty 1-sq meter glass sheets
0 1 1 1 0 0 1 0 2 2
2 0 2 2 3 2 0 0 2 0
1 2 0 1 0 1 0 0 1 1
1 1 5 2 0 0 1 4 1 1
2 1 0 0 1 1 0 0 1 1
1 0 0 2 4 0 1 1 0 1
1.3
Even though the two illustrative problems presented above are different in so many ways (one involves continuous variables, the other a discrete variable; one is concerned with comparing two entities to each other, the other pits a single set of data against a design target), the systematic approach to solving such problems provided by probability and statistics applies to both in a unified way. The fundamental issues at stake may be stated as follows:

In light of its defining characteristics of intrinsic variability, how should randomly varying quantities be characterized and quantified precisely in order to facilitate the solution of practical problems involving them?
TABLE 1.3: Group classification and frequencies for YA data (from the proposed process)

YA group      Frequency   Relative Frequency
71.51-72.50   1           0.02
72.51-73.50   2           0.04
73.51-74.50   9           0.18
74.51-75.50   17          0.34
75.51-76.50   7           0.14
76.51-77.50   8           0.16
77.51-78.50   5           0.10
78.51-79.50   1           0.02
TOTAL         50          1.00
What now follows is a somewhat informal examination of the ideas and concepts behind these time-tested techniques. The purpose is to motivate and
provide context for the more formal discussions in upcoming chapters.
1.3.1
Let us revisit the example data sets and consider the following alternative approach to the data representation. Instead of focusing on individual observations as presented in the tables of raw data, what if we sub-divided the observations into small groups (called bins) and re-organized the raw data in terms of how frequently members of each group occur? One possible result is shown in Tables 1.3 and 1.4 respectively for process A and process B. (A different bin size will lead to a slightly different group classification but the principles remain the same.)
This reclassification indicates, for instance, that for YA, there is only one observation between 71.51 and 72.50 (the actual number is 72.30), but there are 17 observations between 74.51 and 75.50; for YB on the other hand, 3 observations fall in the [67.51-68.50] group whereas there are 8 observations between 69.51 and 70.50. The relative frequency column indicates what proportion of the original 50 data points are found in each group. A plot of this reorganization of the data, known as the histogram, is shown in Figure 1.1 for YA and Figure 1.2 for YB.
The histogram, a term first used by Pearson in 1895, is a graphical representation of data from a group-classification and frequency-of-occurrence perspective. Each bar represents a distinct group (or class) within the data set, with the bar height proportional to the group frequency. Because this graphical representation provides a picture of how the data are distributed in terms of the frequency of occurrence of each group (how much of the data each group contains), it conveys at a glance the extent of the variability in the data.
TABLE 1.4: Group classification and frequencies for YB data (from the incumbent process)

YB group      Frequency   Relative Frequency
66.51-67.50   1           0.02
67.51-68.50   3           0.06
68.51-69.50   4           0.08
69.51-70.50   8           0.16
70.51-71.50   4           0.08
71.51-72.50   7           0.14
72.51-73.50   4           0.08
73.51-74.50   6           0.12
74.51-75.50   5           0.10
75.51-76.50   6           0.12
76.51-77.50   0           0.00
77.51-78.50   2           0.04
78.51-79.50   0           0.00
TOTAL         50          1.00
FIGURE 1.1: Histogram of the YA yield data (frequency versus YA).
FIGURE 1.2: Histogram of the YB yield data (frequency versus YB).
TABLE 1.5: Group classification and frequencies for the inclusions data

X       Frequency   Relative Frequency
0       22          0.367
1       23          0.383
2       11          0.183
3       1           0.017
4       2           0.033
5       1           0.017
6       0           0.000
TOTAL   60          1.000
FIGURE 1.3: Histogram of the inclusions data (frequency versus number of inclusions).
5% of the glass sheets (3 out of 60) have more than 3 inclusions; the remaining 95% have 3 or fewer; 93.3% (56 out of 60) have 2 or fewer inclusions. The important point is that such quantitative characteristics of the data variability (made possible by the histogram) are potentially useful for answering practical questions about what one can reasonably expect from this process.
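Assembling such a frequency table is mechanical; a minimal Python sketch using the sixty inclusions observations of Table 1.2 (the counts should reproduce Table 1.5) illustrates it:

from collections import Counter

# The 60 observations of Table 1.2, read row by row
inclusions = [0,1,1,1,0,0,1,0,2,2,  2,0,2,2,3,2,0,0,2,0,
              1,2,0,1,0,1,0,0,1,1,  1,1,5,2,0,0,1,4,1,1,
              2,1,0,0,1,1,0,0,1,1,  1,0,0,2,4,0,1,1,0,1]

counts = Counter(inclusions)
n = len(inclusions)
for x in sorted(counts):
    # value of X, frequency, relative frequency
    print(x, counts[x], round(counts[x] / n, 3))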
1.3.2 Theoretical Distributions
How can the benefits of the histogram be consolidated into a useful tool for quantitative analysis of randomly varying phenomena? The answer: by appealing to a fundamental axiom of random phenomena: that conceptually, as more observations are made, the shape of the data histogram stabilizes, and tends to the form of the theoretical distribution that characterizes the random phenomenon in question, in the limit as the total number of observations approaches infinity. It is important to note that this concept does not necessarily require that an infinite number of observations actually be obtained in practice, even if this were possible. The essence of the concept is that an underlying theoretical distribution exists for which the frequency distribution represented by the histogram is but a finite sample approximation; that the underlying theoretical distribution is an ideal model of the particular phenomenon responsible for generating the finite number of observations contained in the current data set; and hence that this theoretical distribution provides a reasonable mathematical characterization of the random phenomenon.
As we show later, these theoretical distributions may be derived from first principles given sufficient knowledge regarding the underlying random phenomena. And, as the brief informal examination of the illustrative histograms
above indicates, these theoretical distributions can be used for various things. For example, even though we have not yet provided any concrete definition of the term probability, neither have we given any concrete justifications of its usage in this context, still from the discussion in the previous section, the reader can intuitively attest to the reasonableness of the following statements: the probability that YA ≥ 74.5 is 0.76; or the probability that YB ≥ 74.5 is 0.26; or the probability that X ≤ 1 is 0.75. Parts II and III are devoted to establishing these ideas more concretely and more precisely.
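These values are simple cumulative sums of the relative frequencies in Tables 1.3, 1.4 and 1.5; a minimal sketch confirming the arithmetic:

# Relative frequencies from Tables 1.3 and 1.4 (groups in increasing order)
rf_A = [0.02, 0.04, 0.18, 0.34, 0.14, 0.16, 0.10, 0.02]
rf_B = [0.02, 0.06, 0.08, 0.16, 0.08, 0.14, 0.08, 0.12, 0.10, 0.12, 0.00, 0.04, 0.00]

print(round(sum(rf_A[3:]), 2))   # P(YA >= 74.5): groups 74.51-75.50 and above -> 0.76
print(round(sum(rf_B[8:]), 2))   # P(YB >= 74.5): groups 74.51-75.50 and above -> 0.26
print(round(0.367 + 0.383, 2))   # P(X <= 1), from Table 1.5 -> 0.75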
A Preview
It turns out that the theoretical distribution for each yield data set is:

f(y|\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}; \; -\infty < y < \infty \qquad (1.3)

which, when superimposed on each histogram, is shown in Fig 1.4 for YA, and Fig 1.5 for YB, when the indicated characteristic parameters are specified as μ = 75.52, σ = 1.43 for YA, and μ = 72.47, σ = 2.76 for YB.
Similarly, the theoretical distribution for the inclusions data is:

f(x|\lambda) = \frac{e^{-\lambda}\lambda^x}{x!}; \; x = 0, 1, 2, \ldots \qquad (1.4)

where the characteristic parameter λ = 1.02 is the average number of inclusions in each glass sheet. In similar fashion to Eq (1.3), it also provides a theoretical characterization and quantification of the random phenomenon responsible for the variability observed in the inclusions data. From it we are able, for example, to compute the theoretical probabilities of observing 0, 1, 2, . . ., inclusions in any one glass sheet manufactured by this process. A plot of this theoretical probability distribution function is shown in Fig 1.6 (compare with the histogram in Fig 1.3).
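A minimal sketch of how Eq (1.4) generates such theoretical probabilities (with λ = 1.02, for comparison against the relative frequency column of Table 1.5):

from math import exp, factorial

lam = 1.02   # average number of inclusions per glass sheet
for x in range(7):
    f_x = exp(-lam) * lam**x / factorial(x)   # Eq (1.4)
    print(x, round(f_x, 3))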
The full detail of precisely what all this means is discussed in subsequent chapters; for now, this current brief preview serves the purpose of simply indicating how the expressions in Eqs (1.3) and (1.4) provide a theoretical means of characterizing (and quantifying) the random phenomenon involved respectively in the yield data and in the inclusions data. Expressions such as these are called probability distribution functions (pdfs) and they provide the basis for rational analysis of random variability via the concept of probability. Precisely what this concept of probability is, how it gives rise to pdfs, and how pdfs are used to solve practical problems and provide answers to the sorts of questions posed by these illustrative examples, constitute the primary focus of the remaining chapters in the book.
At this point, it is best to defer the rest of the discussion until we revisit these two problems at appropriate places in upcoming chapters.
FIGURE 1.4: Histogram of YA data with superimposed theoretical (normal) distribution: Mean = 75.52, StDev = 1.432, N = 50.
FIGURE 1.5: Histogram of YB data with superimposed theoretical (normal) distribution: Mean = 72.47, StDev = 2.764, N = 50.
FIGURE 1.6: Theoretical probability distribution function for a Poisson random variable with parameter λ = 1.02. Compare with the inclusions data histogram in Fig 1.3.
1.4 Summary and Conclusions
REVIEW QUESTIONS
1. What decision is to be made in the yield improvement problem of Section 1.1?
2. What are the economic factors to be taken into consideration in deciding what
to do with the yield improvement problem?
3. What is the essence of the yield improvement problem as discussed in Section
1.1?
4. What are some of the sources of variability associated with the process yields?
5. Why are the yield variables, YA and YB , continuous variables?
6. What single value is suggested as intuitive for representing a collection of n
data points, x1 , x2 , . . . , xn ?
7. What are some of the issues raised by entertaining the idea of representing the
yield data sets with the arithmetic averages yA and yB ?
8. Why is the number of inclusions found on each glass sheet a discrete variable?
9. What are some sources of variability associated with the glass manufacturing process which may ultimately be responsible for the variability observed in the number
of inclusions?
10. What is a frequency distribution and how is it obtained from raw data?
11. Why will bin size affect the appearance of a frequency distribution?
12. What is a histogram and how is it obtained from data?
13. What is the primary advantage of a histogram over a table of raw data?
EXERCISES
Section 1.1
1.1 The variance of a collection of n data points, y1, y2, . . . , yn, is defined as:

s^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1} \qquad (1.5)

where \bar{y} is the arithmetic average of the data set. From the yield data in Table 1.1, obtain the variances s_A^2 and s_B^2 for the YA and YB data sets, respectively. Which is greater, s_A^2 or s_B^2?
1.2 Even though the data sets in Table 1.1 were not generated in pairs, obtain the 50 differences,

d_i = y_{Ai} - y_{Bi}; \; i = 1, 2, \ldots, 50, \qquad (1.6)

for corresponding values of YA and YB as presented in this table. Obtain a histogram of di and compute the arithmetic average,

\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i. \qquad (1.7)

What do these results suggest about the possibility that YA may be greater than YB?
1.3 A set of theoretical results to be established later (see Chapter 4 Exercises) states that, for di and \bar{d} defined in Eq (1.7), and variance s^2 defined in Exercise 1.1,

\bar{d} = \bar{y}_A - \bar{y}_B \qquad (1.8)

s_d^2 = s_A^2 + s_B^2 \qquad (1.9)
two sets of histograms with the corresponding histograms in Figs 1.1 and 1.2.
1.7 From the frequency distribution in Table 1.3 and the values computed for the average, yA, and variance, s_A^2, of the yield data set, YA, determine the percentage of the data contained in the interval yA ± 1.96sA, where sA is the positive square root of the variance, s_A^2.
1.8 Repeat Exercise 1.7 for the YB data in Table 1.4. Determine the percentage of the data contained in the interval yB ± 1.96sB.
1.9 From Table 1.5 determine the value of x such that only 5% of the data exceeds this value.
1.10 Using μ = 75.52 and σ = 1.43, compute theoretical values of the function in Eq (1.3) at the center points of the frequency groups for the YA data in Table 1.3; i.e., for y = 72, 73, . . . , 79. Compare these theoretical values with the corresponding relative frequency values.
1.11 Repeat Exercise 1.10 for the YB data and Table 1.4.
1.12 Using λ = 1.02, compute theoretical values of the function f(x|λ) in Eq (1.4) at x = 0, 1, 2, . . . , 6 and compare with the corresponding relative frequency values in Table 1.5.
APPLICATION PROBLEMS
1.13 The data set in the table below is the time (in months) from receipt to publication (sometimes known as time-to-publication) of 85 papers published in the January
2004 issue of a leading chemical engineering research journal.
19.2  9.0  17.2  8.2  4.5  13.5  20.7  7.9  19.5  8.8
18.7  7.4  9.7  13.7  8.1  8.4  10.8  15.1  5.3  12.0
3.0  18.5  5.8  6.8  14.5  3.3  11.1  16.4  7.3  7.4
7.3  5.2  10.2  3.1  9.6  12.9  17.3  6.0  24.3  21.3
19.3  2.5  9.1  8.1  9.8  15.4  15.7  8.2  8.8  7.2
12.8  4.2  4.2  7.8  9.5  3.9  8.7  5.9  5.3  1.8
10.1  10.0  18.7  5.6  3.3  7.3  11.3  2.9  5.4  15.2
8.0  11.7  17.2  4.0  3.8  7.4  5.3  10.6  15.2  11.5
5.9  20.1  12.2  12.0  8.8
(i) Generate a histogram of this data set. Comment on the shape of this histogram and why, from the nature of the variable in question, such a shape may not be surprising.
(ii) From the histogram of the data, what is the most popular time-to-publication,
and what fraction of the papers took longer than this to publish?
1.14 Refer to Problem 1.13. Let each raw data entry in the data table be xi .
(i) Generate a set of 85 sample average publication times, yi, from 20 consecutive times as follows:

y_1 = \frac{1}{20}\sum_{i=1}^{20} x_i \qquad (1.10)

y_2 = \frac{1}{20}\sum_{i=2}^{21} x_i \qquad (1.11)

y_3 = \frac{1}{20}\sum_{i=3}^{22} x_i \qquad (1.12)

\vdots

y_j = \frac{1}{20}\sum_{i=j}^{20+(j-1)} x_i \qquad (1.13)

For values of j > 66, yj should be obtained by replacing x86, x87, x88, . . . , which do not exist, with x1, x2, x3, . . . , respectively (i.e., for these purposes treat the given xi data like a circular array; a short sketch of this wraparound computation follows part (ii) below). Plot the histogram for this generated yi data and compare the shape of this histogram with that of the original xi data.
(ii) Repeat part (i) above, this time for zi data generated from:

z_j = \frac{1}{20}\sum_{i=j}^{20+(j-1)} y_i \qquad (1.14)

for j = 1, 2, . . . , 85. Compare the histogram of the zi data with that of the yi data and comment on the effect of averaging on the shape of the data histograms.
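A minimal sketch of the circular-array averaging prescribed in part (i) (the list x is shown truncated; the full 85 values from Problem 1.13 are assumed):

# Circular 20-point moving averages for Problem 1.14
x = [19.2, 9.0, 17.2]   # ... extend with all 85 time-to-publication values

def moving_averages(data, window=20):
    n = len(data)
    # indices beyond the end wrap back to the start (circular array)
    return [sum(data[(j + k) % n] for k in range(window)) / window
            for j in range(n)]

y = moving_averages(x)   # the y_j of Eqs (1.10)-(1.13)
z = moving_averages(y)   # the z_j of Eq (1.14)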
1.15 The data shown in the table below is a four-year record of the number of
recordable safety incidents occurring at a plant site each month.
1 0 2 0 0 1 2 1 0 0 0 0
0 1 1 0 2 0 2 0 2 0 0 0
0 0 1 0 0 0 2 0 0 0 1 1
1 0 1 0 0 0 0 0 1 1 0 1
(i) Find the average number of safety incidents per month and the associated variance. Construct a frequency table of the data and plot a histogram.
(ii) From the frequency table and the histogram, what can you say about the
chances of obtaining each of the following observations, where x represents the
number of observed safety incidents per month: x = 0, x = 1, x = 2, x = 3, x = 4
and x = 5?
(iii) Consider the postulate that a reasonable model for this phenomenon is:

f(x) = \frac{e^{-0.5}(0.5)^x}{x!} \qquad (1.15)
1   272  263      11  215  206
2   319  313      12  245  235
3   253  251      13  248  237
4   325  312      14  364  350
5   236  227      15  301  288
6   233  227      16  203  195
7   300  290      17  197  193
8   260  251      18  217  216
9   268  262      19  210  202
10  276  263      20  223  214
x (live births per delivered pregnancy)                    0    1    2    3    4    5
yO: Total no. of older patients (out of 100) with outcome x    32   41   21   5    1    0
yY: Total no. of younger patients (out of 100) with outcome x   8   25   35   23   8    1
The data shows x, the number of live births per delivered pregnancy, along with how many in each group had the pregnancy outcome of x. For example, the first entry indicates that the IVF treatment was unsuccessful for 32 of the older patients, with the corresponding number being 8 for the younger patients; 41 older patients delivered singletons, compared with 25 for the younger patients; 21 older patients and 35 younger patients each delivered twins; etc. Obtain a relative frequency distribution for these data sets and plot the corresponding histograms. Determine the average number of live births per delivered pregnancy for each group and compare these values. Comment on whether or not these data sets indicate that the outcomes of the IVF treatments are different for these two groups.
Chapter 2
Random Phenomena, Variability and Uncertainty
When John Stuart Mill stated in his 1862 book, A System of Logic: Ratiocinative and Inductive, that "...the very events which in their own nature appear most capricious and uncertain, and which in any individual case no attainable degree of knowledge would enable us to foresee, occur, when considerable numbers are taken into account, with a degree of regularity approaching to mathematical...", he was merely articulating, astutely for the time, the then-radical, but now well-accepted, concept that randomness in scientific observation is not a synonym for disorder; it is order of a different kind. The more familiar kind of order informs determinism: the concept that, with sufficient mechanistic knowledge, all physical phenomena are entirely predictable and thus describable by precise mathematical equations. But even classical physics, that archetypal deterministic science, had to make room for this other kind
2.1
2.1.1
(2.2)
The rate of heat loss is determined precisely and consistently for any given specific values of each entity on the right hand side of this equation.
The concept of determinism, that the phenomenon in question is precisely determinable in every relevant detail, is central to much of science and engineering and has proven quite useful in analyzing real systems, and in solving practical problems, whether it is computing the trajectory of rockets for
y_{Ai} = \mu_A + \epsilon_{Ai} \qquad (2.3)

y_{Bi} = \mu_B + \epsilon_{Bi} \qquad (2.4)

with μA and μB representing the true but unknown yields obtainable from processes A and B respectively, and εAi and εBi representing the superimposed randomly varying components, the sources of the random variability evident in each observation yAi and yBi. Identical values of μA do not produce identical values of yAi in Eq (2.3); neither will identical values of μB produce identical values of yBi in Eq (2.4). In the second case of the glass process and the number of inclusions per square meter, the idealization is:

x_i = \lambda + \epsilon_i \qquad (2.5)

where λ is the true number of inclusions associated with the process and εi is the superimposed random component responsible for the observed randomness in the actual number of inclusions xi found on each individual glass sheet upon inspection.
These two perspectives, determinism and randomness, are thus two
opposite idealizations of natural phenomena, the former when deterministic
aspects of the phenomenon are considered to be overwhelmingly dominant
over any random components, the latter case when the random components
are dominant and central to the problem. The principles behind each conceptual idealization, and the analysis technique appropriate to each, are now
elucidated with a chemical engineering illustration.
2.1.2
FIGURE 2.1: Schematic of a plug flow reactor (PFR): a fluid element moves through a tube of length l m and uniform cross-sectional area A m2 at volumetric flow rate F m3/s; a bolus of dye of concentration C0 δ(t) is injected at the inlet.

It therefore takes precisely the same amount of time,

\tau = \frac{lA}{F} \; \text{secs} \qquad (2.7)

for each dye element to traverse the reactor. Hence, τ, the residence time for an ideal plug flow reactor (PFR) is a deterministic quantity because its value is exactly and precisely determinable from Eq (2.7) given F, A and l.
Keep in mind that the determinism that informs this analysis of the PFR
residence time arises directly as a consequence of the central plug flow idealization. Any departures from such idealization, especially the presence of significant axial dispersion (leading to a non-flat fluid velocity profile), will result in dye molecules no longer arriving at the outlet at precisely the same time.

FIGURE 2.2: Schematic of a continuous stirred tank reactor (CSTR) of volume V m3, with volumetric flow rate F m3/s and a bolus of dye of concentration C0 δ(t) injected into the inlet stream.
Randomness and the CSTR
With the continuous stirred tank reactor (CSTR), the reactant stream continuously flows into a tank that is vigorously stirred to ensure uniform mixing of its content, while the product is continuously withdrawn from the outlet (see Fig 2.2). The assumptions (idealizations) in this case are:
the reactor tank has a fixed, constant volume, V m3;
the contents of the tank are perfectly mixed.
Once again, let us consider that a bolus of red dye of concentration C0 moles/m3 is instantaneously injected into the inlet stream at time t = 0; and again, ask: how much time does each molecule of red dye spend in the reactor? Unlike with the plug flow reactor, observe that it is impossible to answer this question a-priori, or precisely: because of the vigorous stirring of the reactor content, some dye molecules will exit almost instantaneously; others will stay longer, some for a very long time. In fact, it can be shown that theoretically, 0 < τ < ∞. Hence in this case, τ, the residence time, is a randomly varying quantity that can take on a range of values from 0 to ∞; it cannot therefore be adequately characterized as a single number. Notwithstanding, as all chemical engineers know, the random phenomenon of residence times for ideal CSTRs can, and has been, analyzed systematically (see for example, Hill, 1977¹).
¹ C.G. Hill, Jr., An Introduction to Chemical Engineering Kinetics and Reactor Design, Wiley, NY, 1977, pp 388-396.
A material balance on the dye around the well-mixed tank gives:

V\frac{dC}{dt} = F C_{in} - F C \qquad (2.8)

where Cin is the dye concentration in the inlet stream. If we define the parameter τ̄ as

\bar{\tau} = \frac{V}{F} \qquad (2.9)

and note that the introduction of a bolus of dye of concentration C0 at t = 0 implies:

C_{in} = C_0\,\delta(t) \qquad (2.10)

where δ(t) is the Dirac delta function, then Eq (2.8) becomes:

\bar{\tau}\frac{dC}{dt} = -C + C_0\,\delta(t) \qquad (2.11)

whose solution is:

C(t) = \frac{C_0}{\bar{\tau}}\, e^{-t/\bar{\tau}} \qquad (2.12)

Normalizing this expression by C0 yields the function

f(\tau) = \frac{1}{\bar{\tau}}\, e^{-\tau/\bar{\tau}} \qquad (2.14)

recognizable to all chemical engineers as the familiar exponential instantaneous residence time distribution function for the ideal CSTR.
FIGURE 2.3: Instantaneous residence time distribution function for the CSTR (with τ̄ = 5).
The reader should take good note of this expression: it shows up a few more times and in various guises in subsequent chapters. For now, let us observe that, even though (a) the residence time for a CSTR, τ, exhibits random variability, potentially able to take on values between 0 and ∞ (and is therefore not describable by a single value); so that (b) it is therefore impossible to determine with absolute certainty precisely when any individual dye molecule will leave the reactor; even so (c) the function, f(τ), shown in Eq (2.14), mathematically characterizes the behavior of the entire ensemble of dye molecules, but in a way that requires some explanation:
1. It represents how the residence times of fluid particles in the well-mixed CSTR are distributed over the range of possible values 0 < τ < ∞ (see Fig 2.3).
2. This distribution of residence times is a well-defined, well-characterized function, but it is not a description of the precise amount of time a particular individual dye molecule will spend in the reactor; rather it is a description of how many (or what fraction) of the entire collection of dye molecules will spend what amount of time in the reactor. For example, in broad terms, it indicates that a good fraction of the molecules have relatively short residence times, exiting the reactor quickly; a much smaller but non-zero fraction have relatively long residence times. It can also provide more precise statements as follows.
3. From this expression (Eq (2.14)), we can determine the fraction of dye molecules that have remained in the reactor for an amount of time less than or equal to some time t (i.e. molecules exiting the reactor with
age less than or equal to t): we do this by integrating f(τ) with respect to τ, as follows, to obtain

F(t) = \int_0^t \frac{1}{\bar{\tau}}\, e^{-\tau/\bar{\tau}}\, d\tau = 1 - e^{-t/\bar{\tau}} \qquad (2.15)
from which we see that F(0), the fraction of dye molecules with age less than or equal to zero, is exactly zero: indicating the intuitively obvious fact that, no matter how vigorous the mixing, each dye molecule spends at least a finite, non-zero, amount of time in the reactor (no molecule exits instantaneously upon entry).
On the other hand, F(∞) = 1, since

F(\infty) = \int_0^\infty \frac{1}{\bar{\tau}}\, e^{-\tau/\bar{\tau}}\, d\tau = 1 \qquad (2.16)

again indicating the obvious: if we wait long enough, all dye molecules will eventually exit the reactor as t → ∞. In other words, the fraction of molecules exiting the reactor with age less than ∞ is exactly 1.
4. Since the fraction of molecules that will have remained in the reactor for an amount of time less than or equal to t is F(t), and the fraction that will have remained in the reactor for less than or equal to t + Δt is F(t + Δt), then the fraction with residence time in the infinitesimal interval between t and (t + Δt) is given by:

P[t \le \tau \le (t + \Delta t)] = F(t + \Delta t) - F(t) = \int_t^{t+\Delta t} \frac{1}{\bar{\tau}}\, e^{-\tau/\bar{\tau}}\, d\tau \qquad (2.17)

\approx f(t)\,\Delta t \qquad (2.18)
5. And finally, the average residence time may be determined from the expression in Eq (2.14) (and Eq (2.16)) as:

\langle \tau \rangle = \frac{\int_0^\infty \tau\,\frac{1}{\bar{\tau}}\, e^{-\tau/\bar{\tau}}\, d\tau}{\int_0^\infty \frac{1}{\bar{\tau}}\, e^{-\tau/\bar{\tau}}\, d\tau} = \bar{\tau} \qquad (2.19)

where the numerator integral is evaluated via integration by parts. Observe from the definition of τ̄ above (in Eq (2.9)) that this result makes perfect sense, strictly from the physics of the problem: particles in a stream flowing at the rate F m3/s through a well-mixed reactor of volume V m3 will spend an average of V/F = τ̄ seconds in the reactor.
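These ensemble properties are easy to check numerically; a minimal sketch, assuming τ̄ = 5 as in Fig 2.3:

from math import exp

tau_bar = 5.0   # mean residence time, V/F (as in Fig 2.3)

def f(tau):
    return exp(-tau / tau_bar) / tau_bar   # RTD, Eq (2.14)

def F(t):
    return 1.0 - exp(-t / tau_bar)         # cumulative fraction, Eq (2.15)

print(round(F(tau_bar), 3))   # fraction with age <= tau_bar: 1 - 1/e = 0.632

# crude Riemann-sum estimate of the mean residence time, Eq (2.19)
dt = 0.001
mean = sum(i * dt * f(i * dt) * dt for i in range(1, 200_000))
print(round(mean, 2))         # approximately 5.0, i.e., <tau> = tau_bar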
We now observe in conclusion two important points: (i) even though at no
point in the preceding discussion have we made any overt or explicit appeal
2.2
2.2.1
In such diverse areas as actuarial science, biology, chemical reactors, demography, economics, finance, genetics, human mortality, manufacturing quality assurance, polymer chemistry, etc., one repeatedly encounters a surprisingly common theme whereby phenomena which, on an individual level, appear entirely unpredictable, are well-characterized as ensembles (as demonstrated above with residence time distribution in CSTRs). For example, as
far back as 1662, in a study widely considered to be the genesis of population
demographics and of modern actuarial science by which insurance premiums
are determined today, the British haberdasher, John Graunt (1620-1674), had
observed that the number of deaths and the age at death in London were surprisingly predictable for the entire population even though it was impossible to
predict which individual would die when and in what manner. Similarly, while
the number of monomer molecules linked together in any polymer molecule
chain varies considerably, how many chains of a certain length a batch of
polymer product contains can be characterized fairly predictably.
Such natural phenomena noted above have come to be known as Random Mass Phenomena, with the following defining characteristics:
1. Individual observations appear irregular because it is not possible to predict each one with certainty; but
2. The ensemble or aggregate of all possible outcomes is regular, well-characterized and determinable;
3. The underlying phenomenological mechanisms accounting for the nature and occurrence of the specific observations determine the character of the ensemble;
4. Such phenomenological mechanisms may be known mechanistically (as was the case with the CSTR), or their manifestation may only be determined from data (as was the case with John Graunt's mortality tables of 1662).
2.2.2
While ensemble characterizations provide a means of dealing systematically with random mass phenomena, many practical problems still involve making decisions about specific, inherently unpredictable, outcomes. For example, the insurance company still has to decide what premium to charge each individual on a person-by-person basis. When decisions must be made about specific outcomes of random mass phenomena, uncertainty is an inevitable consequence of the inherent variability. Furthermore, the extent or degree of variability directly affects the degree of uncertainty: tighter clustering of possible outcomes implies less uncertainty, whereas a broader distribution of possible outcomes implies more uncertainty. The most useful mathematical characterization of ensembles must therefore permit not only systematic analysis, but also a rational quantification of the degree of variability inherent in the ensemble, and the resulting uncertainty associated with each individual observation as a result.
2.2.3
Embedded in these questions are the following affiliated questions that arise as a consequence: (a) how was {x_i}_{i=1}^{n} obtained in (1); will the procedure for obtaining the data affect how well we can answer question 1? (b) how was f(x) determined in (2)?
Subsequent chapters are devoted to dealing with these fundamental problems systematically and in greater detail.
2.3 Introducing Probability
2.3.1 Basic Concepts
f(x|\lambda) = \frac{e^{-\lambda}\lambda^x}{x!}; \; x = 0, 1, 2, \ldots \qquad (2.20)
TABLE 2.1: Computed probabilities of occurrence of various numbers of inclusions for λ = 2 in Eq (2.20)

x = No. of inclusions   f(x) = prob. of occurrence
0                       0.135
1                       0.271
2                       0.271
3                       0.180
4                       0.090
5                       0.036
...                     ...
8                       0.001
9                       0.000
2.3.2 Interpreting Probability
is assigned to indicate the degree of uncertainty associated with the occurrence of a particular outcome. As with temperature the conceptual quantity, how a numerical value is determined for the probability of the occurrence of a particular outcome under any specific circumstance depends on the circumstance itself. To carry the analogy with temperature a bit further: while
will suce in one case, a more precise device, such as a thermocouple, may
be required in another case, and an optical pyrometer for yet another case.
Whatever the case, under no circumstance should the device employed to determine its numerical value usurp the role of, or become the surrogate for,
temperature the quantity. This is important in properly interpreting probability, the conceptual entity: how an appropriate value is to be determined
for probability, an important practical problem in its own right, should not
be confused with the quantity itself.
With these ideas in mind, let us now consider several standard perspectives
of probability that have evolved over the years. These are best understood as
various techniques for how numerical values are determined rather than what
probability is.
Classical (A-Priori) Probability
Consider a random phenomenon for which the total number of possible
outcomes is known to be N , all of which are equally likely; of these, let NA
be the number of outcomes in which A is observed (i.e. outcomes that are
favorable to A). Then according to the classical (or a-priori) perspective, the probability of the occurrence of outcome A is defined as

P(A) = \frac{N_A}{N} \qquad (2.21)
For example, in tossing a single perfect die once, the probability of observing
a 3 is, according to this viewpoint, evaluated as 1/6, since the total number of
possible outcomes is 6 of which only 1 is favorable to the desired observation
of 3. Similarly, if B is the outcome that one observes an odd number of dots,
then P (B) = 3/6 = 0.5.
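A minimal enumeration sketch of this a-priori computation for the single toss of a die:

from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]   # N = 6 equally likely outcomes
N = len(outcomes)

N_A = sum(1 for o in outcomes if o == 3)        # outcomes favorable to "a 3"
N_B = sum(1 for o in outcomes if o % 2 == 1)    # outcomes favorable to "odd"

print(Fraction(N_A, N), Fraction(N_B, N))       # 1/6 and 1/2, as in Eq (2.21)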
Observe that according to this view, no experiments have been performed yet; the formulation is based entirely on an a-priori enumeration of N and NA. However, this intuitively appealing perspective is not always applicable:
What if all the outcomes are not equally likely?
How about random phenomena whose outcomes cannot be characterized as cleanly in this fashion, say, for example, the prospect of a newly purchased refrigerator lasting for 25 years without repair? or the prospect of snow falling on a specific April day in Wisconsin?
What Eq. (2.21) provides is an intuitively appealing (and theoretically sound) means of determining an appropriate value for P(A); but it is restricted only
to those circumstances where the random phenomenon in question is characterized in such a way that N and NA are natural and easy to identify.
Relative Frequency (A-Posteriori) Probability
On the opposite end of the spectrum from the a-priori perspective is the following alternative: consider an experiment that is repeated n times under identical conditions, where the outcomes involving A have been observed to occur nA times. Then, a-posteriori, the probability of the occurrence of outcome A is defined as

P(A) = \lim_{n\to\infty} \frac{n_A}{n} \qquad (2.22)
The appeal of this viewpoint is not so much that it is just as intuitive as the previous one, but that it is also empirical, making no assumptions about equal likelihood of outcomes. It is based on the actual performance of experiments and the actual a-posteriori observation of the relative frequency of occurrences of the desired outcome. This perspective provides a prevalent interpretation of probability as the theoretical value of long range relative frequencies. In fact, this is what motivates the notion of the theoretical distribution as the limiting form to which the empirical frequency distribution tends with the acquisition of increasing amounts of data.
However, this perspective also suffers from some limitations:
How many trials, n, is sufficient for Eq (2.22) to be useful in practice?
How about random phenomena for which the desired outcome does not lend itself to repetitive experimentation under identical conditions, say, for example, the prospect of snow falling on a specific April day in Wisconsin? or the prospect of your favorite team winning the basketball championship next year?
Once again, these limitations arise primarily because Eq (2.22) is simply just another means of determining an appropriate value for P(A) that happens to be valid only when the random phenomenon is such that the indicated repeated experimentation is not only possible and convenient, but for which, in practice, truncating after a sufficiently large number of trials to produce a finite approximation presents no conceptual dilemma. For example, after tossing a coin 500 times and obtaining 251 heads, declaring that the probability of obtaining a head upon a single toss as 0.5 presents no conceptual dilemma whatsoever.
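A minimal simulation sketch of this a-posteriori perspective, watching the relative frequency of heads settle toward 0.5 as n grows:

import random

random.seed(1)   # fixed seed, purely for reproducibility of the illustration
n_A = 0
for n in range(1, 100_001):
    if random.random() < 0.5:   # one simulated toss of a fair coin: "heads"
        n_A += 1
    if n in (10, 100, 1000, 10_000, 100_000):
        print(n, round(n_A / n, 4))   # n_A/n approximates P(A), Eq (2.22)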
Subjective Probability
There is yet another alternative perspective whereby P (A) is taken simply
as a measure of the degree of (personal) belief associated with the postulate
that A will occur, the value having been assigned subjectively by the individual concerned, akin to betting odds. Thus, for example, in rolling a perfect
die, the probability of obtaining a 3 is assigned strictly on the basis of what the
2.4
Beginning with the next chapter, Part II is devoted to an axiomatic treatment of probability, including basic elements of probability theory, random
variables, and probability distribution functions, within the context of a comprehensive framework for systematically analyzing random phenomena.
The central conceptual elements of this framework are: (i) a formal representation of uncertain outcomes with the random variable, X; and (ii) the
mathematical characterization of this random variable by the probability distribution function (pdf), f (x). How the probabilities are distributed over the
entire aggregate collection of all possible outcomes, expressed in terms of the
random variable, X, is contained in this pdf. The following is a procedure for
problem-solving within this framework:
1. Problem Formulation: Define and formulate the problem appropriately.
Examine the random phenomenon in question, determine the random
variable(s), and assemble all available information about the underlying
mechanisms;
2. Model Development: Identify, postulate, or develop an appropriate ideal
model of the relevant random variability in the form of the probability
distribution function f (x);
3. Problem Solution: Use the model to solve the relevant problem (analysis,
prediction, inference, estimation, etc.);
4. Results validation: Analyze and validate the result and, if necessary,
return to any of the preceding steps as appropriate.
2.5 Summary and Conclusions
REVIEW QUESTIONS
1. If not a synonym for disorder, then what is randomness in scientific observation?
2. What is the concept of determinism?
3. Why are the expressions in Eqs (2.1) and (2.2) considered deterministic?
4. What is an example phenomenon that had to be ignored in order to obtain the deterministic expression in Eq (2.1)? And what is an example phenomenon that had to be ignored in order to obtain the deterministic expression in Eq (2.2)?
5. What are the main characteristics of randomness as described in Subsection
2.1.1?
6. Compare and contrast determinism and randomness as two opposite idealizations
of natural phenomena.
7. Which idealized phenomenon does residence time in a plug flow reactor (PFR) represent?
8. What is the central plug flow idealization in a plug flow reactor, and how will departures from such idealization affect the residence time in the reactor?
9. Which idealized phenomenon does residence time in a continuous stirred-tank
reactor (CSTR) represent?
10. On what principle is the mathematical model in Eq (2.8) based?
11. What does the expression in Eq (2.14) represent?
12. What observation by John Graunt is widely considered to be the genesis of
population demographics and of modern actuarial science?
13. What are the defining characteristics of random mass phenomena?
EXERCISES
Section 2.1
2.1 Solve Eq (2.11) explicitly to confirm the result in Eq (2.12).
2.2 Plot the expression in Eq (2.15) as a function of the scaled time variable, t* = t/τ̄; determine the percentage of dye molecules with age less than or equal to the mean residence time, τ̄.
2.3 Show that

\int_0^\infty \frac{\tau}{\bar{\tau}}\, e^{-\tau/\bar{\tau}}\, d\tau = \bar{\tau} \qquad (2.23)

2.4 The two probability distribution functions

f(x) = \frac{1}{3\sqrt{2\pi}}\, e^{-\frac{x^2}{18}}; \; -\infty < x < \infty \qquad (2.24)

and

f(y) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{y^2}{2}}; \; -\infty < y < \infty \qquad (2.25)
represent how the occurrences of all the possible outcomes of the two randomly varying, continuous variables, X and Y, are distributed. Plot these two distribution functions on the same graph. Which of these variables has a higher degree of uncertainty associated with the determination of any particular outcome? Why?
2.5 When a fair coin is tossed 4 times, it is postulated that the probability of obtaining x heads is given by the probability distribution function:

f(x) = \frac{4!}{x!(4-x)!}\,(0.5)^4 \qquad (2.26)
APPLICATION PROBLEMS
2.7 For each of the following two-reactor configurations:
(a) two plug flow reactors in series where the length of reactor 1 is l1 m, and that of reactor 2 is l2 m, but both have the same uniform cross-sectional area A m2;
(b) two continuous stirred tank reactors with volumes V1 and V2 m3;
(c) the PFR in Fig 2.1 followed by the CSTR in Fig 2.2;
given that the flow rate through each reactor ensemble is constant at F m3/s, obtain the residence time, τ, or the residence time distribution, f(τ), as appropriate. Make any assumption you deem appropriate about the concentrations C1(t) and C2(t) in the first and second reactors, respectively.
2.8 In the summer of 1943 during World War II, a total of 365 warships were attacked
by Kamikaze pilots: 180 took evasive action and 60 of these were hit; the remaining
185 counterattacked, of which 62 were hit. Using a relative frequency interpretation
and invoking any other assumption you deem necessary, determine the probability
that any attacked warship will be hit regardless of tactical response. Also determine
the probability that a warship taking evasive action will be hit and the probability
that a counterattacking warship will be hit. Compare these three probabilities and
discuss what this implies regarding choosing an appropriate tactical response. (A
full discussion of this problem is contained in Chapter 7.)
2.9 Two American National Football League (NFL) teams, A and B, with respective Win-Loss records 9-6 and 12-3 after 15 weeks, are preparing to face each other in the 16th and final game of the regular season.
(i) From a relative frequency perspective of probability, use the supplied information (and any other assumption you deem necessary) to compute the probability of Team A winning any generic game, and also of Team B winning any generic game.
(ii) When the two teams play each other, upon the presupposition that past record is the best indicator of a team's chances of winning a new game, determine reasonable values for P(A), the probability that team A wins the game, and P(B), the probability that team B wins, assuming that this game does not end up in a tie.
Note that for this particular case,
P(A) + P(B) = 1 \qquad (2.27)
Part II
Probability
Chapter 3
Fundamentals of Probability Theory
3.1 Building Blocks 58
3.2 Operations 60
3.2.1 Events, Sets and Set Operations 61
3.2.2 Set Functions 64
3.2.3 Probability Set Function 67
3.2.4 Final considerations 68
3.3 Probability 69
3.3.1 The Calculus of Probability 69
3.3.2 Implications 71
3.4 Conditional Probability 72
3.4.1 Illustrating the Concept 72
3.4.2 Formalizing the Concept 73
3.4.3 Total Probability 74
3.4.4 Bayes' Rule 76
3.5 Independence 77
3.6 Summary and Conclusions 78
REVIEW QUESTIONS 79
EXERCISES 80
APPLICATION PROBLEMS 84
The paradox of randomly varying phenomena, that the aggregate ensemble behavior of unpredictable, irregular, individual observations is stable and regular, provides a basis for developing a systematic analysis approach. Such an approach requires temporarily abandoning the futile task of predicting individual outcomes and instead focussing on characterizing the aggregate ensemble in a mathematically appropriate manner. The central element is a machinery for determining the mathematical probability of the occurrence of each outcome and for quantifying the uncertainty associated with any attempts at predicting the intrinsically unpredictable individual outcomes. How this probability machinery is assembled from a set of simple building blocks and mathematical operations is presented in this chapter, along with the basic concepts required for its subsequent use for systematic analysis of random phenomena.
3.1 Building Blocks
The set B = {TTT} consists of the only outcome involving the occurrence of 3 tails; it therefore represents the event that 3 tails are observed.
The set C = {HHH, HHT, HTH, THH} consists of the outcomes involving the occurrence of at least 2 heads; it represents the event that at least 2 heads are observed.
Similarly, the set D = {HHH} represents the event that 3 heads are observed.
A simple or elementary event is one that consists of one and only one outcome of the experiment, i.e., a set with only one element. Thus, in Example 3.2, set B and set D are examples of elementary events. Any other event consisting of more than one outcome is a complex or compound event. Sets A and C in Example 3.2 are compound events. (One must be careful to distinguish between the set and its elements: the set B in Example 3.2 contains one element, TTT, but the set is not the same as the element. Thus, even though the elementary event consists of a single outcome, one is not the same as the other.)
Elementary events possess an important property that is crucial to the
development of probability theory:
An experiment conducted once produces one and only one outcome;
The elementary event consists of only one outcome;
One and only one elementary event can occur for every experimental
trial;
Therefore:
Simple (elementary) events are mutually exclusive.
In Example 3.2, sets B and D represent elementary events; observe that if one occurs, the other one cannot. Compound events do not have this property. In this same example, observe that if, after a trial, the outcome is HTH (a tail sandwiched between two heads), event A has occurred (we have observed precisely 2 heads), but so has event C, which requires observing 2 or more heads. In the language of sets, the element HTH belongs to both set A and set C.
An elementary event therefore consists of a single outcome and cannot be decomposed into a simpler event; a compound event, on the other
hand, consists of a collection of more than one outcome and can therefore be
composed from several simple events.
3.2 Operations

3.2.1 Events, Sets and Set Operations
We earlier defined the sample space as a set whose elements are all the possible outcomes of an experiment. Events are also sets, but they consist of only certain elements from Ω that share a common attribute. Thus, every event is a subset of the sample space Ω.
Of all the subsets of Ω, there are two special ones with important connotations: ∅, the empty set consisting of no elements at all, and Ω itself. In the language of events, the former represents the impossible event, while the latter represents the certain event.
Since they are sets, events are amenable to analysis using precisely the same algebra of set operations (union, intersection and complement) which we now briefly review.
1. Union: A ∪ B represents the set of elements that are either in A or B. In general,

A1 ∪ A2 ∪ A3 ∪ ... ∪ Ak = ⋃_{i=1}^{k} Ai    (3.2)

is the set of elements that are in at least one of the k sets, {Ai}, i = 1, 2, ..., k.
2. Intersection: A ∩ B represents the set of elements that are in both A and B. In general,

A1 ∩ A2 ∩ A3 ∩ ... ∩ Ak = ⋂_{i=1}^{k} Ai    (3.3)

is the set of elements that are common to all the k sets, {Ai}, i = 1, 2, ..., k.
To discuss the third set operation requires two special sets: the universal set (or universe), typically designated Ω, and the null (or empty) set, typically designated ∅. The universal set consists of all possible elements of interest, while the null set contains no elements. (We have just recently introduced such sets above, but in the specific context of the sample space of an experiment; the current discussion is general and not restricted to the analysis of randomly varying phenomena and their associated sample spaces.)
These sets have the special properties that for any set A,

A ∪ Ω = Ω    (3.4)
A ∩ Ω = A    (3.5)
A ∪ ∅ = A    (3.6)
A ∩ ∅ = ∅    (3.7)

and, with A* representing the complement of A (the set of all elements of Ω that are not in A),

A ∩ A* = ∅;  A ∪ A* = Ω    (3.8)
(A*)* = A    (3.9)
(A ∪ B)* = A* ∩ B*    (3.10)
(A ∩ B)* = A* ∪ B*    (3.11)
The following table presents some information about the nature of subsets of Ω interpreted in the language of events.

TABLE 3.1: Subsets of Ω and the events they represent

Subset of Ω    Event represented
Ω              Certain event
∅              Impossible event
A*             Non-occurrence of event A
A ∪ B          Event A or B
A ∩ B          Events A and B
Note in particular that if A ∩ B = ∅, A and B are said to be disjoint sets (with no elements in common); in the language of events, this implies that event A occurring together with event B is impossible. Under these circumstances, events A and B are said to be mutually exclusive.
Example 3.3 PRACTICAL ILLUSTRATION OF SETS AND
EVENTS
Samples from various batches of a polymer resin manufactured at a plant
site are tested in a quality control laboratory before release for sale. The
result of the tests allows the manufacturer to classify the product into
the following 3 categories:
1. Meets or exceeds quality requirement; Assign #1; approve for sale
as 1st quality.
2. Barely misses quality requirement; Assign #2; approve for sale as
2nd grade at a lower price.
3. Fails completely to meet quality requirement; Assign #3; reject as
poor grade and send back to be incinerated.
Identify the experiment, outcome, trial, sample space and the events
associated with this practical problem.
Solution:
1. Experiment: Take a sample of polymer resin and carry out the
prescribed product quality test.
2. Trial: Each trial involves taking a representative sample from each
polymer resin batch and testing it as prescribed.
3. Outcomes: The assignment of a number 1, 2, or 3 depending on
how the result of the test compares to the product quality requirements.
4. Sample space: The set Ω = {1, 2, 3} containing all possible outcomes.
5. Events: The subsets of the sample space are identified as follows: E0 = ∅; E1 = {1}; E2 = {2}; E3 = {3}; E4 = {1, 2}; E5 = {1, 3}; E6 = {2, 3}; E7 = {1, 2, 3}. Note that there are 8 in all. In general, a set with n distinct elements will have 2^n subsets.
Observe that the compound events may be composed from the elementary events; for example:

E4 = E1 ∪ E2    (3.12)
E5 = E1 ∪ E3    (3.13)
E6 = E2 ∪ E3    (3.14)
E7 = E1 ∪ E2 ∪ E3    (3.15)
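As a quick illustration of this structure, the following short Python sketch (not part of the original text) enumerates all 2^3 subsets of the sample space of Example 3.3 and confirms the composition relations of Eqs (3.12)-(3.15):

    from itertools import combinations

    # Enumerate the 2^n subsets (events) of Omega = {1, 2, 3}
    omega = {1, 2, 3}
    events = [set(c) for r in range(len(omega) + 1)
              for c in combinations(sorted(omega), r)]

    print(len(events))                    # 8 subsets, as stated: 2^3
    E1, E2, E3 = {1}, {2}, {3}
    print(E1 | E2 == {1, 2})              # E4 = E1 U E2        (Eq 3.12)
    print(E1 | E3 == {1, 3})              # E5 = E1 U E3        (Eq 3.13)
    print(E2 | E3 == {2, 3})              # E6 = E2 U E3        (Eq 3.14)
    print(E1 | E2 | E3 == omega)          # E7 = E1 U E2 U E3   (Eq 3.15)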
TABLE 3.2: Names and attributes (gender, height, age, amount of money carried) of the 15 students of Example 3.6: Allison, Ben, Chrissy, Daoud, Evan, Fouad, Gopalan, Helmut, Ioannis, Jim, Katie, Larry, Moe, Nathan, Olu.
3.2.2 Set Functions

A function F(.), defined on the subsets of Ω such that it assigns one and only one real number to each subset of Ω, is known as a set function. By this definition, no one subset can be assigned more than one number by a set function. The following examples illustrate the concept.
Example 3.6 SET FUNCTIONS DEFINED ON THE SET OF STUDENTS IN A CLASSROOM
Table 3.2 shows a list of attributes associated with 15 students in attendance on a particular day in a 600-level course offered at the University of Delaware. Let set A be the subset of female students and B, the subset of male students. Obtain the real number assigned by the following set functions:
1. N(A), the total number of female students in class;
2. N(Ω), the total number of students in class;
3. M(B), the sum total amount of money carried by the male students;
4. H̄(A), the average height (in inches) of female students;
5. Y+(B), the maximum age, in years, of male students.
Solution:
1. N(A) = 3;
2. N(Ω) = 15;
3. M(B) = $293.00;
4. H̄(A) = 67 in.
FIGURE: Venn diagram for the sets A and B, with 6 elements in A only, 3 in A ∩ B, and 37 in B only; Eqs (3.18) and (3.19) give N(A ∪ B) = 6 + 3 + 37 = 46,
so that there are 46 parts that are either defective or from the old batch.
3.2.3 Probability Set Function

Let P(.) be an additive set function defined on all subsets of Ω, the sample space of all the possible outcomes of an experiment, such that:
1. P(A) ≥ 0 for every A ⊆ Ω;
2. P(Ω) = 1;
3. P(A ∪ B) = P(A) + P(B) for all mutually exclusive events A and B;
then P(.) is a probability set function.
Remarkably, these three simple rules (axioms), due to Kolmogorov, are sufficient to develop the mathematical theory of probability. The following are important properties of P(.) arising from these axioms:
1. To each event A, it assigns a non-negative number, P(A), its probability;
2. To the certain event Ω, it assigns unit probability;
3. The probability that either one or the other of two mutually exclusive events A, B will occur is the sum of the probabilities that each event will occur.
The following corollaries are important consequences of the foregoing three axioms:

Corollary 1. P(A*) = 1 − P(A).
The probability of non-occurrence of A is 1 minus the probability of its occurrence. Equivalently, the probabilities of the occurrence of an event and of its non-occurrence add up to 1. This follows from the fact that

Ω = A ∪ A*;    (3.20)

that A and A* are disjoint sets; that P(.) is an additive set function; and that P(Ω) = 1.

Corollary 2. P(∅) = 0.
The probability of an impossible event occurring is zero. This follows from the fact that ∅ = Ω* and from Corollary 1 above.
Corollary 3. A ⊂ B ⇒ P(A) ≤ P(B).
If A is a subset of B, then the probability of occurrence of A is less than, or equal to, the probability of the occurrence of B. This follows from the fact that, under these conditions, B can be represented as the union of two disjoint sets:

B = A ∪ (B ∩ A*)    (3.21)

so that, by the additivity of P(.),

P(B) = P(A) + P(B ∩ A*)    (3.22)

and since P(B ∩ A*) ≥ 0, it follows that

P(B) ≥ P(A)    (3.23)
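The axioms and corollaries are easy to check computationally. The following Python sketch (not part of the original text) builds a probability set function on a small finite sample space by summing elementary-event probabilities; the specific values used are those of Example 3.9 below:

    from fractions import Fraction

    # Elementary-event probabilities on Omega = {1, 2, 3}
    p = {1: Fraction(3, 4), 2: Fraction(3, 20), 3: Fraction(1, 10)}

    def P(event):
        """Additive set function: P(A) is the sum of p(w) over outcomes w in A."""
        return sum(p[w] for w in event)

    omega = set(p)
    A = {1, 2}
    A_star = omega - A                    # complement of A

    print(P(omega) == 1)                  # Axiom 2: P(Omega) = 1
    print(P(A) >= 0)                      # Axiom 1: non-negativity
    print(P(A) + P(A_star) == 1)          # Corollary 1: P(A*) = 1 - P(A)
    print(P(set()) == 0)                  # Corollary 2: P(empty set) = 0
    print(P({1}) <= P(A))                 # Corollary 3: {1} is a subset of A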
3.2.4 Final Considerations

Thus far, in assembling the machinery for dealing with random phenomena by characterizing the aggregate ensemble of all possible outcomes, we have encountered the sample space Ω, whose elements are all the possible outcomes of an experiment; we have presented events as collections of these outcomes (and hence subsets of Ω); and finally P(.), the probability set function defined on subsets of Ω, allows the axiomatic definition of the probability of an event. What we need next is a method for actually obtaining any particular probability P(A) once the event A has been defined. Before we can do this, however, for completeness, a few final considerations are in order.
Even though, as presented, events are subsets of Ω, not all subsets of Ω are events. There are all sorts of subtle mathematical reasons for this, including the (somewhat unsettling) case in which Ω consists of infinitely many elements, as is the case when the outcome is a continuous entity and can therefore take on values on the real line. In this case, clearly, Ω is the set of all real numbers. A careful treatment of these issues requires the introduction of Borel fields (see, for example, Kingman and Taylor, 1966, Chapter 1¹). This is necessary because, as the reader may have anticipated, the calculus of probability requires making use of set operations, unions and intersections, as well as sequences and limits of events. As a result, it is important that sets resulting from such operations are themselves events. This is strictly true of Borel fields.
Nevertheless, for all practical purposes, and most practical applications, it is often not necessary to distinguish between the subsets of Ω and genuine events. The reader willing to accept this end result on faith may safely proceed without concern for these subtleties.

¹Kingman, J.F.C. and Taylor, S.J., Introduction to the Theory of Measure and Probability, Cambridge University Press, 1966.
3.3 Probability

3.3.1 The Calculus of Probability

Once the sample space Ω for any random experiment has been specified and the events (subsets of the sample space) identified, the following is the procedure for determining the probability of any event A, based on the important property that elementary events are mutually exclusive: to each elementary outcome di of Ω, assign a probability pi, with Σ_i pi = 1; the probability of any event is then the sum of the probabilities of the elementary outcomes it comprises. Thus, for

A = {d1, d2, d4}    (3.24)

we have

P(A) = p1 + p2 + p4    (3.25)

and for the complementary event

B = {d3, d5, ..., dN}    (3.26)

then

P(B) = 1 − p1 − p2 − p4    (3.27)

The following examples illustrate how probabilities pi may be assigned to elementary events.
Example 3.8 ASSIGNMENTS FOR EQUIPROBABLE OUTCOMES
The experiment of tossing a coin 3 times and recording the observed number of heads and tails was considered in Examples 3.1 and 3.2. There the sample space was obtained as:

Ω = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT},    (3.28)

a set with 8 elements that comprise all the possible outcomes of the experiment. Several events associated with this experiment were identified in Example 3.2.
If there is no reason for any one of the 8 possible outcomes to be any more likely to occur than any other one, the outcomes are said to be equiprobable, and we assign a probability of 1/8 to each one. This gives rise to the following equiprobable assignment of probability to the 8 elementary events:
P(E1) = P{HHH} = 1/8
P(E2) = P{HHT} = 1/8
P(E3) = P{HTH} = 1/8
  ...
P(E7) = P{TTH} = 1/8
P(E8) = P{TTT} = 1/8    (3.29)

Note that

pi = P(Ei) = 1/8; i = 1, 2, ..., 8    (3.30)

so that

Σ_{i=1}^{8} P(Ei) = 1    (3.31)
The event A identified in Example 3.2 (the event that exactly 2 heads are observed) consists of three elementary events, E2, E3 and E4, so that

A = E2 ∪ E3 ∪ E4,    (3.32)

and, by the additivity of P(.) over mutually exclusive events,

P(A) = P(E2) + P(E3) + P(E4) = 3/8    (3.33)
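A brute-force enumeration confirms this result. The short Python sketch below (not part of the original text) generates the 8 equiprobable outcomes and sums the elementary probabilities of the outcomes making up A:

    from itertools import product

    # All 8 outcomes of 3 coin tosses, each assigned probability 1/8
    omega = [''.join(t) for t in product('HT', repeat=3)]
    p = 1 / len(omega)

    A = [w for w in omega if w.count('H') == 2]   # exactly 2 heads observed
    print(len(omega), p)                          # 8 outcomes, 1/8 each
    print(sum(p for _ in A))                      # P(A) = 3/8, as in Eq (3.33)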
Other means of probability assignment are possible, as illustrated by the following example.

Example 3.9 ALTERNATIVE ASSIGNMENTS FROM A PRIORI KNOWLEDGE
Consider the manufacturing example discussed in Examples 3.3 and 3.4. Suppose that a priori knowledge, accumulated from historical process test data, indicates that 75% of batches meet or exceed the quality requirement, 15% barely miss it, and 10% fail completely; the elementary events are then assigned the probabilities:

P(E1) = 0.75    (3.34)
P(E2) = 0.15    (3.35)
P(E3) = 0.10    (3.36)

and the probabilities of the compound events follow from additivity:

P(E4) = P(E1) + P(E2) = 0.90    (3.37)
P(E5) = P(E1) + P(E3) = 0.85    (3.38)
P(E6) = P(E2) + P(E3) = 0.25    (3.39)
3.3.2 Implications

FIGURE 3.2: Classification of the 50 students: 12 graduate students (G) and 38 undergraduates (U), of whom 8 and 10, respectively, are chemistry students.
3.4 Conditional Probability

3.4.1 Illustrating the Concept

Consider a chemical engineering thermodynamics class consisting of 50 students in total, of which 38 are undergraduates and the rest are graduate students. Of the 12 graduate students, 8 are chemistry students; of the 38 undergraduates, 10 are chemistry students. We may define the following sets:

Ω, the (universal) set of all students (50 elements);
G, the set of graduate students (12 elements);
C, the set of chemistry students (18 elements).

Note that the set G ∩ C, the set of graduate chemistry students, contains 8 elements. (See Fig 3.2.)
We are interested in the following problem: select a student at random; given that the choice results in a chemistry student, what is the probability that she/he is a graduate student? This is a problem of finding the probability of the occurrence of an event conditioned upon the prior occurrence of another one.
3.4.2 Formalizing the Concept

For two events A and B defined on a sample space Ω, the conditional probability of A, given that B has occurred, is defined as:

P(A|B) = P(A ∩ B)/P(B)    (3.40)

Observe that every unconditional probability may be regarded as a probability conditioned on the certain event Ω, since

P(A) = P(A|Ω) = P(A ∩ Ω)/P(Ω) = P(A)/1    (3.41)

Returning now to the previous illustration, we see that the required quantity is P(G|C), and by definition,

P(G|C) = P(G ∩ C)/P(C) = (8/50)/(18/50) = 8/18    (3.42)
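The following minimal Python sketch (not from the original text) reproduces this computation directly from the stated counts:

    # Counts from the student illustration
    n_total, n_grad_chem, n_chem = 50, 8, 18

    P_C = n_chem / n_total                  # P(C)
    P_G_and_C = n_grad_chem / n_total       # P(G and C)
    P_G_given_C = P_G_and_C / P_C           # definition, Eq (3.40)
    print(P_G_given_C, 8 / 18)              # both 0.444..., i.e. 8/18, Eq (3.42)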
FIGURE 3.4: Venn diagram of the sets A and B, showing the partitioning of A into the disjoint sets A ∩ B and A ∩ B*.

It follows directly from this definition that

P(A ∩ B) = P(A|B)P(B)    (3.43)

and, by symmetry,

P(A ∩ B) = P(B|A)P(A)    (3.44)
3.4.3 Total Probability

It is possible to obtain total probabilities when only conditional probabilities are available. We now present some very important results relating conditional probabilities to total probability.
Consider events A and B, not necessarily disjoint. From the Venn diagram in Fig 3.4, we may write A as the union of two disjoint sets as follows:

A = (A ∩ B) ∪ (A ∩ B*)    (3.45)

In words, this expression states that the points in A are made up of two groups: the points in A that are also in B, and the points in A that are not in B. And because the two sets are disjoint, so that the events they represent are mutually exclusive, we have:

P(A) = P(A ∩ B) + P(A ∩ B*)    (3.46)
FIGURE 3.5: Partitioning of the set A by the k disjoint sets B1, B2, B3, ..., Bk into the regions A ∩ B1, A ∩ B2, A ∩ B3, ..., A ∩ Bk.

This result generalizes as follows. Let the k sets B1, B2, ..., Bk be mutually exclusive, i.e.,

Bi ∩ Bj = ∅; i ≠ j    (3.47)

and exhaustive, i.e.,

B1 ∪ B2 ∪ B3 ∪ ... ∪ Bk = Ω    (3.48)

or, alternatively,

Ω = ⋃_{i=1}^{k} Bi    (3.49)

Then A = A ∩ Ω may be written as:

A = (A ∩ B1) ∪ (A ∩ B2) ∪ ... ∪ (A ∩ Bk)    (3.50)

which is a partitioning of the set A as a union of k disjoint sets (see Fig 3.5). As a result,

P(A) = P(A ∩ B1) + P(A ∩ B2) + ... + P(A ∩ Bk)    (3.51)

but since

P(A ∩ Bi) = P(A|Bi)P(Bi)    (3.52)

we immediately obtain

P(A) = P(A|B1)P(B1) + P(A|B2)P(B2) + ... + P(A|Bk)P(Bk)    (3.53)

Thus:

P(A) = Σ_{i=1}^{k} P(A|Bi)P(Bi)    (3.54)
an expression that is sometimes referred to as the theorem of total probability, used to compute the total probability P(A) from P(A|Bi) and P(Bi). The following example provides an illustration.
Example 3.10 TOTAL PROBABILITY
A company manufactures light bulbs of 3 different types (T1, T2, T3), some of which are defective right from the factory. From experience with the manufacturing process, it is known that the fraction of defective Type 1 bulbs is 0.1; Types 2 and 3 have respective defective fractions of 1/15 and 0.2.
A batch of 200 bulbs was sent to a quality control laboratory for testing: 100 Type 1, 75 Type 2, and 25 Type 3. What is the probability of finding a defective bulb?
Solution:
The supplied information may be summarized as follows: prior conditional probabilities of defectiveness,

P(D|T1) = 0.1; P(D|T2) = 1/15; P(D|T3) = 0.2;    (3.55)

and the number distribution of bulb types in the batch,

n(T1) = 100; n(T2) = 75; n(T3) = 25    (3.56)

Assuming equiprobable outcomes, this number distribution immediately implies the following:

P(T1) = 100/200 = 0.5; P(T2) = 0.375; P(T3) = 0.125    (3.57)

From the theorem of total probability, Eq (3.54), the required probability of finding a defective bulb is therefore:

P(D) = 0.1 × 0.5 + (1/15) × 0.375 + 0.2 × 0.125 = 0.1    (3.58)
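The computation, together with the "reversed" probabilities obtained via Bayes' Rule (introduced in the next subsection), may be sketched in Python as follows (not part of the original text):

    # Supplied conditional probabilities P(D|Ti) and type probabilities P(Ti)
    P_D_given_T = {1: 0.1, 2: 1 / 15, 3: 0.2}
    P_T = {1: 100 / 200, 2: 75 / 200, 3: 25 / 200}

    # Theorem of total probability, Eq (3.54)
    P_D = sum(P_D_given_T[i] * P_T[i] for i in P_T)
    print(P_D)                                   # 0.1, as in Eq (3.58)

    # Bayes' Rule, Eq (3.61): probability a defective bulb is of each type
    for i in P_T:
        print(i, P_D_given_T[i] * P_T[i] / P_D)  # P(T1|D)=0.5, P(T2|D)=0.25, P(T3|D)=0.25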
3.4.4 Bayes' Rule

From the definition of conditional probability, the probability of the conditioning event Bi, given that A has occurred, is:

P(Bi|A) = P(Bi ∩ A)/P(A)    (3.59)

but

P(Bi ∩ A) = P(A ∩ Bi) = P(A|Bi)P(Bi)    (3.60)

which, when substituted into (3.59), gives rise to a very important result:

P(Bi|A) = P(A|Bi)P(Bi) / [Σ_{j=1}^{k} P(A|Bj)P(Bj)]    (3.61)

This famous result, due to the Revd. Thomas Bayes (1763), is known as Bayes' Rule, and we will encounter it again in subsequent chapters. For now, it is an expression that can be used to compute the (unknown) a posteriori probability P(Bi|A) of events Bi from the a priori probabilities P(Bi) and the (known) conditional probabilities P(A|Bi). It indicates that the unknown a posteriori probability is proportional to the product of the a priori probability and the known conditional probability we wish to "reverse"; the constant of proportionality is the reciprocal of the total probability of event A.
This result is the basis of an alternative approach to data analysis (discussed in Section 14.6 of Chapter 14) wherein available prior information is incorporated in a systematic fashion into the analysis of experimental data.
3.5 Independence

For two events A and B, the conditional probability P(A|B) was defined earlier in Eq (3.40). In general, this conditional probability will be different from the unconditional probability P(A), indicating that the knowledge that B has occurred affects the probability of the occurrence of A.
However, when the occurrence of B has no effect on the occurrence of A, the events A and B are said to be independent, and

P(A|B) = P(A)    (3.62)

so that the conditional and unconditional probabilities are identical. This will occur when

P(A ∩ B)/P(B) = P(A)    (3.63)

so that

P(A ∩ B) = P(A)P(B)    (3.64)

Thus, when events A and B are independent, the probability of the two events happening concurrently is the product of the probabilities of each one occurring by itself. Note that the expression in Eq (3.64) is symmetric in A and B, so that if A is independent of B, then B is also independent of A.
This is another in the collection of very important results used in the analysis of random phenomena. The concept extends to more than two events: three events A, B and C are mutually independent if

P(A ∩ B) = P(A)P(B)    (3.65)
P(B ∩ C) = P(B)P(C)    (3.66)
P(A ∩ C) = P(A)P(C)    (3.67)

and, in addition,

P(A ∩ B ∩ C) = P(A)P(B)P(C)    (3.68)
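The requirement of Eq (3.68), over and above Eqs (3.65)-(3.67), is not redundant. The following Python sketch (not from the original text) uses a classical two-coin construction in which three events are pairwise independent yet fail the triple-product condition:

    from itertools import product

    omega = list(product('HT', repeat=2))            # 4 equiprobable outcomes
    P = lambda ev: sum(1 for w in omega if ev(w)) / len(omega)

    A = lambda w: w[0] == 'H'                        # first coin shows heads
    B = lambda w: w[1] == 'H'                        # second coin shows heads
    C = lambda w: (w[0] == 'H') != (w[1] == 'H')     # exactly one head

    both = lambda e1, e2: (lambda w: e1(w) and e2(w))
    print(P(both(A, B)) == P(A) * P(B))              # True  (Eq 3.65)
    print(P(both(B, C)) == P(B) * P(C))              # True  (Eq 3.66)
    print(P(both(A, C)) == P(A) * P(C))              # True  (Eq 3.67)
    all3 = lambda w: A(w) and B(w) and C(w)
    print(P(all3) == P(A) * P(B) * P(C))             # False (Eq 3.68 fails)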
3.6 Summary and Conclusions

This chapter has been primarily concerned with assembling the machinery of probability from the building blocks of events in the sample space Ω, the collection of all possible randomly varying outcomes of an experiment. We have seen how the probability of an event A arises naturally from the probability set function, an additive set function defined on the set Ω that satisfies the three axioms of Kolmogorov.
Having established the concept of probability and how the probability of any subset of Ω can be computed, a straightforward extension to special events restricted to conditioning sets in Ω led to the related concept of conditional probability. The idea of total probability, the result known as Bayes' Rule, and especially the concept of independence all arise naturally from conditional probability and have profound consequences for random phenomena analysis that cannot be fully appreciated until much later.
We note in closing that the presentation of probability in this chapter (especially as a tool for solving problems involving randomly varying phenomena) is still quite rudimentary because the development is not quite complete yet. The final step in the development of the probability machinery, undertaken primarily in the next chapter, requires the introduction of the random variable, X, from which the analysis tool, the probability distribution function, f(x), emerges and is fully characterized.
Here are some of the main points of the chapter again:

Events, as subsets of the sample space, Ω, can be elementary (simple) or compound (complex); if elementary, then they are mutually exclusive; if compound, then they can be composed from several simple events.
The conditional probability of event A, given that event B has occurred, is P(A|B) = P(A ∩ B)/P(B).
REVIEW QUESTIONS
1. What are the five basic building blocks of probability theory as presented in Section 3.1? Define each one.
2. What is a simple (or elementary) event, and how is it different from a complex (or compound) event?
3. Why are elementary events mutually exclusive?
4. What is the relationship between events and the sample space?
5. In the language of events, what does the empty set, ∅, represent? What does the entire sample space, Ω, represent?
6. Given two sets A and B, in the language of events, what do the following sets represent: A*; A ∪ B; and A ∩ B?
7. What does it mean that two events A and B are mutually exclusive?
8. What is a set function in general, and what is an additive set function in particular?
9. What are the three fundamental properties of a probability set function (also known as Kolmogorov's axioms)?
10. How is the probability of any event A determined from the elementary events in Ω?
11. For any two sets A and B, what is the definition of P(A|B), the conditional probability of A given B? If the two sets are disjoint such that A ∩ B = ∅, in words, what does P(A|B) mean in this case?
12. How does one obtain total probability from partial (i.e., conditional) probabilities?
13. What is Bayes' Rule and what is it used for?
14. Given P(A|Bi) and P(Bi), how does one "reverse the probability" to determine P(Bi|A)?
15. What does it mean for two events A and B to be independent?
16. What is P(A ∩ B) when two events A and B are (i) mutually exclusive, and (ii) independent?
EXERCISES
Section 3.1
3.1 When two dice (one black with white dots, the other white with black dots) are tossed once, simultaneously, and the numbers of dots shown on each die's top face after coming to rest are recorded as an ordered pair (nB, nW), where nB is the number on the black die, and nW the number on the white die,
(i) identify the experiment, what constitutes a trial, the outcomes, and the sample space.
(ii) If the sum of the numbers on the two dice is S, i.e.,

S = nB + nW,    (3.69)
disapprove, so that the outcome of one such opinion sample is the ordered triplet (n0, n1, n2). Write mathematical expressions in terms of the numbers n0, n1, and n2 for the following events:
(i) A = {Unanimous support for the policy}; and A*, the complement of A.
(ii) B = {More students disapprove than approve}; and B*.
(iii) C = {More students are indifferent than approve};
(iv) D = {The majority of students are indifferent}.
Section 3.2
3.3 Given the following two sets A and B:

A = {x : x = 1, 3, 5, 7, ...}    (3.70)
B = {x : x = 0, 2, 4, 6, ...}    (3.71)

find A ∪ B and A ∩ B.
3.4 Let Ak = {x : 1/(k + 1) ≤ x ≤ 1} for k = 1, 2, 3, .... Find the set B defined by:

B = A1 ∪ A2 ∪ A3 ∪ ... = ⋃_{k=1}^{∞} Ak    (3.72)
3.5 For sets A, B, C, subsets of the universal set Ω, establish the following identities:

(A ∪ B)* = A* ∩ B*    (3.73)
(A ∩ B)* = A* ∪ B*    (3.74)
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)    (3.75)
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)    (3.76)
3.6 For every pair of sets A, B, subsets of the sample space Ω upon which the probability set function P(.) has been defined, prove that:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)    (3.77)
3.7 In a certain engineering research and development company, apart from the support staff, who number 25, all other employees are either engineers or statisticians or both. The total number of employees (including the support staff) is 100. Of these, 50 are engineers and 40 are statisticians; the number of employees that are both engineers and statisticians is not given. Find the probability that an employee chosen at random is not one of those classified as being both an engineer and a statistician.
Section 3.3
3.8 For every set A, let the set function Q(.) be defined as follows:

Q(A) = Σ_{x∈A} f(x)    (3.78)

where

f(x) = (1/3)(2/3)^x; x = 0, 1, 2, ...    (3.79)
evaluate P(A), P(A*), and P(A ∪ A*).
3.10 For the experiment of rolling two dice (one black with white dots, the other white with black dots) once, simultaneously, presented in Exercise 3.1, first obtain Ω, the sample space, and, by assigning equal probability to each of the outcomes, determine the probability of the following events:
(i) A = {nB + nW = 7}, i.e., the sum is 7;
(ii) B = {nB < nW};
(iii) B*, the complement of B;
(iv) C = {nB = nW}, i.e., the two dice show the same number;
(v) D = {nB + nW = 5 or 9}.
3.11 A black velvet bag contains three red balls and three green balls. Each experiment involves drawing two balls at once, simultaneously, and recording their colors, R for red, and G for green.
(i) Obtain the sample space, assuming that balls of the same color are indistinguishable.
(ii) Upon assigning equal probability to each element in the sample space, determine the probability of drawing two balls of different colors.
(iii) If the balls are distinguishable and numbered from 1 to 6, and if the two balls are drawn sequentially, not simultaneously, now obtain the sample space, and from this determine the probability of drawing two balls of different colors.
3.12 An experiment is performed by selecting a card from an ordinary deck of 52 playing cards. The outcome, ω, is the type of card chosen, classified as: "Ace," "King," "Queen," "Jack," and "others." The random variable X(ω) assigns the number 4 to the outcome ω if ω is an "Ace"; X(ω) = 3 if the outcome is a "King"; X(ω) = 2 if the outcome is a "Queen"; X(ω) = 1 if the outcome is a "Jack"; and X(ω) = 0 for all other outcomes.
(i) What is the space V of this random variable?
(ii) If the probability set function P(.) defined on the subsets of the original sample space Ω assigns a probability 1/52 to each of these outcomes, describe the probability set function PX(A) induced on all the subsets of the space V by this random variable.
(iii) Describe a physical (scientific or engineering) problem for which the above would be a good surrogate model.
3.13 Obtain the sample space, Ω, for the experiment involving tossing a fair coin 4 times. Upon assigning equal probability to each outcome, determine the probabilities of obtaining 0, 1, 2, 3, or 4 heads. Confirm that your result is consistent with the postulate that the probability model for this phenomenon is given by the probability distribution function:

f(x) = [n!/(x!(n − x)!)] p^x (1 − p)^(n−x)    (3.81)

where f(x) is the probability of obtaining x heads in n = 4 tosses, and p = 1/2 is the probability of obtaining a head in a single toss of the coin. (See Chapter 8.)
3.14 In the fall of 2007, k students born in 1989 attended an all-freshman introductory general engineering class at the University of Delaware. Confirm that if p is the probability that at least two of the students have the same birthday, then:

1 − p = 365! / [(365 − k)! (365)^k]    (3.82)

Show that for a class with 23 or more students born in 1989, the probability of at least 2 students sharing the same birthday is more than 1/2, i.e., if k ≥ 23, then p > 1/2.
Sections 3.4 and 3.5
3.15 Six simple events, with probabilities P(E1) = 0.11; P(E2) = P(E5) = 0.20; P(E3) = 0.25; P(E4) = 0.09; and P(E6) = 0.15, constitute the entire set of outcomes of an experiment. The following events are of interest:

A = {E1, E2}; B = {E2, E3, E4}; C = {E5, E6}; D = {E1, E2, E5}

Determine the following probabilities:
(i) P(A), P(B), P(C), P(D);
(ii) P(A ∪ B), P(A ∩ B); P(A ∪ D), P(A ∩ D); P(B ∪ C), P(B ∩ C);
(iii) P(B|A), P(A|B); P(B|C), P(D|C).
Which of the events A, B, C and D are mutually exclusive?
3.16 Assuming that giving birth to a boy or a girl is equally likely, and further, that no multiple births have occurred, first determine the probability of a family having three boys in a row. Now consider the conjecture (based on empirical data) that, for a family that has already had two boys in a row, the probability of having a third boy is 0.8. Under these conditions, what is now the probability of a family having three boys in a row?
3.17 As a follow-up to the concept of independence of two events A and B, event A is said to be "attracted" to event B if

P(A|B) > P(A)    (3.83)

and "repelled" by B if

P(A|B) < P(A)    (3.84)

(Of course, when P(A|B) = P(A), the two events have been previously identified as independent.) Establish the result that if B attracts A, then: (i) A attracts B; and (ii) B* repels A.
APPLICATION PROBLEMS
3.23 Patients suffering from manic depression and other similar disorders are sometimes treated with lithium, but the dosage must be monitored carefully because lithium toxicity, which is often fatal, can be difficult to diagnose. A new assay used to determine lithium concentration in blood samples is being promoted as a reliable way to diagnose lithium toxicity because the assay result is purported to correlate very strongly with toxicity.
A careful study of the relationship between this blood assay and lithium toxicity in 150 patients yielded results summarized in Table 3.3. Here A+ indicates high lithium concentrations in the blood assay and A− indicates low lithium concentration; L+ indicates confirmed lithium toxicity and L− indicates no lithium toxicity.
(i) From these data, compute the following probabilities regarding the lithium toxicity status of a patient chosen at random:
TABLE 3.3: Lithium toxicity study results

           Lithium Toxicity
Assay      L+      L−      Total
A+         30      17       47
A−         21      82      103
Total      51      99      150
1. P(L+), the probability that the patient has lithium toxicity (regardless of the blood assay result);
2. P(L+|A+), the conditional probability that the patient has lithium toxicity, given that the blood assay result indicates high lithium concentration. What does this value indicate about the potential benefit of having this assay result available?
3. P(L+|A−), the conditional probability that the patient has lithium toxicity, given that the blood assay result indicates low lithium concentration. What does this value indicate about the potential for missed diagnoses?
(ii) Compute the following probabilities regarding the blood lithium assay:
1. P(A+), the (total) probability of observing high lithium blood concentration (regardless of actual lithium toxicity status);
2. P(A+|L+), the conditional probability that the blood assay result indicates high lithium concentration, given that the patient indeed has lithium toxicity. Why do you think that this quantity is referred to as the "sensitivity" of the assay, and what does the computed value indicate about the sensitivity of the particular assay in this study?
3. From information about P(L+) (as the prior probability of lithium toxicity), along with the just-computed values of P(A+) and P(A+|L+) as the relevant assay results, now use Bayes' Rule to compute P(L+|A+) as the posterior probability of lithium toxicity after obtaining assay data, even though it has already been computed directly in (i) above.
3.24 An experimental crystallizer produces five different polymorphs of the same crystal via mechanisms that are currently not well understood. Types 1, 2 and 3 are approved for pharmaceutical application A; Types 2, 3 and 4 for a different application B; Type 5 is mostly unstable and has no known application. How much of each type is made in any batch varies randomly, but with the current operating procedure, 30% of the total product made by the crystallizer in a month is of Type 1; 20% is of Type 2, with the same percentage for Types 3 and 4; and 10% is of Type 5. Assuming that the polymorphs can be separated without loss,
(i) Determine the probability of making product in a month that can be used for application A;
(ii) Given a batch ready to be shipped for application B, what is the probability that any crystal selected at random is of Type 2? What is the probability that it is of Type 3 or Type 4? State any assumptions you may need to make.
(iii) What is the probability that an order change to one for application A can be filled from a batch ready to be shipped for application B?
(iv) What is the converse probability that an order change to one for application B can be filled given a batch that is ready to be shipped for application A?
3.25 A test for a relatively rare disease involves taking from the patient an appropriate tissue sample which is then assessed for abnormality. A few sources of error are associated with this test. First, there is a small, but non-zero, probability, αs, that the tissue sampling procedure will miss abnormal cells, primarily because these cells (at least in the earlier stages), being relatively few in number, are randomly distributed in the tissue and tend not to cluster. In addition, during the examination of the tissue sample itself, there is a probability, αf, of failing to identify an abnormality when present, and a probability, αm, of misclassifying a perfectly normal cell as abnormal.
If the proportion of the population with this disease who are subjected to this test is θD,
(i) In terms of the given parameters, determine the probability that the test result is correct. (Hint: first compute the probability that the test result is incorrect, keeping in mind that the test may identify an abnormal cell incorrectly as normal, or a normal cell as abnormal.)
(ii) Determine the probability of a "false positive" (i.e., returning an abnormality result when none exists).
(iii) Determine the probability of a "false negative" (i.e., failing to identify an abnormality that is present).
3.26 Repeat Problem 3.25 for the specific values αs = 0.1; αf = 0.05; αm = 0.1, for a population in which 2% have the disease. A program sponsored by the Centers for Disease Control (CDC) is to be aimed at reducing the number of false positives and/or false negatives by reducing one of the three probabilities αs, αf, and αm. Which of these parameters would you recommend and why?
3.27 A manufacturer of flat-screen TVs purchases pre-cut glass sheets from three different manufacturers, M1, M2 and M3, whose products are characterized in the TV manufacturer's incoming material quality control lab as "premier grade," Q1, "acceptable grade," Q2, and "marginal grade," Q3, on the basis of objective, measurable quality criteria, such as inclusions, warp, etc. Incoming glass sheets deemed unacceptable are rejected and returned to the manufacturer. An incoming batch of 425 accepted sheets has been classified by an automatic classifying system as shown in the table below.

                        Quality
Manufacturer   Premier Q1   Acceptable Q2   Marginal Q3   Total
M1                 110            25             15        150
M2                 150            33              2        185
M3                  76            13              1         90
(ii) Determine the probability that it is of "premier grade," given that it is from manufacturer M1; also determine the probability that it is of "premier grade," given that it is from either manufacturer M2 or M3.
(iii) Determine the probability that it is from manufacturer M3, given that it is of "marginal grade"; also determine the probability that it is from manufacturer M2, given that it is of "acceptable grade."
3.28 In a 1984 report², the IRS published the information shown in the following table regarding 89.9 million federal tax returns it received, the income bracket of the filers, and the percentage audited.

Income Bracket        Number of filers (millions)   Percent Audited
Below $10,000                   31.4                     0.34
$10,000–$24,999                 30.7                     0.92
$25,000–$49,999                 22.2                     2.05
$50,000 and above                5.5                     4.00

(i) Determine the probability that a tax filer selected at random from this population would be audited.
(ii) Determine the probability that a tax filer selected at random is in the $25,000–$49,999 income bracket and was audited.
(iii) If we know that a tax filer selected at random was audited, determine the probability that this person belongs in the $50,000-and-above income bracket.

²Annual Report of the Commissioner and Chief Counsel, Internal Revenue Service, U.S. Department of Treasury, 1984, p. 60.
Chapter 4
Random Variables and Distributions
Even though the machinery of probability as presented thus far can already be used to solve some practical problems, its development is far from complete. In particular, with a sample space of raw outcomes that can be anything from attributes and numbers to letters and other sundry objects, this most basic form of probability will be quite tedious and inefficient in dealing with general random phenomena. This chapter and the next one are devoted to completing the development of the machinery of probability with the introduction of the concept of the random variable, from which arises the probability distribution function, an efficient mathematical form for representing the ensemble behavior of general random phenomena. The emergence, properties and characteristics of the probability distribution function are discussed extensively in this chapter for single-dimensional random variables; the discussion is generalized to multi-dimensional random variables in the next chapter.
4.1

4.1.1

In general, the sample space Ω presented thus far may be quite tedious to describe and inefficient to analyze mathematically if its elements are not numbers. To facilitate mathematical analysis, it is desirable to find a means of converting this sample space into one with real numbers. This is achieved via the vehicle of the random variable, defined as follows:

Definition: Given a random experiment with sample space Ω, a random variable X is a function that assigns to each element ω ∈ Ω one, and only one, real number X(ω) = x.

Upon the introduction of this entity, X, the following happens (see Fig 4.1):
1. Ω is mapped onto V, i.e.,

V = {x : X(ω) = x, ω ∈ Ω}    (4.1)

so that V is the set of all values x generated from X(ω) = x for all elements ω in the sample space Ω;
2. The probability set function encountered before, P, defined on Ω, gives rise to another probability set function, PX, defined on V and induced by X. PX is therefore often referred to as an induced probability set function.
The role of PX in V is identical to that of P in Ω. Thus, for any arbitrary subset A of V, PX(A) is the probability of event A occurring.
The primary question of practical importance may now be stated as follows: How does one find PX(A) in the new setting created by the introduction of the random variable X, given the original sample space Ω and the original probability set function P defined on it?
The answer is to go back to what we know, i.e., to find that set A* which corresponds to the set of values of ω in Ω that are mapped by X into A, i.e.,

A* = {ω : ω ∈ Ω and X(ω) ∈ A}    (4.2)

FIGURE 4.1: The original sample space, Ω, and the corresponding space V induced by the random variable X.

Such a set A* is called the pre-image of A: that set on the original sample space from which A is obtained when X is applied on its elements (see Fig 4.1). We now simply define

PX(A) = P(A*)    (4.3)

since, by definition of A*,

P{X(ω) ∈ A} = P{ω ∈ A*}    (4.4)

from where we see how X induces PX(.) from the known P(.). It is easy to show that the induced PX is an authentic probability set function in the spirit of Kolmogorov's axioms.
Remarks:
1. The random variable is X; the value it takes is the real number x. The one is a completely different entity from the other.
2. The expression P(X = x) will be used to indicate "the probability that the application of the random variable X results in an outcome ω with assigned value x"; or, more simply, "the probability that the random variable X takes on a particular value x." As such, "X = x" should not be confused with the familiar arithmetic statement of equality or equivalence.
3. In many instances, the starting point is the space V and not the tedious sample space Ω, with PX(.) already defined, so that there is no further need for reference to a P(.) defined on Ω.
Example 4.1: For the three coin-toss experiment of Examples 3.1 and 3.2, define X as the total number of tails observed. (1) Obtain the random variable space V; and (2) obtain PX(A) for the event A that exactly 2 tails are observed.
Solution:
(1) The sample space is

Ω = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT} = {ω1, ω2, ..., ω8}    (4.5)

and X maps these eight outcomes, respectively, to the numbers

X(ω1) = 0; X(ω2) = X(ω3) = X(ω4) = 1; X(ω5) = X(ω6) = X(ω7) = 2; X(ω8) = 3    (4.6)

so that

V = {0, 1, 2, 3}    (4.7)

since these are all the possible values that X can take.
(2) To obtain PX(A) for the event A = {X = 2}, first we find A*, the pre-image of A in Ω. In this case,

A* = {ω5, ω6, ω7}    (4.8)

so that, upon recalling the probability set function P(.) generated in Chapter 3 on the assumption of equiprobable outcomes, we obtain P(A*) = 3/8; hence,

PX(A) = P(A*) = 3/8    (4.9)
The next two examples illustrate sample spaces that occur naturally in the
form of V .
Example 4.2 SAMPLE SPACE FOR SINGLE DIE TOSS EXPERIMENT
Consider an experiment in which a single die is thrown and the outcome is the number that shows on the die's top face when it comes to rest. Obtain the sample space of all possible outcomes.
Solution:
The required sample space is the set {1, 2, 3, 4, 5, 6}, since this set of numbers is an exhaustive collection of all the possible outcomes of this experiment. Observe that this is a set of real numbers, so that it is already in the form of V. We can therefore define a probability set function directly on it, with no further need to obtain a separate V and an induced PX(.).
Similarly, for the experiment of Exercise 3.1, in which two dice are tossed and the outcome is the ordered pair (nB, nW), defining the random variable

X = nB + nW    (4.10)

gives the random variable space

V = {x : x = 2, 3, ..., 12}    (4.11)

As an exercise (see Exercise 4.7), the reader should compute the probability PX(A) of the event A that X = 7, assuming equiprobable outcomes for each die toss.
4.1.2 Practical Considerations
Practical Considerations
Rigor and precision are intrinsic to mathematics and mathematical analysis; without the former, the latter simply cannot exist. Such is the case with
the mathematical concept of the random variable as we have just presented
it: rigor demands that X be specied in this manner, as a function through
whose agency each element of the sample space of an experiment becomes
associated with an unambiguous numerical value. As illustrated in Fig 4.1, X
therefore appears as a mapping from one space, , that can contain all sorts
of raw objects, into one that is more conducive to mathematical analysis, V ,
containing only real numbers. Such a formal denition of the random variable
tends to appear sti, and almost sterile; and those encountering it for the rst
time may be unsure of what it really means in practice.
As a practical matter, the random variable may be considered (informally)
as an experimental outcome whose numerical value is subject to random variations with each exact replicate performance (trial) of the experiment. Thus,
for example, with the three coin-toss experiment discussed earlier, by specifying the outcome of interest as the total number of tails observed, we see
right away that the implied random variable can take on numerical values 0,
1, 2, or 3, even though the raw outcomes will consist of T s and Hs; also what
value the random variable takes is subject to random variation each time the
experiment is performed. In the same manner, we see that in attempting to
determine the temperature of an equilibrium mixture of ice and water, the observed temperature measurement in C takes on numerical values that vary
randomly around the number 0.
4.1.3

For experiments whose outcomes are continuous quantities, the random variable and its space are defined in precisely the same manner; for instance, when the outcome is a proportion, observe that the random variable space V in this case is given by:

V = {x : 0 ≤ x ≤ 1}.    (4.14)

Random variables may also be multi-dimensional. For the three coin-toss experiment, with X1 as the number of heads and X2 as the number of tails, note that the two component random variables X1 and X2 are not independent, since their sum, X1 + X2, by virtue of the experiment, is constrained to equal 3 always.
What is noted briefly here for two dimensions can be generalized to n dimensions, and the next chapter is devoted entirely to a discussion of multi-dimensional random variables.
4.2 Distributions

4.2.1 Discrete Random Variables
Let us return once more to Example 4.1 and, this time, for each element x of V, compute P(X = x), and denote this by f(x); i.e.,

f(x) = P(X = x)    (4.17)

From the results obtained in Example 4.1, we have:

f(0) = P(X = 0) = 1/8    (4.18)
f(1) = P(X = 1) = 3/8    (4.19)

Likewise,

f(2) = P(X = 2) = 3/8    (4.20)
f(3) = P(X = 3) = 1/8    (4.21)

This function, f(x), indicates how the probabilities are distributed over the entire random variable space.
Of importance also is a different, but related, function, F(x), defined as:

F(x) = P(X ≤ x)    (4.22)

the probability that the random variable X takes on values less than or equal to x. For the specific example under consideration, we have F(0) = P(X ≤ 0) = 1/8. As for F(1) = P(X ≤ 1): since the event A = {X ≤ 1} consists of two mutually exclusive elementary events, A0 = {X = 0} and A1 = {X = 1}, it then follows that:

F(1) = P(X ≤ 1) = P(X = 0) + P(X = 1) = 1/8 + 3/8 = 4/8    (4.23)

Similarly,

F(2) = P(X ≤ 2) = 7/8    (4.24)

and

F(3) = P(X ≤ 3) = 8/8 = 1    (4.25)
TABLE 4.1: f(x) and F(x) for the three coin-toss experiment of Example 4.1

x    f(x)    F(x)
0    1/8     1/8
1    3/8     4/8
2    3/8     7/8
3    1/8     8/8
The function f(x) is referred to as the probability distribution function (pdf), or sometimes as the probability mass function; F(x) is known as the cumulative distribution function, or sometimes simply as the distribution function.
Note, once again, that X can assume only a finite number of discrete values, in this case 0, 1, 2, or 3; it is therefore a discrete random variable, and both f(x) and F(x) are discrete functions. As shown in Fig 4.2, f(x) is characterized by non-zero "spikes" at the values x = 0, 1, 2 and 3, and F(x) by the indicated "staircase" form.
97
1.0
0.35
0.8
0.30
F(x)
f(x)
0.6
0.25
0.4
0.20
0.2
0.15
0.0
0.10
0.0
0.5
1.0
1.5
x
2.0
2.5
3.0
2
x
FIGURE 4.2: Probability distribution function, f (x), and cumulative distribution function, F (x), for 3-coin toss experiment of Example 4.1
Let x0 = 0, x1 = 1, x2 = 2, x3 = 3; then

P(X = xi) = f(xi) for i = 0, 1, 2, 3    (4.26)

with

f(xi) = 1/8 for x0 = 0; 3/8 for x1 = 1; 3/8 for x2 = 2; 1/8 for x3 = 3    (4.27)

and the two functions in Table 4.1 are related explicitly according to the following expression:

F(xi) = Σ_{j=0}^{i} f(xj)    (4.28)
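Eq (4.28) states that F is simply the running sum of f. The short Python sketch below (not from the original text) verifies this for Table 4.1:

    from fractions import Fraction
    from itertools import accumulate

    f = [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]
    F = list(accumulate(f))                # F(x_i) = sum of f(x_j) for j <= i

    for x, (fx, Fx) in enumerate(zip(f, F)):
        print(x, fx, Fx)                   # reproduces Table 4.1
    print(F[-1] == 1)                      # total probability is 1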
We may now also note the following about the function f(xi):

f(xi) > 0, ∀xi;
Σ_{i=0}^{3} f(xi) = 1

These ideas may now be generalized beyond the specific example used above.
Definition: Let there exist a sample space Ω (along with a probability set function, P, defined on its subsets), and a random variable X, with an attendant random variable space V. A function f defined on V such that:
1. f(x) ≥ 0, ∀x ∈ V;
2. Σ_x f(x) = 1, x ∈ V;
3. PX(A) = Σ_{x∈A} f(x), for A ⊆ V (and when A contains the single element xi, PX(X = xi) = f(xi))
is called the probability distribution function of the discrete random variable X.
4.2.2 Continuous Random Variables

Definition: The function f defined on the space V (whose elements consist of segments of the real line) such that:
1. f(x) ≥ 0, ∀x ∈ V;
2. f has at most a finite number of discontinuities in every finite interval;
3. the (Riemann) integral ∫_V f(x)dx = 1;
4. PX(A) = ∫_A f(x)dx, for A ⊆ V
is called a probability density function of the continuous random variable X.
(The second point above, unnecessary for the discrete case, is a mathematical fine point needed to safeguard against pathological situations in which the integrals in question are not well-defined.)
The cumulative distribution function is related to the pdf by

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(u)du    (4.29)

from where we may now observe that when F(x) possesses a derivative,

dF(x)/dx = f(x)    (4.30)
This f(x) is the continuous counterpart of the discrete f(x) encountered earlier; but rather than express the probability that X takes on a particular point value xi (as in the discrete case), the continuous f(x) expresses a measure of the probability that X lies in the infinitesimal interval between xi and xi + dx. Observe, from item 4 in the definition given above, that:

P(xi ≤ X ≤ xi + dx) = ∫_{xi}^{xi+dx} f(x)dx    (4.31)

Also, from the definition of the cdf,

F(x + dx) = P(X ≤ x + dx)    (4.32)
          = P(X ≤ x) + P(x ≤ X ≤ x + dx)
          = F(x) + P(x ≤ X ≤ x + dx)    (4.33)

and therefore:

P(x ≤ X ≤ x + dx) = F(x + dx) − F(x)    (4.34)

which, upon introducing Eq (4.31) for the LHS, dividing by dx, and taking limits as dx → 0, yields:

lim_{dx→0} [F(x + dx) − F(x)]/dx = dF(x)/dx = f(x)    (4.35)

establishing Eq (4.30).
In general, we can use Eq (4.29) to establish that, for any arbitrary b ≥ a,

P(a ≤ X ≤ b) = ∫_a^b f(x)dx    (4.36)
For the sake of completeness, we note that F(x), the cumulative distribution function, is actually the more fundamental function for determining probabilities. This is because, regardless of whether X is continuous or discrete, F(.) can be used to determine all desired probabilities. Observe from the foregoing discussion that the expression

P(a1 < X ≤ a2) = F(a2) − F(a1)    (4.37)

holds in either case.
4.2.3
We have now seen that the pdf f(x) (or, equivalently, the cdf F(x)) is the function that indicates how the probabilities of occurrence of various outcomes and events arising from the random phenomenon in question are distributed over the entire space of the associated random variable X.
Let us return once more to the three coin-toss example: we understand that the random phenomenon in question is such that we cannot predict, a priori, the specific outcome of each experiment; but from the ensemble aggregate of all possible outcomes, we have been able to characterize, with f(x), the behavior of an associated random variable of interest, X, the total number of tails obtained in the experiment. (Note that other random variables could also be defined for this experiment: for example, the total number of heads, or the number of tosses until the appearance of the first head, etc.) What Table 4.1 provides is a complete description of the probability of occurrence for the entire collection of all possible events associated with this random variable, a description that can now be used to analyze the particular random phenomenon of the total number of tails observed when a coin is tossed three times.
For instance, the pdf f(x) indicates that, even though we cannot predict a specific outcome precisely, we now know that after each experiment, observing no tails (X = 0) is just as likely as observing all tails (X = 3), each with a probability of 1/8. Also, observing two tails is just as likely as observing one tail, each with a probability of 3/8, so that this latter group of events is three times as likely as the former group of events. Note the symmetry of the distribution of probabilities indicated by f(x) for this particular random phenomenon.
It turns out that these specific results can be generalized for the class of random phenomena to which the three coin-toss example belongs: a class characterized by n independent "trials," each of which has only two mutually exclusive outcomes, one occurring with probability p, the other with probability (1 − p), with X as the total number of occurrences of the outcome of interest. For this class, the pdf is:

f(x) = [n!/(x!(n − x)!)] p^x (1 − p)^(n−x); x = 0, 1, 2, ..., n    (4.38)
The results in Table 4.1 are obtained for the special case n = 3; p = 0.5. Such functions as these provide convenient and compact mathematical representations of the desired ensemble behavior of random variables; they constitute the centerpiece of the probabilistic framework, the fundamental tool used for analyzing random phenomena.
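The following Python sketch (not from the original text) evaluates Eq (4.38) for n = 3 and p = 0.5 and recovers the f(x) column of Table 4.1:

    from math import comb

    n, p = 3, 0.5
    f = lambda x: comb(n, x) * p**x * (1 - p)**(n - x)   # Eq (4.38)

    for x in range(n + 1):
        print(x, f(x))        # 0.125, 0.375, 0.375, 0.125, i.e. 1/8, 3/8, 3/8, 1/8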
We have, in fact, already encountered in earlier chapters several actual pdfs for some real-world random variables. For example, we had stated in Chapter 1 (thus far without justification) that the continuous random variable representing the yield obtained from the example manufacturing processes has the pdf:

f(x) = [1/(σ√(2π))] e^{−(x−μ)²/(2σ²)}; −∞ < x < ∞    (4.39)

We are able to use this pdf to compute the probabilities of obtaining yields in various intervals on the real line for the two contemplated processes, once the parameters μ and σ are specified for each process.
We had also stated in Chapter 1 that, for the (discrete) random variable X representing the number of inclusions found on the manufactured glass sheet, the pdf is:

f(x) = (e^{−λ} λ^x)/x!; x = 0, 1, 2, ...    (4.40)

from which, again, given a specific value for the parameter λ, we are able to compute the probabilities of finding any given number of inclusions on any selected glass sheet. And in Chapter 2, we showed, using chemical engineering principles, that the pdf for the (continuous) random variable X, representing the residence time in an ideal CSTR, is given by:

f(x) = (1/τ) e^{−x/τ}; 0 < x < ∞    (4.41)
These pdfs are all ideal models of the random variability associated with each of the random variables in question; they make possible rigorous and precise mathematical analyses of the ensemble behavior of the respective random phenomena. Such mathematical representations are systematically derived for actual, specific real-world phenomena of practical importance in Part III, where the resulting pdfs are also discussed and analyzed extensively.
The rest of this chapter is devoted to taking a deeper look at the fundamental characteristics and general properties of the pdf, f(x), for single-dimensional random variables; the next chapter is devoted to a parallel treatment for multi-dimensional random variables.
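As a small illustration of how such a pdf is used, the Python sketch below (not part of the original text) evaluates the inclusions pdf of Eq (4.40); λ = 2 is an arbitrary illustrative parameter value:

    import math

    lam = 2.0                                              # assumed parameter
    f = lambda x: math.exp(-lam) * lam**x / math.factorial(x)   # Eq (4.40)

    print(f(0), f(1), f(2))                  # P(X = 0), P(X = 1), P(X = 2)
    print(sum(f(x) for x in range(50)))      # sums (numerically) to 1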
4.3 Mathematical Expectation

We begin our investigations into the fundamental characteristics of a random variable, X, and its pdf, f(x), with one of the most important: the mathematical expectation, or expected value. As will soon become clear, the concept of expectations of random variables (or of functions of random variables) is of significant practical importance; but before giving a formal definition, we first provide a motivation and an illustration of the concept.
4.3.1

Consider a game where each turn involves a player drawing a ball at random from a black velvet bag containing 9 balls, identical in every way except that 5 are red, 3 are blue and one is green. The player receives $1.00 for drawing a red ball, $4.00 for a blue ball, and $10.00 for the green ball, but each turn at the game costs $4.00 to play. The question is: Is this game worth playing?
The primary issue, of course, is the random variation in the color of the drawn ball each time the game is played. Even though simple and somewhat artificial, this example provides a perfect illustration of how to solve problems involving random phenomena using the probabilistic framework.
To arrive at a rational decision regarding whether to play this game or not, we proceed as follows, noting first the following characteristics of the phenomenon in question:

Experiment: Draw a ball at random from a bag containing 9 balls composed as given above; note the color of the drawn ball, then replace the ball;
Outcome: The color of the drawn ball: R = Red; B = Blue; G = Green.
Probabilistic Model Development

TABLE 4.2: The pdf f(x) for the ball-drawing game

x     f(x)
1     5/9
4     3/9
10    1/9
From the problem definition, we see that the sample space is given by:

Ω = {R, R, R, R, R, B, B, B, G}    (4.42)

The random variable, X, is clearly the monetary value assigned to the outcome of each draw; i.e., in terms of the formal definition, X assigns the real number 1 to R, 4 to B, and 10 to G. (Informally, we could just as easily say that X is the amount of money received upon each draw.) The random variable space V is therefore given by:

V = {1, 4, 10}    (4.43)

And now, since there is no reason to think otherwise, we assume that each outcome is equally probable, in which case the probability distribution for the random variable X is obtained as follows:

PX(X = 1) = P(R) = 5/9    (4.44)
PX(X = 4) = P(B) = 3/9    (4.45)
PX(X = 10) = P(G) = 1/9    (4.46)

so that f(x), the pdf for this discrete random variable, is as shown in Table 4.2, or, mathematically, as:

f(xi) = 5/9 for x1 = 1; 3/9 for x2 = 4; 1/9 for x3 = 10; 0 otherwise    (4.47)

This is an ideal model of the random phenomenon underlying this game; it will now be used to analyze the problem and to decide rationally whether to play the game or not.
Using the Model
We begin by observing that this is a case where it is possible to repeat the
experiment a large number of times; in fact, this is precisely what the person
setting up the game wants each player to do: play the game repeatedly! Thus,
if the game is played a very large number of times, say n, it is reasonable from
the model to expect 5n/9 red ball draws, 3n/9 blue ball draws, and n/9 green
104
Random Phenomena
ball draws; the corresponding nancial returns will be $(5n/9), $(4 3n/9),
and $(10 n/9), respectively, in each case.
Observe now that after n turns at the game, we would expect the total
nancial returns in dollars, say Rn , to be:
Rn =
3n
n
5n
+4
+ 10
1
= 3n
9
9
9
(4.48)
TABLE 4.3: Expected results for n trials of the ball-drawing game

Ball     Expected # of times     Financial returns   Expected financial returns
Color    drawn (after n trials)  per draw            (after n trials)
Red      5n/9                    $1                  $5n/9
Blue     3n/9                    $4                  $12n/9
Green    n/9                     $10                 $10n/9
Total                                                $3n
In the meantime, the total cost, Cn, the amount of money, in dollars, paid out to play the game, would have been 4n. On the basis of these calculations, therefore, the expected net "gain" (in dollars) after n trials, Gn, is given by

Gn = Rn − Cn = −n    (4.49)

indicating a net loss of $n, so that the rational decision is not to play the game. (The house always wins!)
Eq (4.48) implies that the expected return per draw will be:

Rn/n = 1 × (5/9) + 4 × (3/9) + 10 × (1/9) = 3,    (4.50)

an expression that may be written in general terms as:

Rn/n = Σ_i xi f(xi)    (4.51)
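In Python, this probability-weighted sum may be sketched as follows (not part of the original text):

    # The pdf of Table 4.2, as a dictionary {x: f(x)}
    f = {1: 5/9, 4: 3/9, 10: 1/9}

    E_X = sum(x * fx for x, fx in f.items())   # Eq (4.51)
    print(E_X)                                 # 3.0: less than the $4.00 fee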
4.3.2

The considerations above motivate the following formal definition: the mathematical expectation, or expected value, E(X), of a random variable X is defined, for a discrete random variable, as:

E(X) = Σ_{x∈V} x f(x)    (4.53)

and, for a continuous random variable, as:

E(X) = ∫_V x f(x)dx    (4.54)

(1) For the three coin-toss experiment of Example 4.1, for instance,

E(X) = 0 × (1/8) + 1 × (3/8) + 2 × (3/8) + 3 × (1/8) = 1.5    (4.55)

indicating that, with this experiment, the expected, or average, number of tails per toss is 1.5, which makes perfect sense.
(2) The expected financial return for the ball-draw game is obtained formally from Eq (4.47) as:

E(X) = (1 × 5/9 + 4 × 3/9 + 10 × 1/9) = 3.0    (4.58)
The next example illustrates the computation for continuous random variables.

Example: (1) Find the expected value of the random variable X whose pdf is given by:

f(x) = x/2; 0 < x < 2
       0; otherwise    (4.59)

(2) Find the expected value of the random variable, X, the residence time in a CSTR, whose pdf f(x) is given in Eq (4.41).
Solution:
(1) First, we observe that Eq (4.59) is a legitimate pdf because

∫_{−∞}^{∞} f(x)dx = ∫_0^2 (1/2)x dx = (1/4)x²|₀² = 1    (4.60)

and, by definition,

E(X) = ∫_0^2 x f(x)dx = (1/2)∫_0^2 x² dx = (1/6)x³|₀² = 4/3    (4.61)

(2) In the case of the residence time,

E(X) = ∫_0^∞ x (1/τ)e^{−x/τ} dx = (1/τ)∫_0^∞ x e^{−x/τ} dx    (4.62)

which, upon integration by parts, yields:

E(X) = τ    (4.63)
An important property of the mathematical expectation of a random variable X is that, for any function of this random variable, say G(X),

E[G(X)] = Σ_i G(xi)f(xi) for discrete X; E[G(X)] = ∫ G(x)f(x)dx for continuous X    (4.64)
In particular, if

G(X) = c1X + c2    (4.65)

where c1 and c2 are constants, then, from Eq (4.64) above, in the discrete case we have:

E(c1X + c2) = Σ_i (c1xi + c2)f(xi) = c1 Σ_i xi f(xi) + c2 Σ_i f(xi) = c1E(X) + c2    (4.66)

so that:

E(c1X + c2) = c1E(X) + c2    (4.67)
4.4 Characterizing Distributions

4.4.1 Moments of a Distribution

Consider the function

G(X) = X^k    (4.68)

for any integer k. The expectation of this function is known as the kth (ordinary) moment of the random variable X (or, equivalently, the kth (ordinary) moment of the pdf, f(x)), defined by:

mk = E[X^k]    (4.69)

In particular, the first moment is the mean:

m1 = E[X] = μ    (4.70)

Thus, the expected value of X, E(X), is also the same as the first (ordinary) moment of X, the mean, μ.
for any constant value a and integer k. The expectation of this function is known as the kth moment of the random variable X about the point a (or, equivalently, the kth moment of the pdf, f(x), about the point a). Of particular interest are the moments about the mean value \mu, defined by:
\mu_k = E[(X - \mu)^k]    (4.72)
known as the central moments of the random variable X (or of the pdf, f(x)). Observe from here that \mu_0 = 1, and \mu_1 = 0, always, regardless of X or \mu; these therefore provide no particularly useful information regarding the characteristics of any particular X. However, provided that the conditions of absolute convergence and absolute integrability hold, the higher central moments exist and do in fact provide very useful information about the random variable X and its distribution.
Second Central Moment: Variance
Observe from above that the quantity
\mu_2 = E[(X - \mu)^2]    (4.73)
is the lowest central moment of the random variable X that contains any meaningful information about the average deviation of a random variable from its mean value. It is called the variance of X and is sometimes represented as \sigma^2(X). Thus,
\sigma^2(X) = Var(X) = E[(X - \mu)^2]    (4.74)
Note that, upon expanding (X - \mu)^2 and taking expectations term by term (using Eq (4.67)), one obtains the useful alternative expression:
\sigma^2 = E(X^2) - \mu^2    (4.75)
The positive square root of the variance, \sigma, is known as the standard deviation of the random variable.
The ratio of the standard deviation to the mean, known as the coefficient of variation,
C_v = \frac{\sigma}{\mu}    (4.80)
provides a dimensionless measure of the relative amount of variability displayed by the random variable.
Third Central Moment: Skewness
The third central moment,
\mu_3 = E[(X - \mu)^3]    (4.81)
is a measure of the asymmetry of the distribution; its dimensionless normalized form,
\gamma_3 = \frac{\mu_3}{\sigma^3}    (4.82)
known as the coefficient of skewness, is often the more commonly used measure precisely because it is dimensionless. For a perfectly symmetric distribution, negative deviations from the mean exactly counterbalance positive deviations, and both \mu_3 and \gamma_3 vanish.
When there are more values of X to the left of the mean than to the right (i.e., when negative deviations from the mean dominate), \mu_3 < 0 (as is \gamma_3), and the distribution is said to "skew left" or is negatively skewed. Such distributions will have long left tails, as illustrated in Fig 4.3. An example random variable with this characteristic is the gasoline-mileage (in miles per gallon) of cars in the US. While many cars get relatively high gas-mileage, there remain a few classes of cars (SUVs, Hummers, etc.) with gas-mileage much worse than the ensemble average. It is this latter class that contributes to the long left tail.
On the other hand, when there are more values of X to the right of the mean than to the left, so that positive deviations from the mean dominate, both \mu_3 and \gamma_3 are positive, and the distribution is said to "skew right" or is positively skewed. As one would expect, such distributions will have long right tails (see Fig 4.4). An example of this class of random variables is the household income/net-worth in the US. While the vast majority of household incomes/net-worths are moderate, the few truly super-rich whose incomes/net-worths are a few orders of magnitude larger than the ensemble average are responsible for the long right tail.
FIGURE 4.3: A negatively skewed (long left tail) distribution
FIGURE 4.4: A positively skewed (long right tail) distribution
FIGURE 4.5: Distributions with reference kurtosis (solid line) and mild kurtosis (dashed
line)
Fourth Central Moment: Kurtosis
The fourth central moment, \mu_4 = E[(X - \mu)^4], provides a measure of how peaked or flat a distribution is; it is, however, the dimensionless normalized version,
\gamma_4 = \frac{\mu_4}{\sigma^4}    (4.83)
technically known as the coefficient of kurtosis, that is simply called the kurtosis. Either quantity is a measure of how peaked or flat a probability distribution is. A high kurtosis random variable has a distribution with a sharper peak and thicker tails; the low kurtosis random variable on the other hand has a distribution with a more rounded, flatter peak, with broader shoulders. For reasons discussed later, the value \gamma_4 = 3 is the accepted "normal" reference for kurtosis, so that distributions for which \gamma_4 < 3 are said to be platykurtic (mildly peaked) while those for which \gamma_4 > 3 are said to be leptokurtic (sharply peaked). Figures 4.5 and 4.6 show a reference distribution with kurtosis \gamma_4 = 3, in the solid lines, compared to a distribution with mild kurtosis (actually \gamma_4 = 1.8) (dashed line in Fig 4.5), and a distribution with high kurtosis (dashed line in Fig 4.6).
Practical Applications
Of course, it is possible to compute as many moments (ordinary or central) of a distribution as one desires; in practice, however, the first few discussed above carry most of the useful information.
FIGURE 4.6: Distributions with reference kurtosis (solid line) and high kurtosis (dashed line)
Finally, we note that moments of a random variable are not merely interesting theoretical characteristics; they have significant practical applications. For example, polymers, being macromolecules with non-uniform molecular weights (because random events occurring during the manufacturing process ensure that polymer molecules grow to varying sizes), are primarily characterized by their molecular weight distributions (MWDs). Not surprisingly, therefore, the performance of a polymeric material depends critically on its MWD: for instance, with most elastomers, a narrow distribution (a very low second central moment) is associated with poor processing but superior mechanical properties.
MWDs are so important in polymer chemistry and engineering that a wide variety of analytical techniques have been developed for experimental determination of the MWD and the following special molecular weight averages that are in common use:
1. M_n, the number average molecular weight, is the ratio of the first (ordinary) moment to the zeroth ordinary moment. (In polymer applications, the MWD, unlike a pdf f(x), is not normalized to sum or integrate to 1. The zeroth moment of the MWD is therefore not 1; it is the total number of molecules present in the sample of interest.)
2. M_w, the weight average molecular weight, is the ratio of the second moment to the first moment; and
3. M_z, the so-called "z average" molecular weight, is the ratio of the third moment to the second.
One other important practical characteristic of the polymeric material is its polydispersity index, PDI, the ratio of M_w to M_n. A measure of the breadth of the MWD, it is always > 1, and approximately 2 for most linear polymers; for highly branched polymers, it can be as high as 20 or even higher.
What is true of polymers is also true of particulate products such as granulated sugar, or fertilizer granules sold in bags. These products are made up of particles with non-uniform sizes and are characterized by their particle size distributions. The behavior of these products, whether it is their flow characteristics or how they dissolve in solution, is determined by the moments of these distributions.
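The molecular weight averages are simply ratios of moments of the (unnormalized) MWD. A minimal sketch, using an entirely hypothetical discrete MWD (the counts n and weights M below are made up for illustration):

import numpy as np

# hypothetical MWD: n[i] molecules with molecular weight M[i] (g/mol)
M = np.array([20_000, 50_000, 100_000, 200_000, 400_000], dtype=float)
n = np.array([500, 1200, 900, 300, 50], dtype=float)

def moment(k):
    return np.sum(n * M**k)   # k-th (ordinary) moment of the MWD

Mn = moment(1) / moment(0)    # number average: 1st moment / 0th moment
Mw = moment(2) / moment(1)    # weight average: 2nd moment / 1st moment
Mz = moment(3) / moment(2)    # z average:      3rd moment / 2nd moment
PDI = Mw / Mn                 # polydispersity index, always > 1

print(f"Mn = {Mn:,.0f}, Mw = {Mw:,.0f}, Mz = {Mz:,.0f}, PDI = {PDI:.2f}")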
4.4.2 Moment Generating Function
When G(X) in Eq (4.64) is given as G(X) = e^{tX}, the resulting expectation, a function of the real-valued variable t,
M_X(t) = E[e^{tX}]    (4.85)
is known as the moment generating function (MGF) of the random variable X; i.e.,
M_X(t) = \begin{cases} \sum_i e^{tx_i} f(x_i); & \text{for discrete } X \\ \int_{-\infty}^{\infty} e^{tx} f(x)\,dx; & \text{for continuous } X \end{cases}    (4.86)
Differentiating M_X(t) with respect to t gives:
M_X'(t) = \frac{d}{dt} E\left[e^{tX}\right] = E\left[X e^{tX}\right]    (4.88)
so that, for t = 0,
M_X'(0) = E[X] = m_1    (4.89)
Differentiating once more,
\frac{d}{dt} E\left[X e^{tX}\right] = E\left[X^2 e^{tX}\right]    (4.90)
so that
M_X''(0) = E[X^2] = m_2    (4.91)
and, in general,
M_X^{(n)}(0) = E[X^n] = m_n    (4.92)
Alternatively, expanding e^{tX} as a power series and taking expectations term by term gives:
M_X(t) = E\left[1 + Xt + \frac{X^2}{2}t^2 + \frac{X^3}{3!}t^3 + \cdots\right]    (4.93)
Clearly, this infinite series converges only under certain conditions. For those random variables, X, for which the series does not converge, M_X(t) does not exist; but when it exists, this series converges, and by repeated differentiation of Eq (4.93) with respect to t, followed by taking expectations, we are then able to establish the result in Eq (4.92).
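The moment-generating mechanism of Eq (4.92) is easy to verify symbolically. Assuming (as derived for the CSTR residence-time pdf in Example 4.7) that M_X(t) = 1/(1 - \tau t) for t < 1/\tau, the following sympy sketch (our illustration) recovers the first two ordinary moments by differentiation at t = 0:

import sympy as sp

t, tau = sp.symbols('t tau', positive=True)
M = 1 / (1 - tau * t)   # MGF of the residence-time pdf (assumed; cf. Example 4.7)

m1 = sp.diff(M, t, 1).subs(t, 0)   # first ordinary moment, E(X)
m2 = sp.diff(M, t, 2).subs(t, 0)   # second ordinary moment, E(X^2)

print(m1)                         # tau
print(m2)                         # 2*tau**2
print(sp.simplify(m2 - m1**2))    # variance: m2 - m1^2 = tau**2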
The following are some important properties of the MGF.
1. Uniqueness: The MGF, M_X(t), does not exist for all random variables, X; but when it exists, it uniquely determines the distribution, so that if two random variables have the same MGF, they have the same distribution. Conversely, random variables with different MGFs have different distributions.
4.4.3 Characteristic Function
As alluded to above, the MGF does not exist for all random variables, a fact that sometimes limits its usefulness. However, a similarly defined function, the characteristic function, shares all the properties of the MGF but does not suffer from this primary limitation: it exists for all random variables.
When G(X) in Eq (4.64) is given as:
G(X) = e^{jtX}    (4.100)
where j is the complex variable \sqrt{-1}, then the function of the real-valued variable t defined as
\varphi_X(t) = E\left[e^{jtX}\right]    (4.101)
i.e.,
\varphi_X(t) = \begin{cases} \sum_i e^{jtx_i} f(x_i); & \text{for discrete } X \\ \int_{-\infty}^{\infty} e^{jtx} f(x)\,dx; & \text{for continuous } X \end{cases}    (4.102)
is known as the characteristic function (CF) of the random variable X. Because, by Euler's formula,
e^{jtX} = \cos(tX) + j\sin(tX)    (4.103)
the modulus of e^{jtX} is given by:
|e^{jtX}|^2 = \cos^2(tX) + \sin^2(tX) = 1    (4.104)
so that E|e^{jtX}| = 1 < \infty, always, regardless of X, with the direct implication that \varphi_X(t) = E(e^{jtX}) always exists for all random variables. Thus, anything one would have typically used the MGF for (e.g., for deriving limit theorems in advanced courses in probability), one can always substitute the CF when the MGF does not exist.
The reader familiar with Laplace transforms and Fourier transforms will probably have noticed the similarities between the former and the MGF (see Eq (4.86)), and between the latter and the CF (see Eq (4.102)). Furthermore, the relationship between these two probability functions is also reminiscent of the relationship between the two transforms: not all functions have Laplace transforms; the Fourier transform, on the other hand, does not suffer such limitations.
We now state, without proof, that given the expression for the characteristic function in Eq (4.102), there is a corresponding inversion formula whereby f(x) is recovered from \varphi_X(t), given as follows:
f(x) = \begin{cases} \lim_{b \to \infty} \frac{1}{2b} \int_{-b}^{b} e^{-jtx} \varphi_X(t)\,dt; & \text{for discrete } X \\ \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-jtx} \varphi_X(t)\,dt; & \text{for continuous } X \end{cases}    (4.105)
In fact, the two sets of equations, Eqs (4.102) and (4.105), are formal Fourier
transform pairs precisely as in other engineering applications of the theory of
Fourier transforms. These transform pairs are extremely useful in obtaining
the pdfs of functions of random variables, most especially sums of random
variables. As with classic engineering applications of the Fourier (and Laplace)
transform, the characteristic functions of the functions of independent random
variables in question are obtained first, being easier to obtain directly than
the pdfs; the inversion formula is subsequently invoked to recover the desired
pdfs. This strategy is employed at appropriate places in upcoming chapters.
4.4.4 Mode, Median and Quantiles
Apart from the mean, variance and other higher moments noted above, there are other characteristic attributes of importance.
FIGURE 4.7: A unimodal distribution with mode at x* = 1
The mode, x*, is the value of the random variable at which the pdf, f(x), attains its maximum (see Fig 4.7). The median, x_m, is the value of the random variable that divides the distribution exactly in half, i.e., the value for which
F(x_m) = \int_{-\infty}^{x_m} f(x)\,dx = 0.5    (4.109)
(For the discrete random variable, replace the integral above with appropriate
sums.) Observe therefore that the median, x_m, divides the total range of the random variable into two parts with equal probability.
FIGURE 4.8: The cdf of a continuous random variable X showing the lower and upper quartiles and the median
For a symmetric unimodal distribution, the mean, mode and median coincide; they are different for asymmetric (skewed) distributions.
Quartiles
The concept of a median, which divides the cdf at the 50% point, can be extended to other values indicative of other fractional "sectioning off" of the cdf. Thus, by referring to the median as x_{0.5}, or x_{50}, we are able to define, in the same spirit, the following values of the random variable, x_{0.25} and x_{0.75} (or, in terms of percentages, x_{25} and x_{75} respectively) as follows:
F(x_{0.25}) = 0.25    (4.110)
the value of X below which lies one quarter of the population; and
F(x_{0.75}) = 0.75    (4.111)
the value of X below which lies three quarters of the population. These values are known respectively as the lower and upper quartiles of the distribution because, along with the median x_{0.5}, these values divide the population into four quarters, each part with equal probability.
These concepts are illustrated in Fig 4.8 where the lower quartile is located at x = 1.02, the median at x = 1.58, and the upper quartile at x = 2.14. Thus, for this particular example, P(X < 1.02) = 0.25; P(1.02 < X < 1.58) = 0.25; P(1.58 < X < 2.14) = 0.25; and P(X > 2.14) = 0.25.
119
There is nothing restricting us to dividing the population in halves (median) or in quarters (quartiles); in general, for any 0 < q < 1, the q th quantile
is dened as that value xq of the random variable for which
xq
F (xq ) =
f (x)dx = q
(4.112)
for a continuous random variable (with the integral replaced by the appropriate sum for the discrete random variable).
This quantity is sometimes dened instead in terms of percentiles, in which
case, the q th quantile is simply the 100q percentile. Thus, the median is equivalently the half quantile, the 50th percentile, or the second quartile.
4.4.5 Entropy
The "information content" of an outcome x_i of a random variable X, whose pdf is f(x), may be quantified as
I(x_i) = -\log_2 f(x_i)    (4.113)
The expected value of this quantity,
H(X) = E[I(X)] = -\sum_i f(x_i) \log_2 f(x_i)    (4.114)
for the discrete random variable, or
H(X) = -\int_{-\infty}^{\infty} f(x) \log_2 f(x)\,dx    (4.115)
for the continuous one, is known as the entropy of the random variable, or, its mean information content.
Chapter 10 explores how to use the concept of information and entropy to
develop appropriate probability models for practical problems in science and
engineering.
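For the binary random variable of Exercise 4.22, the entropy (in bits, i.e., with base-2 logarithms) has a simple closed form, and a quick numerical scan confirms that it peaks at p = 0.5 with H = 1. A minimal sketch:

import numpy as np

def bernoulli_entropy(p):
    # H(X) = -sum f(x) log2 f(x); terms with f = 0 contribute nothing
    terms = [q * np.log2(q) for q in (p, 1.0 - p) if q > 0.0]
    return -sum(terms)

ps = np.linspace(0.01, 0.99, 99)
H = np.array([bernoulli_entropy(p) for p in ps])
print(f"max H = {H.max():.4f} at p = {ps[H.argmax()]:.2f}")   # 1.0000 at p = 0.50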
4.4.6 Probability Bounds
We now know that the pdf, f(x), of a random variable contains all the information about it to enable us to compute the probabilities of occurrence of various outcomes of interest. As valuable as this is, there are times when all we need are bounds on probabilities, not exact values. We now discuss some of the most important results regarding bounds on probabilities that can be determined for any general random variable, X, without specific reference to any particular pdf. These results are very useful in analyzing the behavior of random phenomena and have practical implications in determining values of
unknown population parameters.
We begin with a general lemma from which we then derive two important results.
Lemma: For any random variable, X, any nonnegative function of it, G(X), and any constant c > 0,
P[G(X) \ge c] \le \frac{E[G(X)]}{c}    (4.116)
There are several different ways of proving this result; one of the most direct is shown below.
Proof: By definition,
E[G(X)] = \int_{-\infty}^{\infty} G(x) f(x)\,dx    (4.117)
If we now divide the real line -\infty < x < \infty into two mutually exclusive regions, A = \{x : G(x) \ge c\} and B = \{x : G(x) < c\}, i.e., A is that region on the real line where G(x) \ge c, and B is what is left, then Eq (4.117) becomes:
E[G(X)] = \int_A G(x) f(x)\,dx + \int_B G(x) f(x)\,dx \ge \int_A G(x) f(x)\,dx    (4.118)
\ge c \int_A f(x)\,dx    (4.119)
where the last inequality arises because, for all x \in A (the region over which we are integrating), G(x) \ge c, with the net result that:
E[G(X)] \ge c P(G(X) \ge c)    (4.120)
from which Eq (4.116) follows.
This remarkable result holds for all random variables, X, and for any nonnegative function of the random variable, G(X). Two specific cases of G(X) give rise to results of special interest.
Markov's Inequality
When G(X) = X, Eq (4.116) immediately becomes:
P(X \ge c) \le \frac{E(X)}{c}    (4.122)
a result known as Markov's inequality. It allows us to place bounds on probabilities when only the mean value of a random variable is known. For example, if the average number of inclusions on glass sheets manufactured at a specific site is known to be 2, then according to Markov's inequality, the probability of finding a glass sheet containing 5 or more inclusions at this manufacturing site can never exceed 2/5. Thus, if glass sheets containing 5 or more inclusions are considered unsaleable, without reference to any specific probability model of the random phenomenon in question, the plant manager concerned about making unsaleable product can, by appealing to Markov's inequality, be sure that things will never be worse than 2 in 5 unsaleable products.
It is truly remarkable, of course, that such statements can be made at all;
but in fact, this inequality is actually quite conservative. As one would expect,
with an appropriate probability model, one can be even more precise. (Table
2.1 in Chapter 2 in fact shows that the actual probability of obtaining 5 or
more inclusions on glass sheets manufactured at this site is 0.053, nowhere
close to the upper limit of 0.4 given by Markov's inequality.)
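The glass-sheet comparison above is easily reproduced. Assuming (as elsewhere in this book) that the number of inclusions is well described by a Poisson distribution with mean 2, a minimal sketch contrasting Markov's bound with the actual probability:

from scipy.stats import poisson

mean_inclusions = 2.0
c = 5

markov_bound = mean_inclusions / c            # P(X >= 5) <= E(X)/5 = 0.4
actual = poisson.sf(c - 1, mean_inclusions)   # P(X >= 5) = 1 - F(4), about 0.053

print(f"Markov bound: {markov_bound:.3f}; actual (Poisson): {actual:.3f}")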
Chebychev's Inequality
Now let G(X) = (X - \mu)^2, and c = k^2\sigma^2, where \mu is the mean value of X, and \sigma^2 is the variance, i.e., \sigma^2 = E[(X - \mu)^2]. In this case, Eq (4.116) becomes
P[(X - \mu)^2 \ge k^2\sigma^2] \le \frac{1}{k^2}    (4.123)
or, equivalently,
P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}    (4.124)
a result known as Chebychev's inequality.
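Chebychev's inequality can be checked against any distribution with known mean and variance. For the exponential residence-time pdf, \mu = \tau and \sigma = \tau, so that for k \ge 1, P(|X - \mu| \ge k\sigma) = P(X \ge (k + 1)\tau) = e^{-(k+1)}. A minimal sketch comparing this with the 1/k^2 bound:

import numpy as np

for k in (2, 3, 4):
    chebychev_bound = 1.0 / k**2
    actual = np.exp(-(k + 1.0))   # P(|X - tau| >= k*tau) for the exponential pdf
    print(f"k = {k}: bound = {chebychev_bound:.4f}, actual = {actual:.4f}")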
4.5 Special Probability Functions
4.5.1 Survival Function
The survival function, S(x), is the probability that the random variable X exceeds the specific value x; in lifetime applications, this translates to the probability that the object of study survives beyond the value x, i.e.,
S(x) = P(X > x)    (4.126)
It is therefore related to the cdf, F(x), according to:
S(x) = 1 - F(x)    (4.127)
Example 4.8 SURVIVAL FUNCTION OF A CONTINUOUS RANDOM VARIABLE
Find the survival function S(x), for the random variable, X, the residence time in a CSTR, whose pdf is given in Eq (4.41). This function
directly provides the probability that any particular dye molecule survives in the CSTR beyond a time x.
Solution:
Observe first that this random variable is continuous and non-negative so that the desired S(x) does in fact exist. The required S(x) is given by:
S(x) = \int_x^{\infty} \frac{1}{\tau} e^{-x/\tau}\,dx = e^{-x/\tau}    (4.128)
We could equally well have arrived at the result by noting that the cdf, F(x), for this random variable is given by:
F(x) = (1 - e^{-x/\tau})    (4.129)
4.5.2 Hazard Function
In lifetime studies, one often wishes to quantify the instantaneous risk of failure at age x for those objects that have survived up until then. The hazard function, h(x), defined as:
h(x) = \frac{f(x)}{S(x)} = \frac{f(x)}{1 - F(x)}    (4.130)
provides just such a function. It does for future failure what f(x) does for lifetimes in general. Recall that, by definition, because X is continuous, f(x) provides the (unconditional) probability of a lifetime in the infinitesimal interval \{x_i < X < x_i + dx\} as f(x_i)dx; in the same manner, the probability of failure occurring in that same interval, given that the object of study survived until the beginning of the current time interval, x_i, is given by h(x_i)dx. In general,
h(x)dx = \frac{f(x)dx}{S(x)} = \frac{P(x < X < x + dx)}{P(X > x)}    (4.131)
so that, from the definition of conditional probability given in Chapter 3, h(x)dx is seen as equivalent to P(x < X < x + dx|X > x). h(x) is therefore sometimes referred to as the "death rate" or "failure rate" at x of those surviving until x (i.e., of those "at risk" at x); it describes how the risk of failure changes with age.
Example 4.9 HAZARD FUNCTION OF A CONTINUOUS
RANDOM VARIABLE
Find the hazard function h(x), for the random variable, X, the residence time in a CSTR.
Solution:
From the given pdf and the survival function obtained in Example 4.8 above, the required function h(x) is given by,
h(x) = \frac{\frac{1}{\tau} e^{-x/\tau}}{e^{-x/\tau}} = \frac{1}{\tau}    (4.132)
4.5.3 Cumulative Hazard Function
Analogous to the cdf, F(x), the cumulative hazard function, H(x), is defined as:
H(x) = \int_0^x h(u)\,du    (4.133)
It can be shown that H(x) is related to the more well-known F(x) according to
F(x) = 1 - e^{-H(x)}    (4.134)
and that the relationship between S(x) and H(x) is given by:
S(x) = e^{-H(x)}    (4.135)
or, conversely,
H(x) = -\log[S(x)]    (4.136)
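The relationships in Eqs (4.133)-(4.136) can be verified numerically for the CSTR residence time, where f(x) = (1/\tau)e^{-x/\tau}, S(x) = e^{-x/\tau}, h(x) = 1/\tau, and H(x) = x/\tau, so that S(x) = e^{-H(x)} identically. A minimal sketch:

import numpy as np

tau = 30.0
x = np.linspace(0.0, 200.0, 2001)

f = np.exp(-x / tau) / tau         # pdf of the residence time
S = np.exp(-x / tau)               # survival function, Eq (4.128)
h = f / S                          # hazard function; constant at 1/tau (Eq 4.132)
H = np.cumsum(h) * (x[1] - x[0])   # cumulative hazard by crude quadrature (Eq 4.133)

print(np.allclose(h, 1.0 / tau))        # True
print(np.max(np.abs(S - np.exp(-H))))   # ~0, up to quadrature error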
4.6 Summary and Conclusions
We are now in a position to look back at this chapter and observe, with some perspective, how the introduction of the seemingly innocuous random variable, X, has profoundly affected the analysis of randomly varying phenomena, in a manner analogous to how the introduction of the unknown quantity, x, transformed algebra and the solution of algebraic problems. We have seen how the random variable, X, maps the sometimes awkward and
125
tedious sample space, , into a space of real numbers; how this in turn leads
to the emergence of f (x), the probability distribution function (pdf); and
how f (x) has essentially supplanted and replaced the probability set function,
P (A), the probability analysis tool in place at the end of Chapter 3.
The full significance of the role of f(x) in random phenomena analysis may not be completely obvious now, but it will become more so as we progress in our studies. So far, we have used it to characterize the random variable in terms of its mathematical expectation, and the expectation of various other functions of the random variable. And this has led, among other things, to our first encounter with the mean, variance, skewness and kurtosis of a random variable, important descriptors of data that we are sure to encounter again later (in Chapter 12 and beyond).
Despite initial appearances, every single topic discussed in this chapter finds useful application in later chapters. In the meantime, we have taken pains to try and breathe some practical life into many of these typically dry and formal definitions and mathematical functions. But if some, especially the moment generating function, the characteristic function, and entropy, still appear to be of dubious practical consequence, such lingering doubts will be dispelled completely by Chapters 6, 8, 9 and 10. Similarly, the probability bounds (especially Chebyshev's inequality) will be employed in Chapter 8, and the special functions of Section 4.5 will be used extensively in their more natural setting in Chapter 23.
The task of building an efficient machinery for random phenomena analysis, which began in Chapter 3, is now almost complete. But before the generic pdf, f(x), introduced and characterized in this chapter begins to take on specific, distinct personalities for various random phenomena, some residual issues remain to be addressed in order to complete the development of the probability machinery. Specifically, the discussion in this chapter will be extended to higher dimensions in Chapter 5, and the characteristics of functions of random variables will be explored in Chapter 6. Chapter 7 is devoted to two application case studies that put the complete set of discussions in Part II in perspective.
Here are some of the main points of the chapter again.
Formally, the random variable, X (discrete or continuous), assigns to each element \omega \in \Omega one and only one real number, X(\omega) = x, thereby mapping \Omega onto a new space, V; informally, it is an experimental outcome whose numerical value is subject to random variations with each exact replicate trial of the experiment.
The introduction of the random variable, X, leads directly to the emergence of f (x), the probability distribution function; it represents how the
probabilities of occurrence of all the possible outcomes of the random
experiment of interest are distributed over the entire random variable
space, and is a direct extension of P (A).
For the continuous random variable, the pdf is related to the cdf, F(x), by
\frac{dF(x)}{dx} = f(x)
The expected value, E(X), exists only when \sum_i |x_i| f(x_i) < \infty (absolute convergence) for the discrete random variable, or when \int_{-\infty}^{\infty} |x| f(x)\,dx < \infty (absolute integrability) for the continuous one.
REVIEW QUESTIONS
1. Why is the raw sample space, \Omega, often tedious to describe and inefficient to analyze mathematically?
2. Through what means is the general sample space converted into a space with real numbers?
3. Formally, what is a random variable?
4. What two mathematical transformations occur as a consequence of the formal introduction of the random variable, X?
5. How is the induced probability set function, P_X, related to the probability set function, P, defined on \Omega?
6. What is the pre-image, A*, of the set A?
7. What is the relationship between the random variable, X, and the associated real number, x? What does the expression P(X = x) indicate?
8. When does the sample space, \Omega, naturally occur in the form of the random variable space, V?
9. Informally, what is a random variable?
10. What is the difference between a discrete random variable and a continuous one?
11. What is the pdf, f(x), and what does it represent for the random variable, X?
12. What is the relationship between the pdf, f(x_i), and the cdf, F(x_i), for a discrete random variable, X?
13. What is the relationship between the pdf, f(x), and the cdf, F(x), for a continuous random variable, X?
14. Define mathematically the expected value, E(X), for a discrete random variable and for a continuous one.
15. What conditions must be satisfied for E(X) to exist?
16. Is E(X) a random variable and does it have units?
17. What is the relationship between the expected value, E(X), and the mean value, \mu, of a random variable (or equivalently, of its distribution)?
18. Distinguish between ordinary moments and central moments of a random variable.
19. What are the common names by which the second, third and fourth central moments of a random variable are known?
20. What is C_v, the coefficient of variation of a random variable?
21. What is the distinguishing characteristic of a skewed distribution (positive or negative)?
22. Give an example each of a negatively skewed and a positively skewed randomly varying phenomenon.
23. What do the mean, variance, skewness, and kurtosis tell us about the distribution of the random variable in question?
24. What do M_n, M_w, and M_z represent for a polymer material?
25. What is the polydispersity index of a polymer and what does it indicate about the molecular weight distribution?
26. Define the moment generating function (MGF) of a random variable, X. Why is it called by this name?
27. What is the uniqueness property of the MGF?
28. Define the characteristic function of a random variable, X. What distinguishes it from the MGF?
29. How are the MGF and characteristic function (CF) of a random variable related to the Laplace and Fourier transforms?
30. Define the mode, median, quartiles and percentiles of a random variable.
31. Within the context of this chapter, what is entropy?
32. Define Markov's inequality. It allows us to place probability bounds when what is known about the random variable?
33. Define Chebychev's inequality.
34. Which probability bound is sharper, the one provided by Markov's inequality or the one provided by Chebychev's?
35. What are the defining characteristics of those random variables for which the special probability functions, the survival and hazard functions, are applicable? These functions are used predominantly in studying what types of phenomena?
36. Define the survival function, S(x). How is it related to the cdf, F(x)?
37. Define the hazard function, h(x). How is it related to the pdf, f(x)?
38. Define the cumulative hazard function, H(x). How is it related to the cdf, F(x), and the survival function, S(x)?
EXERCISES
Section 4.1
4.1 Consider a family that plans to have a total of three children; assuming that they will not have any twins, generate the sample space, \Omega, for the possible outcomes. By defining the random variable, X, as the total number of female children born to this family, obtain the corresponding random variable space, V. Given that this particular family is genetically predisposed to having boys, with a probability p = 0.75 of giving birth to a boy, obtain the probability that this family will have three boys and compare it to the probability of having other combinations.
4.2 Revisit Example 4.1 in the text, and this time, instead of tossing a coin three times, it is tossed 4 times. Generate the sample space, \Omega; and using the same definition of X as the total number of tails, obtain the random variable space, V, and compute anew the probability of A, the event that X = 2.
4.3 Given the spaces \Omega and V for the double dice toss experiment in Example 4.3 in the text,
(i) Compute the probability of the event A that X = 7;
(ii) If B is the event that X = 6, and C the event that X = 10 or X = 11, compute
P (B) and P (C).
Section 4.2
4.4 Revisit Example 4.3 in the text on the double dice toss experiment and obtain
the complete pdf f (x) for the entire random variable space. Also obtain the cdf,
F (x). Plot both distribution functions.
4.5 Given the following probability distribution function for a discrete random variable, X,

x    : 1     2     3     4     5
f(x) : 0.10  0.25  0.30  0.25  0.10

\left(\frac{x}{n}\right)^k; \; x = 1, 2, \ldots, n    (4.137)

(4.138)
(i) First obtain the value of the constant, c, required for this to be a legitimate pdf, and then obtain an expression for the cdf, F(x).
(ii) Obtain P(X \le 1/2) and P(X \ge 1/2).
(iii) Obtain the value x_m such that
P(X \le x_m) = P(X \ge x_m)    (4.139)
4.8 The distribution of residence times in an ideal CSTR is given in Eq (4.41). For a reactor with average residence time \tau = 30 mins, determine the probability that a reactant molecule (i) spends less than 30 mins in the reactor; (ii) spends more than 30 mins in the reactor; (iii) spends less than (30 \ln 2) mins in the reactor; and (iv) spends more than (30 \ln 2) mins in the reactor.
Section 4.3
4.9 Determine E(X) for the discrete random variable in Exercise 4.5 and for the continuous random variable in Exercise 4.6; and establish that E(X) for the residence time distribution in Eq (4.41) is \tau, thereby justifying why this parameter is known as the mean residence time.

4.10 (Adapted from Stirzaker, 2003) Show that E(X) exists for the discrete random variable, X, with the pdf:
f(x) = \frac{4}{x(x+1)(x+2)}; \; x = 1, 2, \ldots    (4.140)
while E(X) does not exist for the discrete random variable with the pdf
f(x) = \frac{1}{x(x+1)}; \; x = 1, 2, \ldots    (4.141)

4.11 Establish that E(X) = 1/p for a random variable X whose pdf is
f(x) = p(1-p)^{x-1}; \; x = 1, 2, 3, \ldots    (4.142)
noting that
\sum_{x=1}^{\infty} p(1-p)^{x-1} = 1    (4.143)

4.12 From the definition of the mathematical expectation function, E(\cdot), establish that for the random variable, X, discrete or continuous:
E[k_1 g_1(X) + k_2 g_2(X)] = k_1 E[g_1(X)] + k_2 E[g_2(X)]    (4.144)
for arbitrary constants k_1, k_2, and arbitrary functions g_1(X) and g_2(X).

4.13 Given two random variables, X and Y, and Z defined as their sum, Z = X + Y, establish that
E(Z) = E(X) + E(Y); \text{ i.e., } \mu_Z = \mu_X + \mu_Y    (4.147)
and that
Var(Z) = Var(X) + Var(Y)    (4.148)
when E[(X - \mu_X)(Y - \mu_Y)] = 0 (i.e., when X and Y are independent: see Chapter 5).

4.14 Given that the pdf of a certain discrete random variable X is:
f(x) = \frac{\lambda^x e^{-\lambda}}{x!}; \; x = 0, 1, 2, \ldots    (4.149)
first establish that
\sum_{x=0}^{\infty} f(x) = 1    (4.150)
and then obtain E(X)    (4.151)
and Var(X)    (4.152)

4.15 Obtain the variance and skewness of the discrete random variable in Exercise 4.5 and of the continuous random variable in Exercise 4.6. Which random variable's distribution is skewed, and which is symmetric?

4.16 From the formal definitions of the moment generating function, establish Eqns (4.95) and (4.96).
4.17 Given the pdf for the residence time for two identical CSTRs in series as
f(x) = \frac{1}{\tau^2} x e^{-x/\tau}    (4.153)
(i) obtain the MGF for this pdf and compare it with that derived in Example 4.7 in the text. From this comparison, what would you conjecture to be the MGF for the distribution of residence times for n identical CSTRs in series?
(ii) Obtain the characteristic function for the pdf in Eq (4.41) for the single CSTR and also for the pdf in Eq (4.153) for two CSTRs. Compare the two characteristic functions and conjecture what the corresponding characteristic function will be for the distribution of residence times for n identical CSTRs in series.
4.18 Given that M(t) is the moment generating function of a random variable, define the psi-function, \psi(t), as:
\psi(t) = \ln M(t)    (4.154)
(i) Prove that \psi'(0) = \mu, and \psi''(0) = \sigma^2, where each prime indicates differentiation with respect to t; E(X) = \mu is the mean of the random variable, and \sigma^2 is the variance, defined by \sigma^2 = Var(X) = E[(X - \mu)^2].
(ii) Given the pdf of a discrete random variable X as:
f(x) = \frac{\lambda^x e^{-\lambda}}{x!}; \; x = 0, 1, 2, \ldots
obtain its \psi(t) function and show, using the results in (i) above, that the mean and variance of this pdf are identical.
4.19 The pdf for the yield data discussed in Chapter 1 was postulated as
f(y) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(y-\mu)^2/2\sigma^2}; \; -\infty < y < \infty    (4.155)
If we are given that \mu is the mean, first establish that the mode is also \mu, and then use the fact that the distribution is perfectly symmetric about \mu to establish that the median is also \mu, hence confirming that for this distribution, the mean, mode and median coincide.

4.20 Given the pdf:
f(x) = \frac{1}{\pi} \frac{1}{1 + x^2}; \; -\infty < x < \infty    (4.156)
find the mode and the median and show that they coincide. For extra credit: establish that \mu = E(X) does not exist.

4.21 Compute the median and the other quartiles for the random variable whose pdf is given as:
f(x) = \begin{cases} x/2; & 0 < x < 2 \\ 0; & \text{otherwise} \end{cases}    (4.157)

4.22 Given the binary random variable, X, that takes the value 1 with probability p, and the value 0 with probability (1 - p), so that its pdf is given by
f(x) = \begin{cases} 1 - p; & x = 0 \\ p; & x = 1 \\ 0; & \text{elsewhere} \end{cases}    (4.158)
obtain an expression for the entropy H(X) and show that it is maximized when p = 0.5, taking on the value H^*(X) = 1 at this point.

Section 4.5
4.23 First show that the cumulative hazard function, H(x), for the random variable, X, the residence time in a CSTR, is the linear function,
H(x) = x/\tau    (4.159)
and from here obtain the pdf, f(y), for this random variable.
4.24 Given the pdf for the residence time for two identical CSTRs in series in Exercise 4.17, Eq (4.153), determine the survival function, S(x), and the hazard function,
h(x). Compare them to the corresponding results obtained for the single CSTR in
Example 4.8 and Example 4.9 in the text.
APPLICATION PROBLEMS
4.25 Before an automobile parts manufacturer takes full delivery of polymer resins made by a supplier in a reactive extrusion process, a sample is processed and the performance is tested for "Toughness". The batch is either accepted (if the processed sample's Toughness equals or exceeds 140 J/m^3) or it is rejected. As a result of process and raw material variability, the acceptance/rejection status of each batch varies randomly. If the supplier sends four batches weekly to the parts manufacturer, and each batch is made independently on the extrusion process, so that the ultimate fate of one batch is independent of the fate of any other batch, define X as the random variable representing the number of acceptable batches a week and answer the following questions:
(i) Obtain the sample space, \Omega, and the corresponding random variable space, V.
(ii) First, assume equal probability of acceptance and rejection, and obtain the pdf, f(x), for the entire sample space. If, for long-term profitability, it is necessary that at least 3 batches be acceptable per week, what is the probability that the supplier will remain profitable?
4.26 Revisit Problem 4.25 above and consider that after an extensive process and
control system improvement project, the probability of acceptance of a single batch
is improved to 0.8; obtain the new pdf, f (x). If the revenue from a single acceptable
batch is $20,000, but every rejected batch costs the supplier $8,000 in retrieval and
incineration fees, which will be deducted from the revenue, what is the expected net
revenue per week under the current circumstances?
4.27 A gas station situated on a back country road has only one gasoline pump and one attendant and, on average, receives \lambda = 3 (cars/hour). The average rate at which this lone attendant services the cars is \mu (cars/hour). It can be shown that the total number of cars at this gas station at any time (i.e., the one currently being served, and those waiting in line to be served) is the random variable X with the following pdf:
f(x) = \left(1 - \frac{\lambda}{\mu}\right)\left(\frac{\lambda}{\mu}\right)^x; \; x = 0, 1, 2, \ldots    (4.162)
(i) Show that so long as \lambda < \mu, the probability that the line at the gas station is infinitely long is zero.
(ii) Find the value of \mu required so that the expected value of the total number of
Percent of population with income level, x: 4, 13, 17, 20, 16, 12, 7, 4, 3, 2, 1, 1
f(x) = \frac{1}{\beta} e^{-x/\beta}; \; x > 0    (4.163)
no more than 15% would have to be replaced before the warranty period expires. Find x_w.
(ii) In planning for the second generation toaster, design engineers wish to set a target value to aim for (\beta = \beta_2) such that 85% of the second generation chips survive beyond 3 years. Determine \beta_2 and interpret your results in terms of the implied fold increase in mean life-span from the first to the second generation of chips.
4.30 The probability of a single transferred embryo resulting in a live birth in an
in-vitro fertilization treatment, p, is given as 0.5 for a younger patient and 0.2 for
an older patient. When n = 5 embryos are transferred in a single treatment, it is
also known that if X is the total number of live births resulting from this treatment,
then E(X) = 2.5 for the younger patient and E(X) = 1 for the older patient, and
the associated variance, V ar(X) = 1.25 for the younger and V ar(X) = 0.8 for the
older.
(i) Use Markov's inequality and Chebyshev's inequality to obtain bounds on the probability of each patient giving birth to quadruplets or quintuplets at the end of the treatment.
(ii) These bounds are known to be quite conservative, but to determine just how conservative, compute the actual probabilities of the stated events for each patient given that an appropriate pdf for X is
f(x) = \frac{5!}{x!(5-x)!} p^x (1-p)^{5-x}    (4.164)
where p is as given above. Compare the actual probabilities with the Markov and
Chebychev bounds and identify which bound is sharper.
4.31 The following data table, obtained from the United States Life Tables 1969-71 (published in 1973 by the National Center for Health Statistics), shows the probability of survival until the age of 65 for individuals of the given age.
Age, y | Prob. of survival to age 65
0      | 0.72
10     | 0.74
20     | 0.74
30     | 0.75
35     | 0.76
40     | 0.77
45     | 0.79
50     | 0.81
55     | 0.85
60     | 0.90
The data should be interpreted as follows: the probability that all newborns, and
children up to the age of ten survive until 65 years of age is 0.72; for those older
than 10 and up to 20 years, the probability of survival to 65 years is 0.74, and so on.
(More up-to-date versions, available, for example, in National Vital Statistics Reports, Vol. 56, No. 9, December 28, 2007, contain far more detailed information.)
Assuming that the data is still valid in 1975, a community cooperative wishes to set up a life insurance program that year whereby each participant pays a relatively small annual premium, $\alpha, and, in the event of death before 65 years, a one-time death gratuity payment of $\beta is made to the participant's designated beneficiary.
If the participant survives beyond 65 years, nothing is paid. If the cooperative is to realize a fixed, modest expected revenue, $R_E = $30, per year, per participant, over the duration of his/her participation (mostly to cover administrative and other costs), provide answers to the following questions:
(i) For a policy based on a fixed annual premium of $90 for all participants, and age-dependent payout, determine values for \beta(y), the published payout for a person of age y that dies before age 65, for all values of y listed in this table.
(ii) For a policy based instead on a fixed death payout of $8,000, and age-dependent annual premiums, determine values for \alpha(y), the published annual premium to be collected each year from a participant of age y.
(iii) If it becomes necessary to increase the expected revenue by 50% as a result of increased administrative and overhead costs, determine the effect on each of the policies in (i) and (ii) above.
(iv) If by 1990, the probabilities of survival have increased across the board by 0.05, determine the effect on each of the policies in (i) and (ii).
Chapter 5
Multidimensional Random Variables
When the outcome of interest in an experiment is not one, but two or more
variables simultaneously, additional issues arise that are not fully addressed
by the probability machinery as it stands at the end of the last chapter. The concept of the random variable, restricted as it currently is to the single, one-dimensional random variable X, needs to be extended to higher dimensions;
few new concepts, new varieties of the probability distribution function (pdf)
emerge along with new variations on familiar results; together, they expand
and supplement what we already know about random variables and bring to
a conclusion the discussion we started in Chapter 4.
5.1
5.1.1
5.1.2
As with the single random variable case, associated with this two-dimensional random variable is a space, V, and a probability set function P_X induced by X = (X_1, X_2), where V is defined as:
V = \{(x_1, x_2) : X_1(\omega) = x_1, X_2(\omega) = x_2; \; \omega \in \Omega\}    (5.1)
The most important point to note here is that the random variable space, V, involves X_1 and X_2 simultaneously; it is not merely a union of separate spaces V_1 for X_1 and V_2 for X_2.
An example of a bivariate random variable was presented in Example 4.4
in Chapter 4; here is another.
Example 5.1 BIVARIATE RANDOM VARIABLE AND INDUCED PROBABILITY FUNCTION FOR COIN TOSS EXPERIMENT
Consider an experiment involving tossing a coin 2 times and recording the number of observed heads and tails: (1) Obtain the sample space \Omega; and (2) Define X as a two-dimensional random variable (X_1, X_2) where X_1 is the number of heads obtained in the first toss, and X_2 is the number of heads obtained in the second toss. Obtain the new space V. (3) Assuming equiprobable outcomes, obtain the induced probability P_X.
Solution:
(1) From the nature of the experiment, the required sample space, \Omega, is given by
\Omega = \{HH, HT, TH, TT\}    (5.2)
consisting of all 4 possible outcomes, which may be represented respectively as \omega_i; i = 1, 2, 3, 4, so that
\Omega = \{\omega_1, \omega_2, \omega_3, \omega_4\}    (5.3)
(2) The two-dimensional random variable X = (X_1, X_2) maps \Omega onto the space V given by:
V = \{(1, 1), (1, 0), (0, 1), (0, 0)\}    (5.4)
since these are all the possible values that the two-dimensional X can take.
(3) This is a case where there is a direct one-to-one mapping between the 4 elements of the original sample space and the induced random variable space V; as such, for equiprobable outcomes, we obtain:
P_X(1, 1) = 1/4
P_X(1, 0) = 1/4
P_X(0, 1) = 1/4
P_X(0, 0) = 1/4    (5.5)
In making sense of the formal definition given here for the bivariate (2-dimensional) random variable, the reader should keep in mind the practical considerations presented in Chapter 4 for the single random variable. The same issues there apply here. In a practical sense, the bivariate random variable may be considered simply, if informally, as an experimental outcome with two components, each with numerical values that are subject to random variations with exact replicate performance of the experiment.
For example, consider a polymer used for packaging applications, for which
the quality measurements of interest are melt index (indicative of the molecular weight distribution), and density (indicative of co-polymer composition). With each performance of lab analysis on samples taken from the manufacturing process, the values obtained for each of these quantities are subject
to random variations. Without worrying so much about the original sample
space or the induced one, we may consider the packaging polymer quality characteristics directly as the two-dimensional random variable whose components
are melt index (as X1 ), and density (as X2 ).
We now note that it is fairly common for many textbooks to use X and Y to represent bivariate random variables. We choose to use X_1 and X_2 because it offers a notational convenience that facilitates generalization to n > 2.
5.1.3
5.2
5.2.1 Joint Distributions
For the coin-toss experiment of Example 5.1, for instance, the joint pdf, f(x_1, x_2), is:
f(x_1, x_2) = \begin{cases} 1/4; & x_1 = 1, x_2 = 1 \\ 1/4; & x_1 = 1, x_2 = 0 \\ 1/4; & x_1 = 0, x_2 = 1 \\ 1/4; & x_1 = 0, x_2 = 0 \\ 0; & \text{otherwise} \end{cases}    (5.7)
showing how the probabilities are distributed over the 2-dimensional random variable space, V . Once again, we note the following about the function
f (x1 , x2 ):
1. f(x_1, x_2) > 0; \; \forall (x_1, x_2) \in V
2. \sum_{x_1} \sum_{x_2} f(x_1, x_2) = 1
These results are direct extensions of the axiomatic statements given earlier for the discrete single random variable pdf.
The probability that both X_1 < x_1 and X_2 < x_2 is given by the cumulative distribution function,
F(x_1, x_2) = P(X_1 < x_1, X_2 < x_2)    (5.8)
The function
f(x_1, x_2) = \frac{\partial^2 F(x_1, x_2)}{\partial x_1 \partial x_2}    (5.9)
is called the joint probability density function for the continuous two-dimensional random variables X_1 and X_2. As with the discrete case, the formal properties of the continuous joint pdf are:
1. f(x_1, x_2) \ge 0; \; \forall (x_1, x_2) \in V;
2. f has at most a finite number of discontinuities in every finite interval in V;
3. The double integral, \int_{x_1}\int_{x_2} f(x_1, x_2)\,dx_1 dx_2 = 1;
4. P_X(A) = \int\int_A f(x_1, x_2)\,dx_1 dx_2; \text{ for } A \subset V
Thus,
P(a_1 \le X_1 \le a_2; \; b_1 \le X_2 \le b_2) = \int_{b_1}^{b_2} \int_{a_1}^{a_2} f(x_1, x_2)\,dx_1 dx_2    (5.10)

Example 5.2 JOINT PDF OF A CONTINUOUS BIVARIATE RANDOM VARIABLE
The reliability of the temperature control system of a commercial polymer reactor is characterized in terms of two random variables: X_1, the lifetime (in years) of the control hardware electronics, and X_2, the lifetime (in years) of the control valve on the cooling water line, with the joint pdf:
f(x_1, x_2) = \begin{cases} \frac{1}{50} e^{-(0.2x_1 + 0.1x_2)}; & 0 < x_1 < \infty; \; 0 < x_2 < \infty \\ 0; & \text{elsewhere} \end{cases}    (5.11)
(1) Establish that this is a legitimate pdf; and (2) obtain the probability
that the system lasts more than two years; (3) obtain the probability
that the electronic component functions for more than 5 years and the
control valve for more than 10 years.
Solution:
(1) If this is a legitimate joint pdf, then the following should hold:
\int_0^{\infty}\int_0^{\infty} f(x_1, x_2)\,dx_1 dx_2 = 1    (5.12)
In this case,
\int_0^{\infty}\int_0^{\infty} \frac{1}{50} e^{-(0.2x_1+0.1x_2)}\,dx_1 dx_2 = \frac{1}{50}\left[-5e^{-0.2x_1}\right]_0^{\infty}\left[-10e^{-0.1x_2}\right]_0^{\infty} = \frac{50}{50} = 1    (5.13)
(2) The probability that the system lasts more than two years is:
P(X_1 > 2; X_2 > 2) = \int_2^{\infty}\int_2^{\infty} \frac{1}{50} e^{-(0.2x_1+0.1x_2)}\,dx_1 dx_2 = e^{-0.4} \times e^{-0.2} = 0.549    (5.15)
Thus, the probability that the system lasts beyond the first two years is 0.549.
(3) The required probability, P(X_1 > 5; X_2 > 10), is obtained as:
P(X_1 > 5; X_2 > 10) = \int_{10}^{\infty}\int_{5}^{\infty} \frac{1}{50} e^{-(0.2x_1+0.1x_2)}\,dx_1 dx_2 = \left(\int_5^{\infty} \frac{1}{5} e^{-0.2x_1}\,dx_1\right)\left(\int_{10}^{\infty} \frac{1}{10} e^{-0.1x_2}\,dx_2\right) = (0.368)^2 = 0.135    (5.16)
5.2.2 Marginal Distributions
Consider the joint pdf f (x1 , x2 ) for the 2-dimensional random variable
(X1 , X2 ); it represents how probabilities are jointly distributed over the entire
(X1 , X2 ) plane in the random variable space. Were we to integrate over the
entire range of X2 (or sum over the entire range in the discrete case), what is
left is the following function of x_1 in the continuous case:
f_1(x_1) = \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_2    (5.17)
or, in the discrete case:
f_1(x_1) = \sum_{x_2} f(x_1, x_2)    (5.18)
This function, f_1(x_1), characterizes the behavior of X_1 alone, by itself, regardless of what is going on with X_2.
Observe that, if one wishes to determine P(a_1 < X_1 < a_2) with X_2 taking any value, by definition, this probability is determined as:
P(a_1 < X_1 < a_2) = \int_{a_1}^{a_2}\left[\int_{-\infty}^{\infty} f(x_1, x_2)\,dx_2\right] dx_1    (5.19)
an expression that is reminiscent of probability computations for single random variable pdfs.
In the same manner, the behavior of X_2 alone, regardless of X_1, is characterized by the function:
f_2(x_2) = \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1    (5.21)
obtained by integrating out X_1 from the joint pdf of X_1 and X_2; or, in the discrete case, it is:
f_2(x_2) = \sum_{x_1} f(x_1, x_2)    (5.22)
These pdfs, f_1(x_1) and f_2(x_2), respectively represent the probabilistic characteristics of each random variable X_1 and X_2 considered in isolation, as opposed to f(x_1, x_2) that represents the joint probabilistic characteristics when considered together. The formal definitions are given as follows: the marginal pdfs of X_1 and X_2 are defined as the functions:
f_1(x_1) = \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_2    (5.23)
and
f_2(x_2) = \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1    (5.24)
for continuous random variables, and, for discrete random variables, as the functions:
f_1(x_1) = \sum_{x_2} f(x_1, x_2)    (5.25)
and
f_2(x_2) = \sum_{x_1} f(x_1, x_2)    (5.26)
Each marginal pdf possesses all the usual properties of pdfs (nonnegativity and normalization), with the integrals replaced with sums for the discrete case. An illustrative example follows.
Example 5.3 MARGINAL DISTRIBUTIONS OF CONTINUOUS BIVARIATE RANDOM VARIABLE
Find the marginal distributions of the joint pdf given in Example 5.2 for characterizing the reliability of the commercial polymer reactor's temperature control system. Recall that the component random variables are X_1, the lifetime (in years) of the control hardware electronics, and X_2, the lifetime of the control valve on the cooling water line; the joint pdf is as given in Eq (5.11):
f(x_1, x_2) = \begin{cases} \frac{1}{50} e^{-(0.2x_1+0.1x_2)}; & 0 < x_1 < \infty; \; 0 < x_2 < \infty \\ 0; & \text{elsewhere} \end{cases}
Solution:
(1) For this continuous bivariate random variable, we have from Eq (5.17) that:
f_1(x_1) = \int_0^{\infty} \frac{1}{50} e^{-(0.2x_1+0.1x_2)}\,dx_2 = \frac{1}{50} e^{-0.2x_1} \int_0^{\infty} e^{-0.1x_2}\,dx_2 = \frac{1}{5} e^{-0.2x_1}    (5.27)
Similarly, from Eq (5.21), we have,
f_2(x_2) = \int_0^{\infty} \frac{1}{50} e^{-(0.2x_1+0.1x_2)}\,dx_1 = \frac{1}{50} e^{-0.1x_2} \int_0^{\infty} e^{-0.2x_1}\,dx_1 = \frac{1}{10} e^{-0.1x_2}    (5.28)
As an exercise, the reader should confirm that each of these marginal distributions is a legitimate pdf in its own right.
These ideas extend directly to n > 2 random variables whose joint pdf is given by f(x_1, x_2, \ldots, x_n). There will be n separate marginal distributions f_i(x_i); i = 1, 2, \ldots, n, each obtained by integrating (or summing) out every other random variable except the one in question, e.g.,
f_1(x_1) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_n)\,dx_2\,dx_3 \cdots dx_n    (5.30)
It is important to note that when n > 2, marginal distributions themselves can be multivariate. For example, f_{12}(x_1, x_2) is what is left of the joint pdf f(x_1, x_2, \ldots, x_n) after integrating (or summing) over the remaining (n - 2) variables; it is a bivariate pdf of the two surviving random variables of interest. The concepts are simple and carry over directly; however, the notation can become quite confusing if one is not careful. We shall return to this point a bit later in this chapter.
5.2.3 Conditional Distributions
If the joint pdf f(x_1, x_2) of a bivariate random variable provides a description of how the two component random variables vary jointly, and if the marginal distributions f_1(x_1) and f_2(x_2) describe how each random variable behaves by itself, in isolation, without regard to the other, there remains yet one more characteristic of importance: a description of how X_1 behaves for given specific values of X_2, and vice versa, how X_2 behaves for specific values of X_1 (i.e., the probability distribution of X_1 conditioned upon X_2 taking on specific values, and vice versa). Such "conditional" distributions are defined as follows:
f(x_1|x_2) = \frac{f(x_1, x_2)}{f_2(x_2)}; \; f_2(x_2) > 0    (5.31)
and
f(x_2|x_1) = \frac{f(x_1, x_2)}{f_1(x_1)}; \; f_1(x_1) > 0    (5.32)
The similarity between these equations and the expression for conditional probabilities of events defined as sets, as given in Eq (3.40) of Chapter 3,
P(A|B) = \frac{P(A \cap B)}{P(B)}    (5.33)
is worth noting. Observe also that integrating f(x_1|x_2) with respect to x_1 gives:
\int_{-\infty}^{\infty} f(x_1|x_2)\,dx_1 = \frac{\int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1}{f_2(x_2)}    (5.34)
the numerator of which is recognized from Eq (5.21) as the marginal distribution of X_2, so that:
\int_{-\infty}^{\infty} f(x_1|x_2)\,dx_1 = \frac{f_2(x_2)}{f_2(x_2)} = 1    (5.35)
The same result holds for f(x_2|x_1) in Eq (5.32) when integrated with respect to x_2; and, by replacing the integrals with sums, we obtain identical results for the discrete case.
Example 5.4 CONDITIONAL DISTRIBUTIONS OF CONTINUOUS BIVARIATE RANDOM VARIABLE
Find the conditional distributions of the 2-dimensional random variables given in Example 5.2 for the reliability of a temperature control
system.
Solution:
Recall from the previous examples that the joint pdf is:
f(x_1, x_2) = \begin{cases} \frac{1}{50} e^{-(0.2x_1+0.1x_2)}; & 0 < x_1 < \infty; \; 0 < x_2 < \infty \\ 0; & \text{elsewhere} \end{cases}
Recalling the result obtained in Example 5.3 for the marginal pdfs f_1(x_1) and f_2(x_2), the desired conditional pdfs are given as follows:
f(x_1|x_2) = \frac{\frac{1}{50} e^{-(0.2x_1+0.1x_2)}}{\frac{1}{10} e^{-0.1x_2}} = \frac{1}{5} e^{-0.2x_1}    (5.36)
and
f(x_2|x_1) = \frac{\frac{1}{50} e^{-(0.2x_1+0.1x_2)}}{\frac{1}{5} e^{-0.2x_1}} = \frac{1}{10} e^{-0.1x_2}    (5.37)
The reader may have noticed two things about this specific example: (i) f(x_1|x_2) is entirely a function of x_1 alone, containing no x_2 whose value is to be fixed; the same is true for f(x_2|x_1), which is entirely a function of x_2, with no dependence on x_1. (ii) In fact, not only is f(x_1|x_2) a function of x_1 alone; it is precisely the same function as the unconditional marginal pdf f_1(x_1) obtained earlier. The same is obtained for f(x_2|x_1), which also turns out to
be the same as the unconditional marginal pdf f2 (x2 ) obtained earlier. Such
circumstances do not always occur for all 2-dimensional random variables, as
the next example shows; but the special cases where f (x1 |x2 ) = f1 (x1 ) and
f (x2 |x1 ) = f2 (x2 ) are indicative of a special relationship between the two
random variables X1 and X2 , as discussed later in this chapter.
Example 5.5 CONDITIONAL DISTRIBUTIONS OF ANOTHER CONTINUOUS BIVARIATE RANDOM VARIABLE
Find the conditional distributions of the 2-dimensional random variables whose joint pdf is given as follows:
f(x_1, x_2) = \begin{cases} (x_1 - x_2); & 1 < x_1 < 2; \; 0 < x_2 < 1 \\ 0; & \text{elsewhere} \end{cases}    (5.38)
shown graphically in Fig 5.1.
FIGURE 5.1: Graph of the joint pdf for the 2-dimensional random variable of Example 5.5
Solution:
To find the conditional distributions, we must first find the marginal distributions. (As an exercise, the reader may want to confirm that this joint pdf is a legitimate pdf.) These marginal distributions are obtained as follows:
f_1(x_1) = \int_0^1 (x_1 - x_2)\,dx_2 = \left[x_1 x_2 - \frac{x_2^2}{2}\right]_0^1    (5.39)
which simplifies to give:
f_1(x_1) = \begin{cases} (x_1 - 0.5); & 1 < x_1 < 2 \\ 0; & \text{elsewhere} \end{cases}    (5.40)
Similarly,
f_2(x_2) = \int_1^2 (x_1 - x_2)\,dx_1 = \left[\frac{x_1^2}{2} - x_1 x_2\right]_1^2    (5.41)
which simplifies to give:
f_2(x_2) = \begin{cases} (1.5 - x_2); & 0 < x_2 < 1 \\ 0; & \text{elsewhere} \end{cases}    (5.42)
Again the reader may want to confirm that these marginal pdfs are legitimate pdfs.
With these marginal pdfs in hand, we can now determine the required conditional distributions as follows:
f(x_1|x_2) = \frac{(x_1 - x_2)}{(1.5 - x_2)}; \; 1 < x_1 < 2    (5.43)
and
f(x_2|x_1) = \frac{(x_1 - x_2)}{(x_1 - 0.5)}; \; 0 < x_2 < 1    (5.44)
(The reader should be careful to note that we did not need to explicitly impose the restrictive conditions x_2 \ne 1.5 and x_1 \ne 0.5 in the expressions given above so as to exclude the respective singularity points for f(x_1|x_2) and for f(x_2|x_1). This is because the original space over which the joint distribution f(x_1, x_2) was defined, V = \{(x_1, x_2) : 1 < x_1 < 2; 0 < x_2 < 1\}, already excludes these otherwise troublesome points.)
Observe now that these conditional distributions show mutual dependence of x_1 and x_2, unlike in Example 5.4. In particular, say for x_2 = 1 (the rightmost edge of the x_2-axis of the plane in Fig 5.1), the conditional pdf f(x_1|x_2) becomes:
f(x_1|x_2 = 1) = 2(x_1 - 1); \; 1 < x_1 < 2    (5.45)
whereas, for x_2 = 0 (the leftmost edge of the x_2-axis of the plane in Fig 5.1), this conditional pdf becomes:
f(x_1|x_2 = 0) = \frac{2x_1}{3}; \; 1 < x_1 < 2    (5.46)
Similar arguments can be made for f(x_2|x_1) and are left as an exercise for the reader.
The following example provides a comprehensive illustration of these distributions specifically for a discrete bivariate random variable.
Example 5.6 DISTRIBUTIONS OF DISCRETE BIVARIATE RANDOM VARIABLE
An Apple computer store in a small town stocks only three types of hardware components: "low-end", "mid-level" and "high-end", selling respectively for $1600, $2000 and $2400; it also only stocks two types of monitors, the 20-inch type, selling for $600, and the 23-inch type, selling for $900. The random variables of interest are X_1, the cost of the hardware component, and X_2, the cost of the monitor involved in each recorded sale; the joint pdf, f(x_1, x_2), is shown in Table 5.1.

TABLE 5.1: Joint pdf for computer store sales
X_1 \ X_2    $600    $900
$1600        0.30    0.25
$2000        0.20    0.10
$2400        0.10    0.05
(1) Show that f(x_1, x_2) is a legitimate pdf and find the sales combination (x_1, x_2) with the highest probability, and the one with the lowest probability.
(2) Obtain the marginal pdfs f_1(x_1) and f_2(x_2), and from these compute P(X_1 = $2000), regardless of X_2 (i.e., the probability of selling a mid-level hardware component regardless of the monitor paired with it). Also obtain P(X_2 = $900) regardless of X_1 (i.e., the probability of selling a 23-inch monitor, regardless of the hardware component with which it is paired).
(3) Obtain the conditional pdfs f(x_1|x_2) and f(x_2|x_1) and determine the highest value for each conditional probability; describe in words what each means.
Solution:
(1) If f(x_1, x_2) is a legitimate pdf, then it must hold that
\sum_{x_1} \sum_{x_2} f(x_1, x_2) = 1    (5.47)
From the joint pdf shown in the table, this amounts to adding up all the 6 entries, a simple arithmetic exercise that yields the desired result.
The combination with the highest probability is seen to be X_1 = $1600; X_2 = $600, since P(X_1 = $1600; X_2 = $600) = 0.3; i.e., the probability is highest (at 0.3) that any customer chosen at random would have purchased the low-end hardware (for $1600) and the 20-inch monitor (for $600). The lowest probability of 0.05 is associated with X_1 = $2400 and X_2 = $900, i.e., the combination of a high-end hardware component and a 23-inch monitor.
(2) By definition, the marginal pdf f_1(x_1) is given by:
f_1(x_1) = \sum_{x_2} f(x_1, x_2)    (5.48)
so that, from the table, f_1(1600) = 0.30 + 0.25 = 0.55; similarly, f_1(2000) = 0.30 and f_1(2400) = 0.15. In the same manner, the values for f_2(x_2) are obtained as f_2(600) = 0.30 + 0.20 + 0.10 = 0.60, and f_2(900) = 0.40. These values are combined with the original joint pdf into a new Table 5.2 to provide a visual representation of the relationship between these distributions.

TABLE 5.2: Joint and marginal pdfs for computer store sales
X_1 \ X_2    $600    $900    f_1(x_1)
$1600        0.30    0.25    0.55
$2000        0.20    0.10    0.30
$2400        0.10    0.05    0.15
f_2(x_2)     0.60    0.40    1.00

The required probabilities are therefore:
P(X_1 = $2000) = f_1(2000) = 0.30    (5.49)
P(X_2 = $900) = f_2(900) = 0.40    (5.50)
(3) By definition, the required conditional pdfs are:
f(x_1|x_2) = \frac{f(x_1, x_2)}{f_2(x_2)}; \text{ and } f(x_2|x_1) = \frac{f(x_1, x_2)}{f_1(x_1)}    (5.51)
and upon carrying out the indicated divisions using the numbers contained in Table 5.2, we obtain the result shown in Table 5.3 for f (x1 |x2 ),
and in Table 5.4 for f (x2 |x1 ). From these tables, we obtain the highest conditional probability for f (x1 |x2 ) as 0.625, corresponding to the
probability of a customer buying the low end hardware component
(X1 = $1600) conditioned upon having bought the 23-inch monitor
(X2 = $900); i.e., in the entire population of those who bought the 23inch monitor, the probability is highest at 0.625 that a low-end hardware
component was purchased to go along with the monitor. When the conditioning variable is the hardware component, the highest conditional
probability f (x2 |x1 ) is a tie at 0.667 for customers buying the 20-inch
monitor (X2 = $600) conditioned upon buying the mid-range hardware
(X1 = $2000), and those buying the high-end hardware (X1 = $2400).
TABLE 5.3: Conditional pdf f(x_1|x_2) for computer store sales
X_1 \ X_2    $600    $900
$1600        0.500   0.625
$2000        0.333   0.250
$2400        0.167   0.125

TABLE 5.4: Conditional pdf f(x_2|x_1) for computer store sales
X_1 \ X_2    $600    $900
$1600        0.545   0.455
$2000        0.667   0.333
$2400        0.667   0.333
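All of the bookkeeping in Example 5.6 reduces to row and column operations on the joint-pdf table, as this sketch illustrates (our illustration; array layout: rows are x_1 values, columns are x_2 values):

import numpy as np

# joint pdf of Table 5.1: rows x1 = (1600, 2000, 2400), cols x2 = (600, 900)
P = np.array([[0.30, 0.25],
              [0.20, 0.10],
              [0.10, 0.05]])

f1 = P.sum(axis=1)   # marginal of X1: [0.55, 0.30, 0.15]  (Table 5.2)
f2 = P.sum(axis=0)   # marginal of X2: [0.60, 0.40]

cond_x1_given_x2 = P / f2            # f(x1|x2): divide each column by f2 (Table 5.3)
cond_x2_given_x1 = P / f1[:, None]   # f(x2|x1): divide each row by f1   (Table 5.4)

print(P.sum())                 # 1.0: legitimate pdf
print(cond_x1_given_x2.max())  # 0.625, for x1 = $1600 given x2 = $900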
5.2.4 General Extensions
For example, with five random variables (X_1, X_2, X_3, X_4, X_5), conditioning on X_2 and X_3 gives:
f(x_1, x_4, x_5|x_2, x_3) = \frac{f(x_1, x_2, x_3, x_4, x_5)}{f_{23}(x_2, x_3)}    (5.52)
where f23 (x2 , x3 ) is the bivariate joint marginal pdf for cholesterol level. We
see therefore that the principles transfer quite directly, and, when dealing with
specic cases in practice (as we have just done), there is usually no confusion.
The challenge is how to generalize without confusion.
To present the results in a general fashion and avoid confusion requires adopting a different notation: using the vector X to represent the entire collection of random variables, i.e., X = (X_1, X_2, \ldots, X_n), and then partitioning this into three distinct vectors: X^*, the variables of interest (X_4, X_5 in the Avandia example given above); Y, the conditioning variables (X_2, X_3 in the Avandia example); and Z, the remaining variables, if any. With this notation, we now have
f(x^*|y) = \frac{f(x^*, y, z)}{f_y(y)}    (5.53)
as the most general multivariate conditional distribution.
5.3
The concepts of mathematical expectation and moments used to characterize the distribution of single random variables in Chapter 4 can be extended to
multivariate, jointly distributed random variables. Even though we now have
many more versions of pdfs to consider (joint, marginal and conditional), the
primary notions remain the same.
5.3.1 Expectations
For the n-dimensional random variable X = (X_1, X_2, \ldots, X_n) with joint pdf f(x_1, x_2, \ldots, x_n), the expectation of a function U(X) is defined, in the continuous case, as:
E[U(X)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} U(x_1, x_2, \ldots, x_n) f(x_1, x_2, \ldots, x_n)\,dx_1 dx_2 \cdots dx_n    (5.54)
a direct extension of the single variable definition. The discrete counterpart is:
E[U(X)] = \sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} U(x_1, x_2, \ldots, x_n) f(x_1, x_2, \ldots, x_n)    (5.55)
For the reactor control system of Example 5.2 (with X_1 the lifetime of the control hardware electronics and X_2 the lifetime of the control valve), the marginal expectations work out to E(X_1) = 5 years and E(X_2) = 10 years, so that:

E[X_2 - X_1] = E(X_2) - E(X_1) = 10 - 5 = 5    (5.59)

The immediate implication is that the expected lifetime differential favors the control valve (lifetime X_2), so that the control hardware electronic component is expected to fail first, with the control valve expected to outlast it by 5 years.
Example 5.8 EXPECTATIONS OF DISCRETE BIVARIATE
RANDOM VARIABLE
From the joint pdf given in Example 5.6 for the Apple computer store
sales, obtain the expected revenue from each recorded sale.
Solution:
Recall that for this problem, the random variables of interest are X1 ,
the cost of the computer hardware component, and X2 , the cost of the
monitor in each recorded sale. The appropriate function U(X_1, X_2) in this case is

U(X_1, X_2) = X_1 + X_2    (5.60)

the total amount of money realized on each sale. By the definition of expectations for the discrete bivariate random variable, we have

E[U(X_1, X_2)] = \sum_{x_1} \sum_{x_2} (x_1 + x_2) f(x_1, x_2)    (5.61)

which, upon introducing the values in Table 5.2, yields:

E[U(X_1, X_2)] = 2200(0.30) + 2500(0.25) + 2600(0.20) + 2900(0.10) + 3000(0.10) + 3300(0.05) = $2,560    (5.62)
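As a quick numerical cross-check of the arithmetic in Eq (5.62) (again an illustrative sketch, not part of the original text):

    # Expected revenue per sale, E[X1 + X2], per Eq (5.61)
    f = {(1600, 600): 0.30, (1600, 900): 0.25,
         (2000, 600): 0.20, (2000, 900): 0.10,
         (2400, 600): 0.10, (2400, 900): 0.05}
    print(sum((x1 + x2) * p for (x1, x2), p in f.items()))  # approximately 2560.0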
In the special case where U(X) = e^{(t_1 X_1 + t_2 X_2)}, the expectation E[U(X)] is the joint moment generating function, M(t_1, t_2), for the bivariate random variable X = (X_1, X_2), defined by

M(t_1, t_2) = E[e^{(t_1 X_1 + t_2 X_2)}] = \int_{-∞}^{∞} \int_{-∞}^{∞} e^{(t_1 x_1 + t_2 x_2)} f(x_1, x_2) dx_1 dx_2;
M(t_1, t_2) = E[e^{(t_1 X_1 + t_2 X_2)}] = \sum_{x_1} \sum_{x_2} e^{(t_1 x_1 + t_2 x_2)} f(x_1, x_2)    (5.63)

for the continuous and the discrete cases, respectively; an expression that generalizes directly to the n-dimensional random variable.
Marginal Expectations
Recall that for the general n-dimensional random variable X = (X_1, X_2, ..., X_n), the single variable marginal distribution f_i(x_i) is the distribution of the component random variable X_i alone, as if the others did not exist. It is therefore similar to the single random variable pdf dealt with extensively in Chapter 4. As such, the marginal expectation of U(X_i) is precisely as defined in Chapter 4, i.e.,

E[U(X_i)] = \int_{-∞}^{∞} U(x_i) f_i(x_i) dx_i    (5.64)

for the continuous case, with the discrete counterpart:

E[U(X_i)] = \sum_{x_i} U(x_i) f_i(x_i)    (5.65)
In particular, setting t_2 = 0 (or t_1 = 0) in the joint MGF of Eq (5.63) shows that

M_1(t_1) = M(t_1, 0) = E[e^{t_1 X_1}]    (5.68)

M_2(t_2) = M(0, t_2) = E[e^{t_2 X_2}]    (5.69)

are, respectively, the marginal MGFs for f_1(x_1) and for f_2(x_2).
Keep in mind that in the general case, marginal distributions can be multivariate; in this case, the context of the problem at hand will make clear what
such a joint-marginal distribution will look like after the remaining variables
have been integrated out.
Conditional Expectations
As in the discussion about conditional distributions, it is best to deal with bivariate conditional expectations first. For the bivariate random variable X = (X_1, X_2), the conditional expectation E[U(X_1)|X_2] (i.e., the expectation of the function U(X_1) conditioned upon X_2 = x_2) is obtained from the conditional distribution as follows:

E[U(X_1)|X_2] = \int_{-∞}^{∞} U(x_1) f(x_1|x_2) dx_1;  continuous X
E[U(X_1)|X_2] = \sum_{x_1} U(x_1) f(x_1|x_2);  discrete X    (5.70)

with a corresponding expression for E[U(X_2)|X_1] based on the conditional distribution f(x_2|x_1). In particular, when U(X_1) = X_1 (or, U(X_2) = X_2), the result is the conditional mean defined by:

E[X_1|X_2] = \int_{-∞}^{∞} x_1 f(x_1|x_2) dx_1;  continuous X
E[X_1|X_2] = \sum_{x_1} x_1 f(x_1|x_2);  discrete X    (5.71)

written compactly as:

μ_{X_1|x_2} = E[X_1|x_2]    (5.72)

Similarly, the conditional variance, σ²_{X_1|x_2}, is defined as:

σ²_{X_1|x_2} = \int_{-∞}^{∞} (x_1 - μ_{X_1|x_2})² f(x_1|x_2) dx_1;  continuous X
σ²_{X_1|x_2} = \sum_{x_1} (x_1 - μ_{X_1|x_2})² f(x_1|x_2);  discrete X    (5.73)

For example, from Table 5.3, the conditional mean of the hardware cost given that a 23-inch monitor was purchased is μ_{X_1|900} = 1600(0.625) + 2000(0.250) + 2400(0.125) = $1,800.
5.3.2 Covariance and Correlation

For two random variables X_1 and X_2, with respective marginal expectations μ_{X_1} and μ_{X_2}, the covariance is defined as:

σ_{12} = Cov(X_1, X_2) = E[(X_1 - μ_{X_1})(X_2 - μ_{X_2})]    (5.75)

A scaled, dimensionless version of this quantity is:

ρ = σ_{12}/(σ_1 σ_2)    (5.76)

where σ_1 and σ_2 are the positive square roots of the respective marginal variances of X_1 and X_2. ρ is known as the correlation coefficient, with the attractive property that

-1 ≤ ρ ≤ 1    (5.77)
The most important points to note about the covariance, σ_{12}, or the correlation coefficient, ρ, are as follows:

1. σ_{12} will be positive if values of X_1 > μ_{X_1} are generally associated with values of X_2 > μ_{X_2}, or when values of X_1 < μ_{X_1} tend to be associated with values of X_2 < μ_{X_2}. Such variables are said to be positively correlated and ρ will be positive (ρ > 0), with the strength of the correlation indicated by the absolute value of ρ: weakly correlated variables will have low ρ values close to zero while strongly correlated variables will have ρ values close to 1. (See Fig 5.2.) For perfectly positively correlated variables, ρ = 1.

2. The reverse is the case when σ_{12} is negative: for such variables, values of X_1 > μ_{X_1} appear preferentially together with values of X_2 < μ_{X_2}, or else values of X_1 < μ_{X_1} tend to be associated more with values of X_2 > μ_{X_2}. In this case, the variables are said to be negatively correlated and ρ will be negative (ρ < 0); once again, with the strength of correlation indicated by the absolute value of ρ. (See Fig 5.3.) For perfectly negatively correlated variables, ρ = -1.

3. If the behavior of X_1 has little or no bearing with that of X_2, as one might expect, σ_{12} and ρ will tend to be close to zero (see Fig 5.4); and when the two random variables are completely independent of each other, then both σ_{12} and ρ will be exactly zero.
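These three behaviors are easy to visualize by simulation. The short sketch below (illustrative only; the construction of the correlated pairs is our own choice, not part of the original text) generates strongly positively correlated, strongly negatively correlated, and essentially uncorrelated pairs, and prints the sample correlation coefficient for each case:

    import numpy as np

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=100_000)
    noise = rng.normal(size=100_000)

    for label, x2 in [("positive", x1 + 0.5 * noise),
                      ("negative", -x1 + 0.5 * noise),
                      ("near zero", noise)]:
        rho = np.corrcoef(x1, x2)[0, 1]   # sample estimate of rho
        print(label, round(rho, 3))       # roughly +0.9, -0.9, and 0.0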
This last point brings up the concept of stochastic independence.

FIGURE 5.2: Scatter plot of positively correlated variables (X_2 vs X_1)

FIGURE 5.3: Scatter plot of negatively correlated variables (X_2 vs X_1)

FIGURE 5.4: Scatter plot of essentially uncorrelated variables (X_2 vs X_1)

5.3.3 Independence
Formally, two random variables X_1 and X_2 are said to be stochastically independent if:

1. f(x_1|x_2) = f_1(x_1)    (5.78)

2. f(x_2|x_1) = f_2(x_2)    (5.79)

3. the joint pdf is the product of the two marginal pdfs.

Substituting Eq (5.78) into the definition of the conditional pdf immediately yields

f(x_1, x_2) = f_1(x_1) f_2(x_2)    (5.80)

which, first of all, is item 3 in the definition above, but just as importantly, when substituted into the numerator of the expression in Eq (5.31), i.e.,

f(x_1|x_2) = f(x_1, x_2)/f_2(x_2)    (5.81)

simply recovers Eq (5.78). For the two-coin toss problem of Example 5.1, the marginal pdfs are obtained from the joint pdf as:

f_1(x_1) = f(x_1, 0) + f(x_1, 1)    (5.82)

f_2(x_2) = f(0, x_2) + f(1, x_2)    (5.83)
TABLE 5.5: Joint and marginal pdfs for the two-coin toss problem of Example 5.1

                   X_2
X_1          0       1       f_1(x_1)
0            1/4     1/4     1/2
1            1/4     1/4     1/2
f_2(x_2)     1/2     1/2     1

so that:

f_1(x_1) = 1/2, x_1 = 0;  1/2, x_1 = 1;  0, otherwise    (5.84)

f_2(x_2) = 1/2, x_2 = 0;  1/2, x_2 = 1;  0, otherwise    (5.85)
If we now tabulate the joint pdf and the marginal pdfs, we obtain the
result in Table 5.5. It is now clear that for all x1 and x2 ,
f (x1 , x2 ) = f1 (x1 )f2 (x2 )
(5.86)
establishing that X_1 and X_2 are independent. Furthermore, for such independent random variables,

E(X_1 X_2) = E(X_1)E(X_2)    (5.87)

so that:

σ_{12} = 0    (5.88)

ρ = 0    (5.89)

since, by definition,

σ_{12} = E[(X_1 - μ_{X_1})(X_2 - μ_{X_2})]    (5.90)
       = E(X_1 X_2) - μ_{X_1} μ_{X_2}    (5.91)
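A direct computation on the two-coin joint pdf of Table 5.5 confirms Eqs (5.88)-(5.91) (an illustrative sketch, not part of the original text):

    # Covariance for the two-coin joint pdf of Table 5.5, per Eq (5.90)
    f = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
    mu1 = sum(x1 * p for (x1, _x2), p in f.items())   # 0.5
    mu2 = sum(x2 * p for (_x1, x2), p in f.items())   # 0.5
    cov = sum((x1 - mu1) * (x2 - mu2) * p for (x1, x2), p in f.items())
    print(cov)  # 0.0, as expected for independent random variables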
5.4 Summary and Conclusions
The primary objective of this chapter was to extend the ideas presented in Chapter 4 for the single random variable to the multidimensional case, where the outcome of interest involves two or more random variables simultaneously. With such higher-dimensional random variables, it became necessary to introduce a new variety of pdfs different from, but still related to, the familiar one encountered in Chapter 4: the joint pdf, to characterize joint variation among the variables; the marginal pdfs, to characterize the individual behavior of each variable in isolation from the others; and the conditional pdfs, to characterize the behavior of one random variable conditioned upon fixing the others at pre-specified values. This new array of pdfs provides the full set of mathematical tools for characterizing various aspects of multivariate random variables, much as the f(x) of Chapter 4 did for single random variables.

The possibility of two or more random variables co-varying simultaneously, which was not of concern with single random variables, led to the introduction of two additional and related quantities, covariance and correlation, with which one quantifies the mutual dependence of two random variables. This in turn led to the important concept of stochastic independence, that one random variable is entirely unaffected by another. As we shall see in subsequent chapters, when dealing with multiple random variables, the analysis of joint behavior is considerably simplified if the random variables in question are independent. We shall therefore have cause to recall some of the results of this chapter at that time.
Here are some of the main points of this chapter again.
• A multivariate random variable is defined in the same manner as a single random variable, but the associated space, V, is higher-dimensional;

• The joint pdf of a bivariate random variable, f(x_1, x_2), shows how the probabilities are distributed over the two-dimensional random variable space; the joint cdf, F(x_1, x_2), represents the probability, P(X_1 < x_1; X_2 < x_2); they both extend directly to higher-dimensional random variables.

• In addition to the joint pdf, two other pdfs are needed to characterize multi-dimensional random variables fully:

  - Marginal pdf: f_i(x_i) characterizes the individual behavior of each random variable, X_i, by itself, regardless of the others;

  - Conditional pdf: f(x_i|x_j) characterizes the behavior of X_i conditioned upon X_j taking on specific values.

  These pdfs can be used to obtain such random variable characteristics as joint, marginal and conditional expectations.

• The covariance of two random variables, X_1 and X_2, defined as

  σ_{12} = E[(X_1 - μ_{X_1})(X_2 - μ_{X_2})]

  (where μ_{X_1} and μ_{X_2} are the respective marginal expectations), provides a measure of the mutual dependence of variations in X_1 and X_2. The related correlation coefficient, the scaled quantity:

  ρ = σ_{12}/(σ_1 σ_2)

  (where σ_1 and σ_2 are the positive square roots of the respective marginal variances of X_1 and X_2), has the property that -1 ≤ ρ ≤ 1, with |ρ| indicating the strength of the mutual dependence, and the sign indicating the direction (negative or positive).

• Two random variables, X_1 and X_2, are independent if the behavior of one has no bearing on the behavior of the other; more formally,

  f(x_1|x_2) = f_1(x_1);  f(x_2|x_1) = f_2(x_2);

  so that,

  f(x_1, x_2) = f_1(x_1) f_2(x_2)
REVIEW QUESTIONS
1. What characteristic of the Avandia clinical test makes it relevant to the discussion of this chapter?
2. How many random variables at a time can the probability machinery of Chapter
4 deal with?
3. In dealing with several random variables simultaneously, what are some of the
questions to be considered that were not of concern when dealing with single random
variables in Chapter 4?
4. Define a bivariate random variable formally.

5. Informally, what is a bivariate random variable?

6. Define a multivariate random variable formally.

7. State the axiomatic definition of the joint pdf of a discrete bivariate random variable and of its continuous counterpart.

8. What is the general relationship between the cdf, F(x_1, x_2), of a continuous bivariate random variable and its pdf, f(x_1, x_2)? What conditions must be satisfied for this relationship to exist?

9. Define the marginal distributions, f_1(x_1) and f_2(x_2), for a two-dimensional random variable with a joint pdf f(x_1, x_2).

10. Do marginal pdfs possess the usual properties of pdfs or are they different?

11. Given a bivariate joint pdf, f(x_1, x_2), define the conditional pdfs, f(x_1|x_2) and f(x_2|x_1).

12. In what way is the definition of a conditional pdf similar to the conditional probability of events A and B defined on a sample space, Ω?

13. Define the expectation, E[U(X_1, X_2)], for a bivariate random variable. Extend this to an n-dimensional (multivariate) random variable.

14. Define the marginal expectation, E[U(X_i)], for a bivariate random variable. Extend this to an n-dimensional (multivariate) random variable.

15. Define the conditional expectations, E[U(X_1)|X_2] and E[U(X_2)|X_1], for a bivariate random variable.

16. Given two random variables, X_1 and X_2, define their covariance.

17. What is the relationship between covariance and the correlation coefficient?

18. What does a negative correlation coefficient indicate about the relationship between two random variables, X_1 and X_2? What does a positive correlation coefficient indicate?

19. If the behavior of the random variable, X_1, has little bearing on that of X_2, how will this manifest in the value of the correlation coefficient, ρ?

20. When the correlation coefficient of two random variables, X_1 and X_2, is such that |ρ| ≈ 1, what does this indicate about the random variables?

21. What does it mean that two random variables, X_1 and X_2, are stochastically independent?

22. If two random variables are independent, what is the value of their covariance, and of their correlation coefficient?

23. When dealing with n > 2 random variables, what is the difference between pairwise stochastic independence and mutual stochastic independence? Does one always imply the other?
EXERCISES
Sections 5.1 and 5.2
5.1 Revisit Example 5.1 in the text and define the two-dimensional random variable (X_1, X_2) as follows: X_1 is the total number of heads, and X_2 is the total number of tails. Obtain the space, V, and determine the complete pdf, f(x_1, x_2), for x_1 = 0, 1, 2; x_2 = 0, 1, 2, assuming equiprobable outcomes in the original sample space.
5.2 The two-dimensional random variable (X_1, X_2) has the following joint pdf:

f(1, 1) = 1/4;    f(2, 1) = 3/8
f(1, 2) = 1/8;    f(2, 2) = 1/8
f(1, 3) = 1/16;   f(2, 3) = 1/16

(i) Determine the following probabilities: (a) P(X_1 ≤ X_2); (b) P(X_1 + X_2 = 4); (c) P(|X_2 - X_1| = 1); (d) P(X_1 + X_2 is even).
(ii) Obtain the joint cumulative distribution function, F(x_1, x_2).
5.3 In a game of chess, one player either wins, W, loses, L, or draws, D (either by mutual agreement with the opponent, or as a result of a stalemate). Consider a player participating in a two-game, pre-tournament qualification series:
(i) Obtain the sample space, Ω.
(ii) Define the two-dimensional random variable (X_1, X_2) where X_1 is the total number of wins, and X_2 is the total number of draws. Obtain V and, assuming equiprobable outcomes in the original sample space, determine the complete joint pdf, f(x_1, x_2).
(iii) If the player is awarded 3 points for a win, 1 point for a draw and no point for a loss, define the random variable Y as the total number of points assigned to a player at the end of the two-game preliminary round. If a player needs at least 4 points to qualify, determine the probability of qualifying.
5.4 Revisit Exercise 5.3 above but this time consider three players: Suzie, the superior player, for whom the probability of winning a game is p_W = 0.75, the probability of drawing is p_D = 0.2, and the probability of losing is p_L = 0.05; Meredith, the mediocre player, for whom p_W = 0.5; p_D = 0.3; p_L = 0.2; and Paula, the poor player, for whom p_W = 0.2; p_D = 0.3; p_L = 0.5. Determine the complete joint pdf for each player, f_S(x_1, x_2) for Suzie, f_M(x_1, x_2) for Meredith, and f_P(x_1, x_2) for Paula; and from these, determine for each player, the probability that she qualifies for the tournament.
5.5 The continuous random variables X_1 and X_2 have the joint pdf

f(x_1, x_2) = c x_1 x_2 (1 - x_2);  0 < x_1 < 2; 0 < x_2 < 1
f(x_1, x_2) = 0;  elsewhere    (5.93)

5.7 The two-dimensional discrete random variable (X_1, X_2) has the joint pdf, f(x_1, x_2), given in the following table:

             x_2 = 0    x_2 = 1    x_2 = 2
x_1 = 0      0          0          1/4
x_1 = 1      0          1/2        0
x_1 = 2      1/4        0          0

(i) Obtain the marginal pdfs, f_1(x_1) and f_2(x_2), and determine whether or not X_1 and X_2 are independent.
(ii) Obtain the conditional pdfs f(x_1|x_2) and f(x_2|x_1). Describe in words what these results imply in terms of the original experiments and these random variables.
(iii) It is conjectured that this joint pdf is for an experiment involving tossing a fair coin twice, with X_1 as the total number of heads, and X_2 as the total number of tails. Are the foregoing results consistent with this conjecture? Explain.
5.8 Given the joint pdf:

f(x_1, x_2) = c e^{-(x_1 + x_2)};  0 < x_1 < x_2 < ∞
f(x_1, x_2) = 0;  elsewhere    (5.94)

First obtain c, then obtain the marginal pdfs f_1(x_1) and f_2(x_2), and hence determine whether or not X_1 and X_2 are independent.
5.9 If the range of validity of the joint pdf in Exercise 5.8 and Eq (5.94) is modified to 0 < x_1 < ∞ and 0 < x_2 < ∞, obtain c and the marginal pdfs, and then determine whether or not these random variables are now independent.
Section 5.3
5.10 Revisit Exercise 5.3. From the joint pdf determine
(i) E[U (X1 , X2 ) = X1 + X2 ].
(ii) E[U (X1 , X2 ) = 3X1 + X2 ]. Use this result to determine if the player will be
expected to qualify or not.
5.11 For each of the three players in Exercise 5.4,
(i) Determine the marginal pdfs, f_1(x_1) and f_2(x_2), and the marginal means, μ_{X_1} and μ_{X_2}.
(ii) Determine E[U (X1 , X2 ) = 3X1 + X2 ] and use the result to determine which of
the three players, if any, will be expected to qualify for the tournament.
5.12 Determine the covariance and correlation coefficient for the two random variables whose joint pdf, f(x_1, x_2), is given in the table in Exercise 5.7.
5.13 For each of the three chess players in Exercise 5.4, Suzie, Meredith, and Paula, and from the joint pdf of each player's performance at the pre-tournament qualifying games, determine the covariance and correlation coefficients for each player. Discuss what these results imply in terms of the relationship between wins and draws for each player.
5.14 The joint pdf for two random variables X and Y is given as:

f(x, y) = x + y;  0 < x < 1; 0 < y < 1
f(x, y) = 0;  elsewhere    (5.95)

(i) Obtain f(x|y) and f(y|x) and show that these two random variables are not independent.
(ii) Obtain the covariance, σ_{XY}, and the correlation coefficient, ρ. Comment on the strength of the correlation between these two random variables.
APPLICATION PROBLEMS
5.15 Refer to Application Problem 3.23 in Chapter 3, where the relationship between a blood assay used to determine lithium concentration in blood samples and lithium toxicity in 150 patients was presented in a table reproduced here for ease of reference.

                          Assay
Lithium Toxicity     A+      A-      Total
L+                   30      21      51
L-                   17      82      99
Total                47      103     150
(i) In general, consider the assay result as the random variable Y having two possible outcomes y_1 = A+ and y_2 = A-; and consider the true lithium toxicity status as the random variable X also having two possible outcomes x_1 = L+ and x_2 = L-. Now consider that the relative frequencies (or proportions) indicated in the data table can be considered close enough to the true probabilities; convert the data table to a table of the joint probability distribution f(x, y). What is the probability that the test method will produce the right result?
(ii) From the table of the joint pdf, compute the following probabilities and explain what they mean in words in terms of the problem at hand: f(y_2|x_2); f(y_1|x_2); f(y_2|x_1).
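For part (i), the conversion from counts to an (approximate) joint pdf is a one-line operation; the following Python sketch (illustrative only, with our own variable names) shows the mechanics, including one of the conditional probabilities of part (ii):

    # Counts from the lithium toxicity table, keyed as (assay, toxicity)
    counts = {("A+", "L+"): 30, ("A+", "L-"): 17,
              ("A-", "L+"): 21, ("A-", "L-"): 82}
    n = sum(counts.values())                    # 150 patients
    f = {k: c / n for k, c in counts.items()}   # approximate joint pdf

    # P(test produces the right result) = f(A+, L+) + f(A-, L-)
    print(f[("A+", "L+")] + f[("A-", "L-")])    # (30 + 82)/150, about 0.747

    # One conditional probability for part (ii): f(y1|x2) = P(A+ | L-)
    p_Lminus = f[("A+", "L-")] + f[("A-", "L-")]
    print(f[("A+", "L-")] / p_Lminus)           # 17/99, about 0.172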
5.16 The reliability of the temperature control system for a commercial, highly
exothermic polymer reactor presented in Example 5.2 in the text is known to depend
on the lifetimes (in years) of the control hardware electronics, X1 , and of the control
valve on the cooling water line, X2 ; the joint pdf is:
f(x_1, x_2) = (1/50) e^{-(0.2 x_1 + 0.1 x_2)};  0 < x_1 < ∞; 0 < x_2 < ∞
f(x_1, x_2) = 0;  elsewhere
(i) Determine the probability that the control valve outlasts the control hardware
electronics.
(ii) Determine the converse probability that the controller hardware electronics outlast the control valve.
(iii) If a component is replaced every time it fails, how frequently can one expect to
replace the control valve, and how frequently can one expect to replace the controller
hardware electronics?
(iv) If it costs $20,000 to replace the control hardware electronics and $10,000 to
replace the control valve, how much should be budgeted over the next 20 years for
keeping the control system functioning, assuming all other characteristics remain
essentially the same over this period?
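Setting up the double integral for part (i) symbolically is a useful check on hand calculations. A possible sketch (illustrative; it presumes the sympy library and is not part of the original text):

    import sympy as sp

    x1, x2 = sp.symbols("x1 x2", positive=True)
    f = sp.Rational(1, 50) * sp.exp(-(x1 / 5 + x2 / 10))

    # (i) P(X2 > X1): integrate the joint pdf over the region x2 > x1
    p = sp.integrate(sp.integrate(f, (x2, x1, sp.oo)), (x1, 0, sp.oo))
    print(p)  # a simple fraction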
5.17 In a major bio-vaccine research company, it is inevitable that workers are exposed to some hazardous, but highly treatable, disease-causing agents. According to papers filed with the Safety and Hazards Authorities of the state in which the facility is located, the treatment provided is tailored to the worker's age (the variable X: 0 if younger than 30 years; 1 if 31 years or older), and location in the facility (a surrogate for the virulence of the proprietary strains used in various parts of the facility, represented by the variable Y = 1, 2, 3 or 4). The composition of the 2,500 employees at the company's research headquarters is shown in the table below:
                    Location
Age          1        2        3        4
< 30         6%       17%      20%      14%
≥ 31         13%      12%      10%      8%
(i) If a worker is infected at random so that the outcome is the bivariate random
variable (X, Y ) where X has two outcomes, and Y has four, obtain the pdf f (x, y)
from the given data (assuming each worker in each location has an equal chance of
infection); and determine the marginal pdfs f1 (x) and f2 (y).
(ii) What is the probability that a worker in need of treatment was infected in
location 3 or 4 given that he/she is < 30 years old?
(iii) If the cost of treating each infected worker (in dollars per year) is given by the expression

C = 1500 - 100Y + 500X    (5.96)
how much should the company expect to spend per worker every year, assuming the
worker composition remains the same year after year?
5.18 A non-destructive quality control test on a military weapon system correctly detects a flaw in the central electronic guidance subunit if one exists, or correctly accepts the system as fully functional if no flaw exists, 85% of the time; it incorrectly identifies a flaw when one does not exist (a "false positive") 5% of the time, and incorrectly fails to detect a flaw when one exists (a "false negative") 10% of the time. When the test is repeated 5 times under mostly identical conditions, if X_1 is the number of times the test is correct, and X_2 is the number of times it registers a false positive, the joint pdf of these two random variables is given as:

f(x_1, x_2) = [120/(x_1! x_2! (5 - x_1 - x_2)!)] 0.85^{x_1} 0.05^{x_2} 0.10^{(5 - x_1 - x_2)}    (5.97)
(i) Why is no consideration given in the expression in Eq (5.97) to the third random variable, X_3, the number of times the test registers a false negative?
(ii) From Eq (5.97), generate a 6 × 6 table of f(x_1, x_2) for all the possible outcomes and from this obtain the marginal pdfs, f_1(x_1) and f_2(x_2). Are these two random variables independent?
(iii) Determine the expected number of correct test results regardless of the other results; also determine the expected value of false positives regardless of other results.
(iv) What is the expected number of the total number of correct results and false positives? Is this value the same as the sum of the expected values obtained in (iii)? Explain.
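A sketch of the table generation in part (ii) follows (illustrative only; note that the full trinomial form of Eq (5.97), with the third factor in x_3 = 5 - x_1 - x_2, is the reconstruction adopted above):

    from math import factorial

    def f(x1, x2):
        # Trinomial pmf: correct (0.85), false positive (0.05), false negative (0.10)
        x3 = 5 - x1 - x2
        if x3 < 0:
            return 0.0
        return (factorial(5) / (factorial(x1) * factorial(x2) * factorial(x3))
                * 0.85**x1 * 0.05**x2 * 0.10**x3)

    table = {(x1, x2): f(x1, x2) for x1 in range(6) for x2 in range(6)}
    f1 = [sum(table[x1, x2] for x2 in range(6)) for x1 in range(6)]
    print(sum(table.values()))                    # 1.0: a valid pmf
    print(sum(x * p for x, p in enumerate(f1)))   # E(X1) = 5 * 0.85 = 4.25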
Chapter 6
Random Variable Transformations
Many problems of practical interest involve a random variable Y that is defined as a function of another random variable X, say according to Y = φ(X), so that the characteristics of the one arise directly from those of the other via the indicated transformation. In particular, if we already know the probability distribution function for X as f_X(x), it will be helpful to know how to determine the corresponding distribution function for Y. This chapter presents techniques for characterizing functions of random variables, and the results, important in their own right, become particularly useful in Part III where probability models are derived for random phenomena of importance in engineering and science.
6.1

The problem of primary interest may be stated as follows: given a random variable X with pdf f_X(x), determine the pdf f_Y(y) of the random variable Y defined as:

Y = φ(X)    (6.1)

More generally, given the n-dimensional random variable X = (X_1, X_2, ..., X_n), the task is to determine the joint pdf of the m random variables defined by:

Y_1 = φ_1(X_1, X_2, ..., X_n);
Y_2 = φ_2(X_1, X_2, ..., X_n);
...
Y_m = φ_m(X_1, X_2, ..., X_n)    (6.2)
As demonstrated in later chapters, these results are extremely useful in deriving probability models for more complicated random variables from the
probability models of simpler ones.
6.2

Consider first the single-variable transformation:

Y = φ(X)    (6.3)

When φ is a one-to-one mapping of V_X onto V_Y, the inverse transformation,

X = φ^{-1}(Y) = ψ(Y)    (6.4)

exists and is also one-to-one. The procedure for obtaining f_Y(y) given f_X(x) is highly dependent on the nature of the random variable in question, being more straightforward for the discrete case than for the continuous.
6.2.1 Discrete Case

For a discrete random variable X, the pdf of Y = φ(X) is obtained directly from:

f_Y(y) = P(Y = y) = P(X = ψ(y)) = f_X[ψ(y)];  y in V_Y    (6.5)

We illustrate this straightforward result first with the following simple example.
Example 6.1 LINEAR TRANSFORMATION OF A POISSON
RANDOM VARIABLE
As discussed in more detail in Part III, the discrete random variable X having the following pdf:

f_X(x) = (λ^x e^{-λ})/x!;  x = 0, 1, 2, 3, ...    (6.6)

is the Poisson random variable. Determine the pdf of the random variable Y obtained via the linear transformation:

Y = 2X    (6.7)

Solution:
The transformation is one-to-one, mapping V_X = {0, 1, 2, 3, ...} onto V_Y = {0, 2, 4, 6, ...}, with inverse X = ψ(Y) = Y/2; hence:

P(Y = y) = P(X = y/2) = (λ^{y/2} e^{-λ})/(y/2)!;  y = 0, 2, 4, 6, ...    (6.9)

so that:

f_Y(y) = (λ^{y/2} e^{-λ})/(y/2)!;  y = 0, 2, 4, 6, ...    (6.10)
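A quick simulation check of this result (an illustrative sketch, not part of the original text; λ = 3 is an arbitrary choice):

    import numpy as np
    from math import exp, factorial

    lam = 3.0
    rng = np.random.default_rng(0)
    y = 2 * rng.poisson(lam, size=200_000)    # the transformation Y = 2X

    # Compare the empirical P(Y = 4) with Eq (6.10), where y/2 = 2
    print((y == 4).mean())                    # simulated estimate
    print(lam**2 * exp(-lam) / factorial(2))  # theoretical value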
A Practical Application
The number of times, X, that each cell in a cell culture divides in a time interval of length t is a random variable whose specific value depends on many factors, both intrinsic (e.g., individual cell characteristics) and extrinsic; its pdf is Poisson with intensity λt:

f_X(x) = ((λt)^x e^{-(λt)})/x!;  x = 0, 1, 2, 3, ...    (6.11)

Since a cell that divides x times gives rise to 2^x cells, the number of cells, Y, in the culture (per original cell) after a time interval t is given by the transformation:

Y = 2^X    (6.12)

a one-to-one transformation with inverse X = ψ(Y) = log_2 Y, so that, from Eq (6.5) (writing λ in place of λt for notational simplicity):

f_Y(y) = (e^{-λ} λ^{log_2 y})/(log_2 y)!;  y = 1, 2, 4, 8, ...    (6.18)

It is possible to confirm that the pdf obtained in Eq (6.18) for Y, the number of cells in the culture after a time interval t, is a valid pdf for which:

\sum_y f_Y(y) = 1    (6.19)

Substituting y = 2^x reduces the sum to the familiar exponential series:

\sum_y f_Y(y) = e^{-λ} (1 + λ + λ²/2! + λ³/3! + ...) = e^{-λ} e^{λ} = 1    (6.20)

The mean number of cells in the culture after time t, E[Y], can be shown (see end-of-chapter Exercise 6.2) to be:

E[Y] = e^{λ}    (6.21)
6.2.2
Continuous Case
For a continuous random variable X and a strictly monotonic increasing transformation φ, the cdf of Y is:

F_Y(y) = P(Y ≤ y) = P(X ≤ ψ(y)) = F_X[ψ(y)]    (6.24)

so that, upon differentiation:

f_Y(y) = dF_Y(y)/dy = d/dy {F_X[ψ(y)]} = f_X[ψ(y)] d/dy{ψ(y)}    (6.25)

with the derivative on the RHS positive for a strictly monotonic increasing function. It can be shown that if φ were monotonically decreasing, the expression in (6.24) will yield:

f_Y(y) = -f_X[ψ(y)] d/dy{ψ(y)}    (6.26)

with the derivative on the RHS as a negative quantity. Both results may be combined into one as

f_Y(y) = f_X[ψ(y)] |d/dy{ψ(y)}|    (6.27)

as presented in Eq (6.23). Let us illustrate this with another example.
Example 6.2 LOG TRANSFORMATION OF A UNIFORM
RANDOM VARIABLE
The random variable X with the following pdf:

f_X(x) = 1;  0 < x < 1
f_X(x) = 0;  otherwise    (6.28)

is identified in Part III as the uniform random variable. Determine the pdf for the random variable Y obtained via the transformation:

Y = -β ln X    (6.29)

Solution:
The transformation is one-to-one, maps V_X = {0 < x < 1} onto V_Y = {0 < y < ∞}, and the inverse transformation is given by:

X = ψ(y) = e^{-y/β};  0 < y < ∞    (6.30)

so that:

dψ/dy = -(1/β) e^{-y/β}    (6.31)

From Eq (6.27), therefore:

f_Y(y) = (1/β) e^{-y/β};  0 < y < ∞
f_Y(y) = 0;  otherwise    (6.32)
These two random variables and their corresponding models are discussed
more fully in Part III.
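This transformation is, in fact, a standard way of generating exponentially distributed random numbers from uniform ones; a brief simulation check follows (an illustrative sketch, not part of the original text; β = 2 is chosen arbitrarily):

    import numpy as np

    beta = 2.0
    rng = np.random.default_rng(0)
    u = rng.uniform(size=200_000)
    y = -beta * np.log(u)     # the transformation of Eq (6.29)

    print(y.mean())           # should approach beta = 2
    print(y.var())            # should approach beta**2 = 4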
6.2.3

When the transformation y = φ(x) is not one-to-one, but has a countable number of roots, k, i.e.,

x_i = φ_i^{-1}(y) = ψ_i(y);  i = 1, 2, ..., k    (6.33)

the single-root result of Eq (6.27) generalizes to a sum over all the roots:

f_Y(y) = \sum_{i=1}^{k} f_X[ψ_i(y)] |d/dy{ψ_i(y)}|    (6.34)

or, more compactly, with J_i = d/dy{ψ_i(y)}:

f_Y(y) = \sum_{i=1}^{k} f_X[ψ_i(y)] |J_i|    (6.35)
Example 6.3 THE SQUARE OF A STANDARD NORMAL RANDOM VARIABLE
The random variable X has the pdf:

f_X(x) = (1/√(2π)) e^{-x²/2};  -∞ < x < ∞    (6.36)

Determine the pdf for the random variable Y obtained via the transformation:

y = x²    (6.37)

Solution:
Observe that this transformation, which maps the space V_X = {-∞ < x < ∞} onto V_Y = {0 < y < ∞}, is not one-to-one; for all y > 0 there are two x's corresponding to each y, since the inverse transformation is given by:

x = ±√y    (6.38)

The transformation thus has 2 roots for x:

x_1 = ψ_1(y) = √y;  x_2 = ψ_2(y) = -√y    (6.39)

with

|J_1| = |J_2| = 1/(2√y)    (6.40)

so that, from Eq (6.35):

f_Y(y) = (1/√(2π)) e^{-y/2} y^{-1/2};  0 < y < ∞    (6.41)
6.2.4
Let us consider first the case where the random variable transformation involves the sum of two independent random variables, i.e.,

Y = φ(X_1, X_2) = X_1 + X_2    (6.42)

where f_1(x_1) and f_2(x_2) are, respectively, the known pdfs of the random variables X_1 and X_2. Two approaches are typically employed in finding the desired f_Y(y):

• The cumulative distribution function approach;
• The characteristic function approach.
By definition of the cdf:

F_Y(y) = P(Y ≤ y) = \int\int_{V_Y} f(x_1, x_2) dx_1 dx_2    (6.43)

where f(x_1, x_2) is the joint pdf of X_1 and X_2, and, most importantly, the region over which the double integration is being carried out, V_Y, is given by:

V_Y = {(x_1, x_2) : x_1 + x_2 ≤ y}    (6.44)

as shown in Fig 6.1.

FIGURE 6.1: Region of interest, V_Y, for computing the cdf of the random variable Y defined as a sum of 2 independent random variables X_1 and X_2

Observe from this figure that the integration may be carried out several different ways: if we integrate first with respect to x_1, the limits go from -∞ until we reach the line, at which point x_1 = y - x_2; we then integrate with respect to x_2 from -∞ to ∞. In this case, Eq (6.43) becomes:

F_Y(y) = \int_{-∞}^{∞} \int_{-∞}^{y - x_2} f(x_1, x_2) dx_1 dx_2    (6.45)

Since X_1 and X_2 are independent, f(x_1, x_2) = f_1(x_1) f_2(x_2), and the inner integral becomes the cdf F_1 evaluated at y - x_2:

F_Y(y) = \int_{-∞}^{∞} F_1(y - x_2) f_2(x_2) dx_2    (6.46)

Differentiating with respect to y then gives:

f_Y(y) = \int_{-∞}^{∞} f_1(y - x_2) f_2(x_2) dx_2    (6.47)

If, instead, the integration in Eq (6.43) had been done first with respect to x_2 and then with respect to x_1, the resulting differentiation would have resulted in the alternative, and entirely equivalent, expression:

f_Y(y) = \int_{-∞}^{∞} f_1(x_1) f_2(y - x_1) dx_1    (6.48)
Integrals of this nature are known as convolutions of the functions f1 (x1 ) and
f2 (x2 ) and this is as far as we can go with a general discussion.
Thus, we have the general result that the pdf of the random variable
Y obtained as a sum of two independent random variables X1 and X2 is a
convolution of the two contributing pdfs f1 (x1 ) and f2 (x2 ) as shown in Eqs
(6.47) and (6.48).
Let us illustrate this with a classic example.
Example 6.4 THE SUM OF TWO EXPONENTIAL RANDOM VARIABLES
Given two stochastically independent random variables X1 and X2 with
pdfs:
f_1(x_1) = (1/β) e^{-x_1/β};  0 < x_1 < ∞    (6.49)

f_2(x_2) = (1/β) e^{-x_2/β};  0 < x_2 < ∞    (6.50)

determine the pdf of the random variable Y = X_1 + X_2 obtained as their sum.

Solution:
From Eq (6.48), restricted to 0 < x_1 < y (since f_1 vanishes for x_1 < 0 and f_2 vanishes for x_2 = y - x_1 < 0):

f_Y(y) = \int_0^y (1/β) e^{-x_1/β} (1/β) e^{-(y - x_1)/β} dx_1 = (1/β²) e^{-y/β} \int_0^y dx_1    (6.51)

which yields:

f_Y(y) = (1/β²) y e^{-y/β};  0 < y < ∞    (6.53)
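The convolution in Eq (6.51) can also be verified numerically on a grid (an illustrative sketch, not part of the original text; the grid spacing and range are arbitrary choices):

    import numpy as np

    beta, dx = 1.5, 0.01
    x = np.arange(0.0, 30.0, dx)
    f = (1 / beta) * np.exp(-x / beta)         # common exponential pdf

    # Discrete convolution approximating Eq (6.48)
    fy = np.convolve(f, f)[: len(x)] * dx
    theory = x * np.exp(-x / beta) / beta**2   # Eq (6.53)
    print(np.max(np.abs(fy - theory)))         # small discretization error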
Observe that the result presented above for the sum of two random variables extends directly to the sum of more than two random variables by successive additions. However, this procedure becomes rapidly more tedious as we
must carry out repeated convolution integrals over increasingly more complex
regions.
The characteristic function approach rests on the result that, for stochastically independent random variables X_1, X_2, ..., X_n with characteristic functions φ_{X_i}(t), the sum

Y = X_1 + X_2 + ... + X_n    (6.54)

has the characteristic function

φ_Y(t) = E[e^{jtY}] = \prod_{i=1}^{n} φ_{X_i}(t)    (6.55)

The utility of this result lies in the fact that φ_Y(t) is easily obtained from each contributing φ_{X_i}(t); the desired f_Y(y) is then recovered from φ_Y(t) either by inspection (when this is obvious), or else by the inversion formula presented in Chapter 4.
Let us illustrate this with the same example used above.
Example 6.5 THE SUM OF TWO EXPONENTIAL RANDOM VARIABLES REVISITED
Using characteristic functions, determine the pdf of the random variable
Y = X1 + X2 , where the pdfs of the two stochastically independent random variables X1 and X2 are as given in Example 6.4 above and their
characteristic functions are given as:

φ_{X_1}(t) = φ_{X_2}(t) = 1/(1 - jβt)    (6.57)

Solution:
From Eq (6.55), the required characteristic function for the sum is:

φ_Y(t) = 1/(1 - jβt)²    (6.58)

At this point, anyone familiar with specific random variable pdfs and their characteristic functions will recognize this particular form right away: it is the pdf of a gamma random variable, specifically γ(2, β), as Chapter 9 shows. However, since we have not yet introduced these important random variables, their pdfs and characteristic functions (see Chapter 9), we therefore do not expect the reader to be able to deduce the pdf corresponding to φ_Y(t) above by inspection. In this case we can invoke the inversion formula of Chapter 4 to obtain:

f_Y(y) = (1/2π) \int_{-∞}^{∞} e^{-jyt} φ_Y(t) dt = (1/2π) \int_{-∞}^{∞} e^{-jyt}/(1 - jβt)² dt    (6.59)
Upon carrying out the indicated integral, we obtain the final result:

f_Y(y) = (1/β²) y e^{-y/β};  0 < y < ∞    (6.60)
Example 6.6 THE REPRODUCTIVE PROPERTY OF THE GAMMA RANDOM VARIABLE
The n stochastically independent random variables X_i each have a gamma pdf:

f(x) = (1/(β^{α_i} Γ(α_i))) e^{-x/β} x^{α_i - 1};  0 < x < ∞    (6.61)

with characteristic function:

φ_{X_i}(t) = 1/(1 - jβt)^{α_i}    (6.62)

Find the pdf of the random variable Y defined as the sum of the n independent such random variables, X_i, each with different parameters α_i but with the same parameter β.
Solution:
The desired transformation is

Y = \sum_{i=1}^{n} X_i    (6.63)

so that, from the characteristic function result for sums of independent random variables:

φ_Y(t) = \prod_{i=1}^{n} φ_{X_i}(t) = 1/(1 - jβt)^{α*}    (6.64)

where α* = \sum_{i=1}^{n} α_i. Now, by comparing Eq (6.62) with Eq (6.64), we see immediately the important result that Y is also a gamma random variable, with parameters α* and β. Thus, this sum of gamma random variables begets another gamma random variable, a result generally
known as the reproductive property of the gamma random variable.
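The reproductive property is easy to observe by simulation (an illustrative sketch, not part of the original text; the α_i and β values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    beta, alphas = 2.0, [0.5, 1.0, 2.5]
    y = sum(rng.gamma(a, beta, size=200_000) for a in alphas)

    a_star = sum(alphas)              # alpha* = 4.0
    print(y.mean(), a_star * beta)    # sample mean vs. gamma mean alpha* x beta
    print(y.var(), a_star * beta**2)  # sample variance vs. alpha* x beta**2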
6.3 Bivariate Transformations

Consider now the bivariate random variable X = (X_1, X_2) with joint pdf f_X(x_1, x_2), and the transformation:

Y_1 = φ_1(X_1, X_2);  Y_2 = φ_2(X_1, X_2)    (6.65)

written compactly as:

Y = φ(X)    (6.66)

When the transformation is one-to-one, the inverse transformation,

X_1 = ψ_1(Y_1, Y_2);  X_2 = ψ_2(Y_1, Y_2)    (6.67)

or, compactly,

X = ψ(Y)    (6.68)

exists, with the Jacobian of the inverse transformation defined as the determinant:

J = det [ ∂x_1/∂y_1   ∂x_1/∂y_2 ]
        [ ∂x_2/∂y_1   ∂x_2/∂y_2 ]    (6.69)

Provided J does not vanish anywhere in V_Y, the desired joint pdf is:

f_Y(y_1, y_2) = f_X[ψ_1(y), ψ_2(y)] |J|;  y in V_Y    (6.70)
Example 6.7 SUM AND RATIO OF TWO GAMMA RANDOM VARIABLES
Two stochastically independent random variables X_1 and X_2 have the pdfs:

f_1(x_1) = (1/Γ(α)) x_1^{α-1} e^{-x_1};  0 < x_1 < ∞    (6.71)

f_2(x_2) = (1/Γ(β)) x_2^{β-1} e^{-x_2};  0 < x_2 < ∞    (6.72)

Determine both the joint and the marginal pdfs for the two random variables Y_1 and Y_2 obtained via the transformation:

Y_1 = X_1 + X_2;  Y_2 = X_1/(X_1 + X_2)    (6.73)

Solution:
First, by independence, the joint pdf for X_1 and X_2 is:

f_X(x_1, x_2) = (1/(Γ(α)Γ(β))) x_1^{α-1} x_2^{β-1} e^{-x_1} e^{-x_2};  0 < x_1 < ∞; 0 < x_2 < ∞    (6.74)
Next, observe that the transformation in Eq (6.73) is a one-to-one mapping of V_X, the positive quadrant of the (x_1, x_2) plane, onto V_Y = {(y_1, y_2): 0 < y_1 < ∞, 0 < y_2 < 1}; the inverse transformation is given by:

x_1 = y_1 y_2;  x_2 = y_1(1 - y_2)    (6.75)

with Jacobian:

J = det [ y_2        y_1 ]
        [ 1 - y_2   -y_1 ]  = -y_1;  |J| = y_1    (6.76)

And now, from Eq (6.70), the joint pdf for Y_1 and Y_2 is:

f_Y(y_1, y_2) = (1/(Γ(α)Γ(β))) (y_1 y_2)^{α-1} [y_1(1 - y_2)]^{β-1} e^{-y_1} y_1;  0 < y_1 < ∞; 0 < y_2 < 1
f_Y(y_1, y_2) = 0;  otherwise    (6.77)
This may be rearranged to give:

f_Y(y_1, y_2) = (1/(Γ(α)Γ(β))) y_2^{α-1} (1 - y_2)^{β-1} e^{-y_1} y_1^{α+β-1};  0 < y_1 < ∞; 0 < y_2 < 1
f_Y(y_1, y_2) = 0;  otherwise    (6.78)

an equation which, apart from the constant, factors out into separate and distinct functions of y_1 and y_2, indicating that the random variables Y_1 and Y_2 are independent.

By definition, the marginal pdf for Y_2 is obtained by integrating out y_1 in Eq (6.78) to obtain

f_2(y_2) = (y_2^{α-1} (1 - y_2)^{β-1}/(Γ(α)Γ(β))) \int_0^∞ e^{-y_1} y_1^{α+β-1} dy_1    (6.79)

Recognizing the integral as the gamma function, i.e.,

Γ(a) = \int_0^∞ e^{-y} y^{a-1} dy    (6.80)
(6.80)
184
Random Phenomena
we obtain:
f2 (y2 ) =
( + ) 1
y
(1 y2 )1 ; 0 < y2 < 1
()() 2
(6.81)
Since, by independence,
fY (y1 , y2 ) = f1 (y1 )f2 (y2 )
(6.82)
it follows from Eqs (6.78), (6.71) or (6.72), and Eq (15.82) that the
marginal pdf for Y1 is given by:
f1 (y1 ) =
1
ey1 y1+1 ; 0 < y1 <
( + )
(6.83)
6.4

The general multivariate extension involves the n-dimensional random variable X = (X_1, X_2, ..., X_n) and the m transformations:

Y_1 = φ_1(X_1, X_2, ..., X_n);
Y_2 = φ_2(X_1, X_2, ..., X_n);
...
Y_m = φ_m(X_1, X_2, ..., X_n)    (6.84)

6.4.1 Square Transformations

When m = n (a "square" transformation) and the transformation is one-to-one, the inverse transformation exists:

x_1 = ψ_1(y_1, y_2, ..., y_n);
x_2 = ψ_2(y_1, y_2, ..., y_n);
...
x_n = ψ_n(y_1, y_2, ..., y_n)    (6.85)

with the Jacobian defined as the n × n determinant:

J = det [ ∂x_1/∂y_1  ∂x_1/∂y_2  ...  ∂x_1/∂y_n ]
        [ ∂x_2/∂y_1  ∂x_2/∂y_2  ...  ∂x_2/∂y_n ]
        [    ...        ...     ...     ...    ]
        [ ∂x_n/∂y_1  ∂x_n/∂y_2  ...  ∂x_n/∂y_n ]    (6.86)

And now, as in the bivariate case, it can be shown that for a J that is non-zero anywhere in V_Y, the desired joint pdf for Y is given by:

f_Y(y) = f_X[ψ(y)] |J|;  y in V_Y    (6.87)

an expression that is identical in every way to Eq (6.70) except for the dimensionality, and similar to the single variate result in Eq (6.23). Thus for the square transformation in which n = m, the required result is a direct generalization of the bivariate result, identical in structure, differing only in dimensionality.
6.4.2
Non-Square Transformations
Example 6.8 THE SUM OF TWO STANDARD NORMAL RANDOM VARIABLES
Two stochastically independent random variables X_1 and X_2 have the standard normal pdfs:

f_1(x_1) = (1/√(2π)) e^{-x_1²/2};  -∞ < x_1 < ∞    (6.88)

f_2(x_2) = (1/√(2π)) e^{-x_2²/2};  -∞ < x_2 < ∞    (6.89)

determine the pdf of the random variable Y obtained from their sum,

Y = X_1 + X_2    (6.90)
Solution:
First, observe that even though this is a sum, so that we could invoke earlier results to handle this problem, Eq (6.90) is also an underdetermined transformation from two dimensions in X_1 and X_2 to one in Y. To "square" the transformation, let the variable in Eq (6.90) now be Y_1 and add another one, say Y_2 = X_1 - X_2, to give:

Y_1 = X_1 + X_2;  Y_2 = X_1 - X_2    (6.91)

The inverse transformation is:

x_1 = (y_1 + y_2)/2;  x_2 = (y_1 - y_2)/2    (6.92)

By independence, the joint pdf for X_1 and X_2 is:

f_X(x_1, x_2) = (1/(2π)) e^{-(x_1² + x_2²)/2}    (6.93)

and the Jacobian of the inverse transformation is:

J = det [ 1/2   1/2 ]
        [ 1/2  -1/2 ]  = -1/2;  |J| = 1/2    (6.94)

Since x_1² + x_2² = (y_1² + y_2²)/2, Eq (6.87) gives the joint pdf for Y_1 and Y_2 as:

f_Y(y_1, y_2) = (1/2)(1/(2π)) e^{-y_1²/4} e^{-y_2²/4}    (6.95)

And now, either by inspection (this is a product of two clearly identifiable, separate and distinct functions of y_1 and y_2, indicating that the two variables are independent), or by integrating out y_2 in Eq (6.95), one easily obtains the required marginal pdf for Y_1 as:

f_1(y_1) = (1/(2√π)) e^{-y_1²/4};  -∞ < y_1 < ∞    (6.96)
In the next example we derive one more important result and illustrate the
seriousness of the requirement that the Jacobian of the inverse transformation
not vanish anywhere in VY .
Example 6.9 THE RATIO OF TWO STANDARD NORMAL RANDOM VARIABLES
Two stochastically independent random variables X_1 and X_2 have the standard normal pdfs:

f_1(x_1) = (1/√(2π)) e^{-x_1²/2};  -∞ < x_1 < ∞    (6.97)

f_2(x_2) = (1/√(2π)) e^{-x_2²/2};  -∞ < x_2 < ∞    (6.98)

determine the pdf of the random variable Y obtained from their ratio,

Y = X_1/X_2    (6.99)

Solution:
Again, because this is an underdetermined transformation, we must first augment it with another one, say Y_2 = X_2, to give:

Y_1 = X_1/X_2;  Y_2 = X_2    (6.100)
The inverse transformation is x_1 = y_1 y_2; x_2 = y_2, with Jacobian:

J = det [ ∂x_1/∂y_1  ∂x_1/∂y_2 ] = det [ y_2  y_1 ]
        [ ∂x_2/∂y_1  ∂x_2/∂y_2 ]       [ 0    1  ]    (6.101)

= y_2    (6.102)

By independence, the joint pdf for X_1 and X_2 is:

f_X(x_1, x_2) = (1/(2π)) e^{-(x_1² + x_2²)/2}    (6.103)

from where we now obtain the joint pdf for Y_1 and Y_2 as:

f_Y(y_1, y_2) = (1/(2π)) |y_2| e^{-(y_1² y_2² + y_2²)/2};  -∞ < y_1 < ∞;  -∞ < y_2 < ∞, y_2 ≠ 0    (6.104)

The careful reader will notice two things: (i) the expression for f_Y involves not just y_2, but its absolute value |y_2|; and (ii) that we have excluded the troublesome point y_2 = 0 from the space V_Y. These two points are related: to the left of the point y_2 = 0, |y_2| = -y_2; to the right, |y_2| = y_2, so that these two regions must be treated differently in evaluating the integral.
To obtain the marginal pdf for y_1 we now integrate out y_2 in Eq (6.104) over the appropriate region in V_Y as follows:

f_1(y_1) = (1/(2π)) [ \int_{-∞}^{0} (-y_2) e^{-(y_1² + 1)y_2²/2} dy_2 + \int_{0}^{∞} y_2 e^{-(y_1² + 1)y_2²/2} dy_2 ]    (6.105)

which simplifies to:

f_1(y_1) = 1/[π(1 + y_1²)];  -∞ < y_1 < ∞    (6.106)

as the required pdf. It is important to note that in carrying out the integration implied in (6.105), the nature of the absolute value function, |y_2|, naturally forced us to exclude the point y_2 = 0 because it made it impossible for us to carry out the integration from -∞ to ∞ under a single integral. (Had the integral involved not |y_2|, but y_2, as an instructive exercise, the reader should try to evaluate the resulting integral from -∞ to ∞. See Exercise 6.9.)
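The heavy-tailed character of this (Cauchy) pdf is readily seen by simulation; the sketch below (illustrative, not part of the original text) compares the empirical cdf of a simulated ratio with the cdf implied by Eq (6.106), F(y_1) = 1/2 + arctan(y_1)/π:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(size=100_000) / rng.normal(size=100_000)

    for q in [-1.0, 0.0, 1.0]:
        # empirical cdf vs. the Cauchy cdf from Eq (6.106)
        print((y <= q).mean(), 0.5 + np.arctan(q) / np.pi)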
6.4.3
Non-Monotone Transformations
In general, when the multivariate transformation y = φ(x) may be non-monotone but has a countable number of roots k, when written as the matrix version of Eq (6.33), i.e.,

x_i = φ_i^{-1}(y) = ψ_i(y);  i = 1, 2, 3, ..., k    (6.107)

the desired joint pdf is obtained, as in Eq (6.35), by summing over all k roots:

f_Y(y) = \sum_{i=1}^{k} f_X[ψ_i(y)] |J_i|    (6.108)
6.5 Summary and Conclusions

We have focussed attention in this chapter on the single problem of determining the pdf, f_Y(y), of a random variable Y that has been defined as a function of another random variable, X, whose pdf f_X(x) is known. As is common with problems of such general construct, the approach used to determine the desired pdf depends on the nature of the random variable, as well as the nature of the problem itself: in this particular case, the problem
REVIEW QUESTIONS

1. State, in mathematical terms, the problem of primary interest in this chapter.

2. What are the results of this chapter useful for?

3. In single variable transformations, where Y = φ(X) is given along with f_X(x), and f_Y(y) is to be determined, what is the difference between the discrete case of this problem and the continuous counterpart?
EXERCISES
6.1 The pdf of a random variable X is given as:

f(x) = p(1 - p)^{x-1};  x = 1, 2, 3, ...    (6.109)

(i) Determine the pdf of the random variable Y defined as:

Y = 1/X    (6.110)

(ii) Given that E(X) = 1/p, obtain E(Y) and compare it to E(X).

6.2 Given the pdf shown in Eq (6.18) for the transformed variable, Y, i.e.,

f_Y(y) = (e^{-λ} λ^{log_2 y})/(log_2 y)!;  y = 1, 2, 4, 8, ...

show that E[Y] = e^{λ}, as stated in Eq (6.21) in the text.

6.3 Given a random variable X with the pdf:

f_X(x) = (1/β) e^{-x/β};  0 < x < ∞
f_X(x) = 0;  elsewhere    (6.111)

Determine the pdf for the random variable Y obtained via the transformation

Y = e^{-X/β}    (6.112)

Compare this result to the one obtained in Example 6.2 in the text.
6.4 Given a random variable X with the following pdf:

f_X(x) = (1/2)(x + 1);  -1 < x < 1
f_X(x) = 0;  elsewhere    (6.113)

(i) Determine the pdf for the random variable Y obtained via the transformation

Y = X²    (6.114)

6.5 Two stochastically independent Poisson random variables X_1 and X_2 have pdfs of the form given in Eq (6.115), with parameters λ_1 and λ_2 respectively, and characteristic functions:

φ_{X_i}(t) = e^{[λ_i(e^{jt} - 1)]}    (6.116)

(i) Obtain the pdf f_Y(y) of the random variable Y defined as the sum of these two random variables, i.e.,

Y = X_1 + X_2

(ii) Extend the result to a sum of n such random variables, i.e.,

Y = X_1 + X_2 + ... + X_n

with each distribution given in Eq (6.115). Hence, establish that the Poisson random variable X also possesses the "reproductive property" illustrated in Example 6.6 in the text.
(iii) Obtain the pdf f_Z(z) of the random variable Z defined as the average of n such random variables, i.e.,

Z = (1/n)(X_1 + X_2 + ... + X_n)
6.6 In Example 6.3 in the text, it was established that if the random variable X has the following pdf:

f_X(x) = (1/√(2π)) e^{-x²/2};  -∞ < x < ∞    (6.117)

then the pdf for the random variable Y = X² is:

f_Y(y) = (1/√(2π)) e^{-y/2} y^{-1/2};  0 < y < ∞    (6.118)

with the characteristic function:

φ_Y(t) = 1/(1 - j2t)^{1/2}    (6.119)

By re-writing √2 as 2^{1/2}, and √π as Γ(1/2) (or otherwise), obtain the pdf f_Z(z) of the random variable defined as:

Z = X_1² + X_2² + ... + X_r²    (6.120)

where the random variables, X_i, are all mutually stochastically independent, and each has the distribution shown in Eq (6.117).
6.7 Revisit Example 6.8 in the text, but this time, instead of Eq (6.91), use the following alternative "squaring" transformation,

Y_1 = X_1 + X_2;  Y_2 = X_2    (6.121)

and show that the resulting marginal pdf for Y_1 is identical to the result obtained in Eq (6.96).
APPLICATION PROBLEMS
6.10 In a commercial process for manufacturing the extruded polymer film Mylar®, each roll of the product is characterized in terms of its "gage," the film thickness, X. For a series of rolls that meet the desired mean thickness target of 350 μm, the thickness of a section of film sampled randomly from a particular roll has the pdf

f(x) = (1/(σ_i √(2π))) exp[-(x - 350)²/(2σ_i²)]    (6.123)

where σ_i² is the variance associated with the average thickness for each roll, i. In reality, the product property that is of importance to the end-user is not so much the film thickness, or even the average film thickness, but a roll-to-roll "consistency," quantified in terms of a relative thickness variability measure defined as

Y = [(X - 350)/σ_i]²    (6.124)

Obtain the pdf f_Y(y) that is used to characterize the roll-to-roll variability observed in this product quality variable.
6.11 Consider an experimental, electronically controlled, mechanical tennis ball launcher designed to be used to train tennis players. One such machine is positioned at a fixed launch point, L, located a distance of 1 m from a wall as shown in Fig 6.2. The launch mechanism is programmed to launch the ball in an essentially straight line, at an angle Θ that varies randomly according to the pdf:

f(θ) = c;  -π/2 < θ < π/2
f(θ) = 0;  elsewhere    (6.125)

where c is a constant. The point of impact on the wall, at a distance y from the center, will therefore be a random variable whose specific value depends on θ.

FIGURE 6.2: Schematic diagram of the tennis ball launcher of Problem 6.11

First show that c = 1/π, and then obtain f_Y(y).
6.12 The distribution of residence times in a single continuous stirred tank reactor (CSTR), whose volume is V liters and through which reactants flow at the rate F liters/hr, was established in Chapter 2 as the pdf:

f(x) = (1/τ) e^{-x/τ};  0 < x < ∞    (6.126)

where τ = V/F.
(i) Find the pdf f_Y(y) of the residence time, Y, in a reactor that is 5 times as large, given that in this case,

Y = 5X    (6.127)

(ii) Find the pdf f_Z(z) of the residence time, Z, in an ensemble of 5 reactors in series, given that:

Z = X_1 + X_2 + ... + X_5    (6.128)

where each reactor's pdf is as given in Eq (6.126), with parameter τ_i; i = 1, 2, ..., 5. (Hint: Use the results of Examples 6.5 and 6.6.)
(iii) Show that even if τ_1 = τ_2 = ... = τ_5 = τ for the ensemble of 5 reactors in series, f_Z(z) will still not be the same as f_Y(y).
6.13 The total number of flaws (dents, scratches, paint blisters, etc.) found on the various sets of doors installed on brand new minivans in an assembly plant is a random variable with the pdf:

f(x) = (e^{-λ} λ^x)/x!;  x = 0, 1, 2, ...    (6.129)

The value of the pdf parameter, λ, depends on the door in question as follows: λ = 0.5 for the driver and front passenger doors; λ = 0.75 for the two bigger mid-section passenger doors, and λ = 1.0 for the fifth, rear trunk/tailgate door. If the total number of flaws per completely assembled minivan is Y, obtain the pdf f_Y(y) and from it, compute the probability of assembling a minivan with more than a total number of 2 flaws on all its doors.
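By the reproductive property of the Poisson random variable (Exercise 6.5), Y here is Poisson with parameter equal to the sum of the five λ's; a numerical sketch of the final computation (illustrative only; it presumes that result):

    from math import exp

    lam = 0.5 + 0.5 + 0.75 + 0.75 + 1.0        # sum over the five doors: 3.5
    p_more_than_2 = 1 - exp(-lam) * (1 + lam + lam**2 / 2)
    print(p_more_than_2)                       # P(Y > 2)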
6.14 Let the fluorescence signals obtained from a test spot and the reference spot on a microarray be the random variables X_1 and X_2, respectively, with the pdfs:

f_1(x_1) = (1/Γ(α)) x_1^{α-1} e^{-x_1};  0 < x_1 < ∞    (6.130)

f_2(x_2) = (1/Γ(β)) x_2^{β-1} e^{-x_2};  0 < x_2 < ∞    (6.131)

It is customary to analyze such microarray data in terms of the "fold change" ratio,

Y = X_1/X_2    (6.132)

indicative of the "fold" increase (or decrease) in the signal intensity between test and reference conditions. Show that the pdf of Y is given by:

f(y) = (Γ(α+β)/(Γ(α)Γ(β))) y^{α-1}/(1 + y)^{α+β};  y > 0; α > 0; β > 0    (6.133)
6.15 A temperature measurement device converts a voltage reading, V, to a temperature value, X (in °C), via the linear calibration equation:

X = 0.4V + 100    (6.134)

in a range from 50 to 500 volts and 100 to 250°C. If the voltage output is subject to random variability around the true value μ_V, such that

f(v) = (1/(σ_V √(2π))) exp[-(v - μ_V)²/(2σ_V²)]    (6.135)

where the mean (i.e., expected) value for voltage is E(V) = μ_V and the variance is Var(V) = σ_V²,
(i) Show that:

E(X) = 0.4 μ_V + 100    (6.136)

Var(X) = 0.16 σ_V²    (6.137)

(ii) In terms of E(X) = μ_X and Var(X) = σ_X², obtain an expression for the pdf f_X(x) representing the variability propagated to the temperature values.
6.16 "Propagation-of-errors" studies are concerned with determining how the errors from one variable are transmitted to another when the two variables are related according to a known expression. When the relationships are linear, it is often possible to obtain complete probability distribution functions for the dependent variable given the pdf for the independent variable (see Problem 6.15). When the relationships are nonlinear, closed form expressions are not always possible; in terms of general results, the best one can hope for are approximate expressions for the expected value and variance of the dependent variable, typically in a local region, upon linearizing the nonlinear expression. The following is an application of these principles.

One of the best known laws of bioenergetics, Kleiber's law, states that the "Resting Energy Expenditure" of an animal, Q_0 (essentially the animal's metabolic rate, in kcal/day), is proportional to M^{3/4}, where M is the animal's mass (in kg). Specifically for "mature" homeotherms, the expression is:

Q_0 = 70 M^{3/4}    (6.138)

If the mass of the animals in a population varies randomly around a mean value of 75 kg, first obtain the approximate linearized expression for Q_0 when M = 75 kg, and then determine E(Q_0) and Var(Q_0) for a population with σ_M = 12.5 kg under these conditions.
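A numerical sketch of the linearization step follows (illustrative only; it assumes, per the reconstruction above, a population mean of 75 kg and a standard deviation of 12.5 kg):

    # Linearize Q0 = 70 * M**0.75 about M* = 75 kg:
    #   Q0(M) ~ Q0(M*) + (dQ0/dM at M*) * (M - M*)
    m_star, sigma_m = 75.0, 12.5
    q0 = 70 * m_star**0.75                  # E(Q0) under the linear approximation
    slope = 70 * 0.75 * m_star**(-0.25)     # dQ0/dM evaluated at M*
    print(q0)                               # about 1784 kcal/day
    print((slope * sigma_m)**2)             # Var(Q0) = slope**2 * Var(M)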
Chapter 7
Application Case Studies I:
Probability
7.1  Introduction ............................................................. 198
7.2  Mendel and Heredity ...................................................... 199
     7.2.1  Background and Problem Definition ................................. 199
     7.2.2  Single Trait Experiments and Results .............................. 200
     7.2.3  Single trait analysis ............................................. 201
            The First Generation Traits ....................................... 203
            Probability and The Second Generation Traits ...................... 204
     7.2.4  Multiple Traits and Independence .................................. 205
            Pairwise Experiments .............................................. 205
     7.2.5  Subsequent Experiments and Conclusions ............................ 208
7.3  World War II Warship Tactical Response Under Attack ...................... 209
     7.3.1  Background and Problem Definition ................................. 209
     7.3.2  Approach and Results .............................................. 209
     7.3.3  Final Comments .................................................... 212
7.4  Summary and Conclusions .................................................. 212
To many scientists and engineers, a first encounter with the theory of probability in its modern axiomatic form often leaves the impression of a subject matter so abstract and esoteric in nature as to be entirely suited to nothing but the most contrived applications. Nothing could be further from the truth. In reality, the application of probability theory features prominently in many modern fields of study: from finance, economics, sociology and psychology to various branches of physics, chemistry, biology and engineering, providing a perfect illustration of the aphorism that "there is nothing so practical as a good theory."

This chapter showcases the applicability of probability theory through two specific case studies involving real-world problems whose practical importance can hardly be overstated. The first, Mendel's deduction of the laws of heredity (the basis for the modern science of genetics) shows how Mendel employed probability (and the concept of stochastic independence) to establish the principles underlying a phenomenon which, until then, was considered essentially unpredictable and hence not susceptible to systematic analysis. The second is from a now-declassified US Navy study during World War II and involves decision-making in the face of uncertainty, using past data. It
7.1
Introduction
The elegant, well-established and fruitful tree we now see as modern probability theory has roots that reach back to 16th and 17th century gamblers and the very real (and very practical) need for reliable solutions to numerous gambling problems. Referring to these gambling problems by the somewhat less morally questionable term "problems on games of chance," some of the most famous and most gifted mathematicians of the day devoted considerable energy first to solving specific problems (most notably the Italian mathematician, Cardano, in the 16th century), and later to developing the foundational basis for systematic mathematical analysis (most notably the Dutch scientist, Huygens, and the French mathematicians, Pascal and Fermat, in the 17th century). However, despite subsequent major contributions in the 18th century from the likes of Jakob Bernoulli (1654-1705) and Abraham de Moivre (1667-1754), it was not until the 19th century, with the publication in 1812 of Laplace's book, Théorie Analytique des Probabilités, that probability theory moved beyond the mathematical analysis of games of chance to become recognized as an important branch of mathematics in its own right, one with broader applications to other scientific and practical problems such as statistical mechanics and characterization of experimental error.

The final step in the ascent of probability theory was taken in the 20th century with the development of the axiomatic approach. First expounded in Kolmogorov's celebrated 1933 monograph (the English translation, Foundations of Probability Theory, was published in 1950), this approach, once and for all, provided a rigorous and mathematically precise definition of probability that sufficiently generalized the theory and formalized its applicability to a wide variety of random phenomena. Paradoxically, that probability theory in its current modern form enjoys applications in such diverse areas as actuarial science, economics, finance; genetics, medicine, psychology; engineering, manufacturing, and strategic military decisions, is attributable to Kolmogorov's rigorous theoretical and precise formalism. Thus, even though it would be considered overly hyperbolic today ("too much embroidery"), placed in its proper historic context, the following statement in Laplace's book is essentially true:

"It is remarkable that a science which began with the consideration of games of chance should have become the most important object of human knowledge."
The two example case studies presented here illustrate just how important
probability theory and its application have become since the time of Laplace.
7.2 Mendel and Heredity
Heredity, how traits are transmitted from parent to offspring, has always fascinated and puzzled mankind. This phenomenon, central to the propagation of life itself, with serious implications for the health, viability and survival of living organisms, remained mysterious and poorly understood until the ground-breaking work by Gregor Mendel (1822-1884). Mendel, an Augustinian monk, arrived at his amazing conclusions by studying variations in pea plants, using the garden of his monastery as his laboratory. As stated in an English translation of the original paper published in 1866¹, "... The object of the experiment was to observe these variations in the case of each pair of differentiating characters, and to deduce the law according to which they appear in successive generations." The experiments involved the careful cultivation and testing of nearly 30,000 plants, lasting almost 8 years, from 1856 to 1863. The experimental results, and their subsequent probabilistic analysis, paved the way for the modern science of genetics, but it was not recognized as such right away. The work, and its monumental import, languished in obscurity until the early 20th century when it was rediscovered and finally accorded its well-deserved recognition.

What follows is an abbreviated discussion of the essential elements of Mendel's work and the probabilistic reasoning that led to the elucidation of the mechanisms behind heredity and genetics.
7.2.1 Background and Problem Definition
"The value and utility of any experiment are determined by the fitness of the material to the purpose for which it is used, and thus in the case before us it cannot be immaterial what plants are subjected to experiment and in what manner such experiment is conducted."

So wrote Mendel in motivating his choice of pea plants as the subject of his now-famous set of experiments. The two primary factors that made pea plants an attractive choice are:

1. Relatively fast rates of reproduction; and

¹Mendel, Gregor, 1866. "Versuche über Pflanzenhybriden," Verhandlungen des naturforschenden Vereines in Brünn, Bd. IV für das Jahr 1865, Abhandlungen, 3-47; first translated into English by William Bateson in 1901 as "Experiments in Plant Hybridization"; see http://www.netspace.org./MendelWeb/.
2. Availability of many varieties, each producing definite and easy-to-characterize traits.

Before enumerating the specific traits that Mendel studied, it is important to note, in hindsight, that the choice of peas was remarkably fortuitous because the genetic structure of peas is now known to be relatively simple. A more complex genetic structure could have further obscured the fundamental principles with additional distracting details; and the deductive analysis required to derive general laws governing heredity and genetics from this specific set of results would have been far more difficult.

By tracking the following seven specific trait characteristics (with the variations manifested in each trait indicated in square brackets):

1. Seed Shape [Round/Wrinkled]
2. Seed Albumen Color [Yellow/Green]
3. Seed Coat (same as Flower) Color [Reddish/White]
4. Pod Form (or Texture) [Inflated/Constricted]
5. Unripe Pod (or Stalk) Color [Green/Yellow]
6. Flower Position (on the stem) [Axial/Terminal]
7. Stem Length [Tall/Dwarfed]
Mendel sought to answer the following specific questions:

1. How are these seven traits transmitted from parents to offspring, generation after generation?

2. Are there discernible patterns, and can they be generalized?

Our discussion here is limited to two out of the many sets of experiments in the original study:

1. Single trait experiments, in which individual traits and how they are transmitted from parent to offspring in subsequent generations are tracked one at a time;

2. Multiple trait experiments, in which several traits and their transmission are tracked simultaneously, specifically focusing on pairwise experiments involving two traits.
7.2.2 Single Trait Experiments and Results
7.2.3 Single trait analysis

TABLE 7.1: Summary of Mendel's single trait experiment results

Characteristics          1st Gen.   2nd Generation                 Proportion    Proportion    D:r
                                    Total    Dominant  Recessive   Dominant (D)  Recessive (r) Ratio
Seed Shape
(Round/Wrinkled)         Round      7,324    5,474     1,850       0.747         0.253         2.96:1
Seed Alb. Color
(Yellow/Green)           Yellow     8,023    6,022     2,001       0.751         0.249         3.01:1
Seed Coat/Flower
(Reddish/White)          Reddish    929      705       224         0.759         0.241         3.15:1
Pod Form
(Inflated/Constricted)   Inflated   1,181    882       299         0.747         0.253         2.95:1
Unripe Pod Color
(Green/Yellow)           Green      580      428       152         0.738         0.262         2.82:1
Flower Position
(Axial/Terminal)         Axial      858      651       207         0.759         0.241         3.14:1
Stem Length
(Tall/Dwarfed)           Tall       1,064    787       277         0.740         0.260         2.84:1
Totals                              19,959   14,949    5,010       0.749         0.251         2.98:1
A careful study of these results raises the following questions:

1. First generation uniformity: Why do all the first generation plants uniformly display only one of the two parental traits?

2. Second generation variety: How did the cross-fertilization of first generation plants (with only one trait uniformly on display) produce second generation plants displaying a variety entirely absent in the homogeneous first generation? Or, alternatively, how did the missing trait in the first generation reappear in the second?

3. Second generation visible trait composition: What law governs the apparently constant numerical ratio with which the original parental traits appear in the second generation? What is the "true" theoretical value of this numerical ratio?
To answer these questions and elucidate the principles governing single trait selection, Mendel developed the following concepts and demonstrated that they were consistent with his experimental data:

1. The concept of Hereditary Factors:

• The inheritance of each trait is determined by "units" or "factors" (now called genes); these factors do not amalgamate, but are passed on to offspring intact and unchanged;

• An individual has two sets of such units or factors, inheriting one set from each parent; thus each parent transmits only half of its hereditary factors to each offspring;

• Which of the two parental factors is inherited by an offspring is purely a matter of chance.

2. The concept of Dominance/Recessiveness:

• In heredity, one trait is always dominant over the other, this other trait being the recessive one;

• To "show up," a dominant trait needs only one trait factor from the parent; the recessive needs two;

• A trait may not show up in an individual but its factor can still be transmitted to the next generation.

Mendel's postulate was that if these concepts are true, then one must obtain the observed results; conversely, one will obtain these results only if these concepts are valid.
The First Generation Traits
To see how these concepts help resolve the first problem, consider first the specific case of the seed shape: Let the factors possessed by the pure round-shaped parent be represented as RR (each R representing one "round" trait factor); similarly, let the factors of the pure wrinkled-shaped parent be represented as ww. In cross-fertilizing the round-seed plants with the wrinkled-seed ones, each first generation hybrid will have factors that are either Rw or wR. And now, if the round trait is dominant over the wrinkled trait, then observe that the entire first generation will be all round, precisely as in Mendel's experiment.

In general, when a pure dominant trait with factors DD is cross-fertilized with a pure recessive trait with factors rr, the first generation hybrid will have factors Dr or rD, each one displaying uniformly the dominant trait, but carrying the recessive trait. The concepts of hereditary factors (genes) and of dominance thus enabled Mendel to resolve the problem of the uniform display of traits in the first generation; just as important, they also provided the foundation for elucidating the principles governing trait selection in the second and subsequent generations. This latter exercise is what would require probability theory.
Probability and The Second Generation Traits
The key to the second generation trait manifestation is a recognition that while each seed of the first generation plants looks like the dominant round-seed type in the parental generation, there are some fundamental, but invisible, differences: the parental generation has pure trait factors RR and ww; the first generation has two distinct trait factors: Rw (or wR), one visible (phenotype) because it is dominant, the other not visible but inherited nonetheless (genotype). The hereditary but otherwise invisible trait is the key.

To analyze the composition of the second generation, the following is a modernization of the probabilistic arguments Mendel used. First note that the collection of all possible outcomes when cross-fertilizing two plants, each with a trait factor set Rw, is given by:
Ω = {RR, Rw, wR, ww}    (7.1)

If each of these four outcomes is equally likely, and if we define the phenotypic manifestation random variable X to take the value 0 for the recessive (wrinkled) phenotype and 1 for the dominant (round) phenotype for each trait combination, then from VX and its pre-image in Ω, Eq. (7.1), the probability distribution function of the phenotypic manifestation random variable, X, is given by:

P(X = 0) = 1/4    (7.4)
P(X = 1) = 3/4    (7.5)
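The logic of this enumeration is easy to verify computationally. The following minimal Python sketch (our own illustration; the names are not from the text) reproduces the sample space of Eq (7.1) and the phenotype probabilities of Eqs (7.4) and (7.5):

    # Sketch: enumerate the single-trait cross of two Rw hybrids.
    from itertools import product

    omega = [a + b for a, b in product("Rw", "Rw")]
    print(omega)                        # ['RR', 'Rw', 'wR', 'ww'], as in Eq (7.1)

    # Phenotype X = 1 (round) whenever at least one dominant factor R is present
    p_dominant = sum("R" in g for g in omega) / len(omega)
    print(p_dominant, 1 - p_dominant)   # 0.75 0.25, the theoretical 3:1 ratio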
7.2.4 Multiple Traits
The discussion thus far has been concerned with single traits and the principles governing their hereditary transmission. Mendel's next task was to determine whether these principles applied equally to trait pairs, and then in general "when several diverse characters are united in the hybrid by crossing." The key question to be answered (does the transmission of one trait interfere with another, or are they wholly independent?) required a series of carefully designed experiments on a large number of plants.
Pairwise Experiments
The first category of multiple trait experiments involved cross-fertilization of plants in which the differentiating characteristics were considered in pairs. For the purpose of illustration, we will consider here only the very first in this series, in which the parental plants differed in seed shape and seed albumen color. Specifically, the seed plants were round and yellow (R,Y), while the pollen plants were wrinkled and green (w,g). (To eliminate any possible systematic "pollen" or "seed" effect, Mendel also performed a companion series of
experiments in which the roles of seed and pollen were reversed.) The specific question the experiments were designed to answer is this: will the transmission of the shape trait interfere with color, or will they be independent?

As in the single trait experiments, the first generation of hybrids were obtained by cross-fertilizing the pure round-and-yellow seed plants with pure wrinkled-and-green ones; the second generation plants were obtained by cross-fertilizing first generation plants, and so on, with each succeeding generation similarly obtained from the immediately preceding one.

First generation results: The first generation of fertilized seeds (f1) were all round and yellow like the seed parents. These results are definitely reminiscent of the single trait experiments and appeared to confirm that the principle of dominance extended to pairwise traits independently: i.e., that the round shape trait dominance over the wrinkled, and the yellow color trait dominance over the green, held true in the pairwise experiments just as they did in the single trait experiments. Shape did not seem to interfere with color, at least in the first generation. But how about the second generation? How will the bivariate shape/color traits manifest, and how will this influence the composition of the second generation hybrids? The circumstances are clearly more complicated and require more careful analysis.
Second generation: Postulate, Theoretical Analysis and Results: Rather than begin with the experimental results and then wend our way through the theoretical analysis required to explain the observations, we find it rather more instructive to begin from a postulate, and consequent theoretical analysis, and proceed to compare the theoretical predictions with experimental data.
As with the single trait case, let us define the following random variables: for shape,

X1 = { 0, Wrinkled
     { 1, Round          (7.6)

and for color,

X2 = { 0, Green
     { 1, Yellow         (7.7)

As obtained previously, the single trait marginal pdfs for second generation hybrid plants are given by:

f1(x1) = { 1/4;  x1 = 0
         { 3/4;  x1 = 1          (7.8)

for shape, and, similarly, for color,

f2(x2) = { 1/4;  x2 = 0
         { 3/4;  x2 = 1          (7.9)
Consider first the simplest postulate: that multiple trait transmissions are independent. If this is true, then by definition of stochastic independence the joint pdf will be given by:

f(x1, x2) = f1(x1) f2(x2)    (7.12)

TABLE 7.2: Theoretical joint pdf f(x1, x2) for second generation trait combinations under the independence postulate

                x2 = 0 (g)    x2 = 1 (Y)
x1 = 0 (w)      1/16          3/16
x1 = 1 (R)      3/16          9/16

Mendel's own report of the experimental outcomes reads:
The fertilized seeds appeared round and yellow like those of the
seed parents. The plants raised therefrom yielded seeds of four
sorts, which frequently presented themselves in one pod. In all,
556 seeds were yielded by 15 plants, and of these there were:
315 round and yellow,
101 wrinkled and yellow,
108 round and green,
32 wrinkled and green.
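A quick computational check of the independence postulate against these counts is instructive. The short Python sketch below (an illustration, with the counts taken from the quotation above) computes the theoretical joint probabilities from Eq (7.12) alongside the empirical relative frequencies:

    # Sketch: theoretical f(x1, x2) = f1(x1) f2(x2) versus observed fractions.
    marg_shape = {"R": 3/4, "w": 1/4}        # Eq (7.8)
    marg_color = {"Y": 3/4, "g": 1/4}        # Eq (7.9)
    counts = {("R", "Y"): 315, ("w", "Y"): 101, ("R", "g"): 108, ("w", "g"): 32}
    total = sum(counts.values())             # 556 seeds in all

    for (s, c), n in counts.items():
        print((s, c), round(marg_shape[s] * marg_color[c], 2), round(n / total, 2))
    # (R,Y): 0.56 vs 0.57; (w,Y): 0.19 vs 0.18; (R,g): 0.19 vs 0.19; (w,g): 0.06 vs 0.06

These are precisely the entries summarized in Table 7.3 below.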
TABLE 7.3: Theoretical versus experimental results for second generation hybrid plants

Trait         Theoretical     Experimental
Phenotype     Distribution    Frequencies
(R,Y)         0.56            0.57
(w,Y)         0.19            0.18
(R,g)         0.19            0.19
(w,g)         0.06            0.06
Since the experimental results matched the theoretical predictions remarkably well, the conclusion is that indeed the transmission of color is independent of the transmission of the shape trait.
7.2.5 Final Comments
Almost a century and a half after, with probability theory now a familiar fixture in the scientific landscape, and with the broad principles and consequences of genetics part of popular culture, it may be difficult for modern readers to appreciate just how truly revolutionary Mendel's experiments and his application of probability theory were. Still, it was the application of probability theory that made it possible for Mendel to predict, ahead of time, the ratio between phenotypic (visible) occurrences of dominant traits and recessive traits that will arise from a given set of parent genotype (hereditary) traits (although by today's standards all this may now appear routine and straightforward). By thus unraveling the mysteries of a phenomenon that was essentially considered unpredictable and due to chance alone plus some vague
7.3 World War II Warship Tactical Response Under Attack
Unlike in the previous example, where certain systematic underlying biological mechanisms were responsible for the observations, one often must deal with random phenomena for which there are no such easily discernible mechanisms behind the observations. Such is the case with the problem we are about to discuss, involving Japanese suicide bomber plane attacks during World War II. It shows how historical data sets were used to estimate empirical conditional probabilities, and these probabilities subsequently used to answer a very important question with significant consequences for US Naval operations at that crucial moment during the war.
7.3.1 Background and Problem Statement
During World War II, US warships attacked by Japanese kamikaze pilots had two mutually exclusive tactical options: sharp evasive maneuvers to elude the attacker and confound his aim, making a direct hit more difficult to achieve; or offensive counterattack using anti-aircraft artillery. The two options are mutually exclusive because the effectiveness of the counterattacking aircraft guns required maintaining a steady course, presenting an easier target for the incoming kamikaze pilot; sharp maneuvering warships, on the other hand, are entirely unable to aim and deploy their anti-aircraft guns with much effectiveness. A commitment to one option therefore immediately precludes the other.

Since neither tactic was perfectly effective in foiling kamikaze attacks, and since different types of warships (cruisers, aircraft carriers, destroyers, etc.) appeared to experience varying degrees of success with the different options, naval commanders needed a definitive, rational system for answering the question: When attacked by Japanese suicide planes, what is the appropriate tactic for a US warship, evasive maneuvers or offensive counterattack?
TABLE 7.4: Summary of kamikaze attack data: number of attacks, with the number of direct hits sustained shown in parentheses

                   Evasive          Offensive
                   maneuvers (E)    counterattack (C)    Total
Large ships (L)     36  (8)          61 (30)              97  (38)
Small ships (S)    144 (52)         124 (32)             268  (84)
Total              180 (60)         185 (62)             365 (122)
7.3.2 Analysis
Let H represent the event that an attacked ship is hit, E the event that it chose evasive maneuvers, C the event that it chose to counterattack, and L and S, respectively, the events that the ship is large or small. Then, from the data in Table 7.4, the empirical probability of being hit while taking evasive maneuvers is:

P(H|E) = P(H ∩ E)/P(E) = (60/365)/(180/365) = 0.333    (7.14)

(or simply directly from the totals in the table). Also,

P(H|C) = 62/185 = 0.335    (7.15)

Similarly, irrespective of the tactic employed,

P(H|L) = 38/97 = 0.392    (7.16)
P(H|S) = 84/268 = 0.313    (7.17)
so that it appears as if small ships have a slight edge in surviving the attacks, regardless of tactics employed. But it is possible to refine these probabilities further by taking both size and tactics into consideration, as follows:
For large ships, we obtain:

P(H|L ∩ E) = 8/36 = 0.222    (7.18)
P(H|L ∩ C) = 30/61 = 0.492    (7.19)

where we see the first clear indication of an advantage: large ships making evasive maneuvers are more than twice as effective in avoiding hits as their counterattacking counterparts.

For small ships,

P(H|S ∩ E) = 52/144 = 0.361    (7.20)
P(H|S ∩ C) = 32/124 = 0.258    (7.21)
and while the advantage is not nearly as dramatic as with large ships, it is still quite clear that small ships are more effective when counterattacking. The final recommendation is now clear: when attacked by kamikaze pilots, large ships should employ sharp evasive maneuvers, while small ships should hold course and counterattack with their anti-aircraft artillery.
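The entire chain of computations in Eqs (7.14)-(7.21) reduces to simple ratios of counts; the following Python sketch (illustrative code, not part of the original study) reproduces the size-and-tactic probabilities from the counts in Table 7.4:

    # Sketch: empirical conditional probabilities P(H | size, tactic).
    attacks = {("L", "E"): 36, ("L", "C"): 61, ("S", "E"): 144, ("S", "C"): 124}
    hits    = {("L", "E"):  8, ("L", "C"): 30, ("S", "E"):  52, ("S", "C"):  32}

    for key in attacks:
        print(key, round(hits[key] / attacks[key], 3))
    # (L,E) 0.222; (L,C) 0.492; (S,E) 0.361; (S,C) 0.258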
7.3.3
Final Comments
In hindsight these conclusions from the Naval study are perfectly logical, almost obvious; but at the time, the stakes were high, time was of the essence, and nothing was clear or obvious. It is gratifying to see the probabilistic analysis bring such clarity and yield results that are in perfect keeping with common sense after the fact.
7.4 Summary and Conclusions
This chapter, the first of a planned trilogy of case studies (see Chapters 11 and 20 for the others), has been concerned with demonstrating the application of probability theory to two specific historical problems, each with its own significant practical implication that was probably not evident at the time the work was being done. The first, Mendel's ground-breaking work on genetics, is well-structured, and showed how a good theory can help clarify confusing experimental data. The second, the US Navy's analysis of kamikaze attack data during WW II, is less structured. It demonstrates how data, converted to empirical probabilities, can be used to make appropriate decisions. Viewed from this distance in time, and from the generally elevated heights of scientific sophistication of today, it will be all too easy to misconstrue these applications as quaint, if not trivial. But that will be a gross mistake. The significance of these applications must be evaluated within the context of the time in history when the work was done, vis-à-vis the tools available to the researchers at the time. The US Naval application saved lives and irreversibly affected the course of the war in the Pacific theater. Mendel's result did not just unravel a vexing 19th century mystery; it changed the course of biological research for good, even though it was not obvious at the time. All these were made possible by the appropriate application of probability theory.
Part III
Distributions
Chapter 8
Ideal Models of Discrete Random Variables
8.1 Introduction
8.2 The Discrete Uniform Random Variable
    8.2.1 Basic Characteristics and Model
    8.2.2 Applications
8.3 The Bernoulli Random Variable
    8.3.1 Basic Characteristics
    8.3.2 Model Development
    8.3.3 Important Mathematical Characteristics
8.4 The Hypergeometric Random Variable
    8.4.1 Basic Characteristics
    8.4.2 Model Development
    8.4.3 Important Mathematical Characteristics
    8.4.4 Applications
8.5 The Binomial Random Variable
    8.5.1 Basic Characteristics
    8.5.2 Model Development
    8.5.3 Important Mathematical Characteristics
        Relation to other random variables
    8.5.4 Applications
        Inference
8.6 Extensions and Special Cases of the Binomial Random Variable
    8.6.1 Trinomial Random Variable
        Basic Characteristics
        The Model
        Some Important Results
    8.6.2 Multinomial Random Variable
    8.6.3 Negative Binomial Random Variable
        Basic Characteristics
        Model Development
        Important Mathematical Characteristics
    8.6.4 Geometric Random Variable
        The Model
        Important Mathematical Characteristics
        Applications
8.7 The Poisson Random Variable
    8.7.1 The Limiting Form of a Binomial Random Variable
        Model Development
    8.7.2 First Principles Derivation
        Basic Characteristics
        Model Development
    8.7.3 Important Mathematical Characteristics
    8.7.4 Applications
        Standard Poisson Phenomena
        Overdispersed Poisson-like Phenomena and the Negative Binomial Distribution
8.8 Summary and Conclusions
REVIEW QUESTIONS
EXERCISES
APPLICATION PROBLEMS
All these constructions and the laws connecting them
can be arrived at by the principle of looking for
the mathematically simplest concepts and the link between them.
Albert Einstein (1879–1955)
Having presented the probability distribution function, f(x), as our mathematical function of choice for representing the ensemble behavior of random phenomena, and having examined the properties and characteristics of the generic pdf extensively in the last four chapters, it now remains to present specific probability distribution functions for some actual real-world phenomena of practical importance. We do this in each case by starting with all the relevant information about the phenomenological mechanism behind the specific random variable, X; and, in much the same way as for deterministic phenomena, we derive the expression for the pdf f(x) appropriate to the random phenomenon in question. The end result is several ideal models of random variability, presented as a collection of probability distribution functions, each derived directly from, and hence explicitly linked to, the underlying random phenomenological mechanisms.

This chapter and the next one are devoted to the development and analysis of such models for some important random variables that are commonly encountered in practice, beginning here with discrete random variables.
8.1 Introduction
As articulated briefly in the prelude chapter (Chapter 0), it is entirely possible to develop, from first-principles phenomenological considerations, appropriate theoretical characterizations of the variability inherent to random phenomena. Two primary benefits accrue from this first-principles approach:

1. It acquaints the reader with the mechanistic underpinnings of the random variable and the genesis of its pdf, making it less likely that the reader will inadvertently misapply the pdf to a problem to which it is unsuited. The single most insidious trap into which unsuspecting engineers and scientists often fall is that of employing a pdf inappropriately to try and solve a problem requiring a totally different pdf: for example, attempting to use the (continuous) Gaussian pdf, simply out of familiarity, inappropriately to tackle a problem involving the (discrete) phenomenon of the number of occurrences of safety incidents in a manufacturing site, a natural Poisson random variable.
2. It demonstrates the principles and practice of how one goes about developing such probability models, so that should it become necessary to deal with a new random phenomenon with no pre-existing "canned" model, the reader is able to fall back on first principles to derive, with confidence, the required model.

The modeling exercise begins with a focus on discrete random variables first in this chapter, and continuous random variables next in the following chapter. In developing these models, we will draw on ideas and concepts discussed in earlier chapters about random variables, probability, probability distributions, the calculus of probability, etc., and utilize the following model development and analysis strategy:

1. Identify the basic characteristics of the random phenomenon in question;
2. Develop an appropriate probability model from these characteristics;
3. Examine the important mathematical characteristics of the model;
4. Illustrate the application of the model with practical examples.

We start from the simplest possible random variable and build up from there, presenting some results without proof or else leaving such proofs as exercises to the reader where appropriate.
8.2 The Discrete Uniform Random Variable

8.2.1 Basic Characteristics and Model
f(xi) = { 1/k;  i = 1, 2, ..., k
        { 0;   otherwise          (8.1)
with the random variable earning its name because f(x) is uniform across the valid range of admissible values. Thus, Eq (8.1) is the pdf for the discrete uniform random variable, UD(k). The only characteristic parameter is k, the total number of elements in the sample space. Sometimes the k elements are indexed to include 0, i.e., i = 0, 1, 2, ..., k − 1 (allowing easier connection to the case where k = 2 and the only two outcomes are the binary numbers 0, 1). Under these circumstances, the mean and variance are:
μ = E(X) = (k − 1)/2    (8.2)

and

σ² = (k + 1)(k − 1)/12    (8.3)
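As a quick numerical check of Eqs (8.2) and (8.3), the short Python sketch below computes the mean and variance of UD(k) by direct enumeration, with k = 6 as an arbitrary illustrative choice:

    # Sketch: moments of the zero-indexed discrete uniform random variable.
    k = 6
    values = range(k)                              # i = 0, 1, ..., k-1
    mean = sum(values) / k
    var = sum((x - mean) ** 2 for x in values) / k
    print(mean, (k - 1) / 2)                       # 2.5 and 2.5
    print(var, (k + 1) * (k - 1) / 12)             # 2.9167 and 2.9167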
8.2.2 Applications
8.3 The Bernoulli Random Variable

8.3.1 Basic Characteristics
8.3.2 Model Development

Since the sample space consists of only the two outcomes, "success," S, and "failure," F, assigning to the event of success the probability p gives:

f(x = 0) = (1 − p)    (8.5)
f(x = 1) = p    (8.6)

The pdf for the Bernoulli random variable is then given by the more compact:

f(x) = p^(IS) (1 − p)^(IF)    (8.10)

where the indicator variables IS and IF take the value 1 when the outcome is, respectively, a success or a failure, and 0 otherwise.
8.3.3 Important Mathematical Characteristics

The following are important characteristics of the Bernoulli random variable, Bn(p), and its pdf:

1. Characteristic parameter: p, the probability of success;
2. Mean: μ = E(X) = p;
3. Variance: σ² = p(1 − p); or σ = √(p(1 − p)).
8.4 The Hypergeometric Random Variable

8.4.1 Basic Characteristics
8.4.2 Model Development

The Sample Space: After each experiment, the outcome ωi is the n-tuple

ωi = [a1, a2, ..., an]i    (8.12)

Counting the number of ways of choosing x of the Nd "defective" items, together with the remaining (n − x) chosen from the (N − Nd) "non-defective" ones, out of the total number of ways of choosing n items from N, yields the pdf:

f(x) = C(Nd, x) C(N − Nd, n − x) / C(N, n)    (8.15)

where C(a, b) = a!/(b!(a − b)!) denotes the binomial coefficient.
8.4.3 Important Mathematical Characteristics

8.4.4 Applications
This random variable and its pdf model find application mostly in acceptance sampling. The following are a few examples of such applications.
Example 8.1: APPLICATION OF THE HYPERGEOMETRIC MODEL
A batch of 20 electronic chips contains 5 defectives. Find the probability that out of 10 selected for inspection (without replacement) 2 will be found defective.

Solution:
In this case, x = 2, Nd = 5, N = 20, n = 10, and therefore:

f(2) = C(5, 2) C(15, 8) / C(20, 10) = 0.348    (8.17)

so that there is a surprisingly high probability that the lot will be accepted even though 16% is defective. Perhaps the acceptance sampling protocol needs to be re-examined.
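The computation in Eq (8.17) is easily reproduced; the sketch below uses math.comb from the Python standard library (version 3.8 or later):

    # Sketch: hypergeometric probability for Example 8.1.
    from math import comb

    N, Nd, n, x = 20, 5, 10, 2
    f = comb(Nd, x) * comb(N - Nd, n - x) / comb(N, n)
    print(round(f, 3))                             # 0.348, as in Eq (8.17)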
8.5 The Binomial Random Variable

8.5.1 Basic Characteristics
8.5.2 Model Development

Consider the compound event Ex: obtaining, in one specific sequence, exactly x successes, S, and (n − x) failures, F, in the n trials of a given experiment.
The Model: Given that the probability of success in each trial is P(S) = p, and by default P(F) = (1 − p) = q, then by the independence of the n trials in each experiment, the probability of the occurrence of the compound event Ex defined above is:

P(Ex) = p^x (1 − p)^(n−x)    (8.22)

However, in the original sample space Ω, there are C(n, x) different such events in which the sequence in ωi contains x successes and (n − x) failures, where

C(n, x) = n!/(x!(n − x)!)    (8.23)

Thus, P(X = x) is the sum of all events contributing to the pre-image in Ω of the event that the random variable X takes on the value x; i.e.,

P(X = x) = C(n, x) p^x (1 − p)^(n−x)    (8.24)

Thus, the pdf for the binomial random variable, Bi(n, p), is:

f(x) = [n!/(x!(n − x)!)] p^x (1 − p)^(n−x)    (8.25)

8.5.3 Important Mathematical Characteristics
The following are important characteristics of the binomial random variable, Bi(n, p), and its pdf:

1. Characteristic parameters: n, p; respectively, the number of independent trials in each experiment, and the probability of success in each trial;
2. Mean: μ = E(X) = np;
3. Variance: σ² = np(1 − p);
4. Moment generating function: M(t) = [pe^t + (1 − p)]^n;
5. Characteristic function: φ(t) = [pe^(jt) + (1 − p)]^n;
8.5.4 Applications

For example, for a binomial random variable with n = 20 and p = 0.05, the probability of observing exactly one "success" is:

P(X = 1) = C(20, 1)(0.05)(0.95)^19 = 0.377    (8.29)
Example 8.4 APPLICATION OF THE BINOMIAL MODEL:
DESIGN
From the sales record of an analytical equipment manufacturing company, it is known that their sales reps typically make on average one
sale of a top-of-the-line near-infrared device for every 3 attempts. In
preparing a training manual for future sales reps, the company would
like to specify n, the smallest number of sales attempts each sales rep
should make (per week) such that the probability of scoring an actual
sale (per week) is greater than 0.8. Find n.
Solution:
This problem may also be idealized as involving a binomial Bi(n, p) random variable in which p = 1/3, but for which n is an unknown to be determined to satisfy a design criterion. Finding the probability of the event of interest, (X ≥ 1), is easier if we consider the complement, the event of making no sale at all (X = 0), i.e.,

P(X ≥ 1) = 1 − P(X = 0)    (8.30)

In this case, since

f(0) = (2/3)^n    (8.31)

then, we want

1 − (2/3)^n > 0.8    (8.32)

i.e., (2/3)^n < 0.2, from which n > ln(0.2)/ln(2/3) = 3.97; the smallest number of sales attempts is therefore n = 4.
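The same design value is obtained by direct search, as in the following Python sketch:

    # Sketch: smallest n with P(X >= 1) = 1 - (2/3)^n > 0.8.
    n = 1
    while 1 - (2 / 3) ** n <= 0.8:
        n += 1
    print(n)                                       # 4, since (2/3)^4 = 0.198 < 0.2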
Inference
A fundamental question about binomial random variables, and indeed all random variables, centers around how the parameters indicated in the pdfs may be determined from data. This is a question that will be considered later in greater detail and in a broader context; for now, we consider the following specific question as an illustration: Given data, what can we say about p?

In the particular case of a coin toss, this is a question about determining the "true" probability of obtaining a head (or tail) given data from actual coin toss experiments; in the case of predicting the sex of babies, it is about determining the probability of having a boy or girl given hospital birth records; and, as discussed extensively in Chapter 11, in in-vitro fertilization, it is determining, from appropriate fertility clinic data, the probability that a particular single embryo will lead to a successful pregnancy. The answer to this specific question is one of a handful of important fundamental results of probability theory.

Let X be the random variable representing the number of successes in n independent trials, each with an equal probability of success, p, so that X/n is the relative frequency of success.
From Chebyshev's inequality, for any random variable X with mean μ and variance σ²,

P(|X − μ| ≥ kσ) ≤ 1/k²    (8.34)

so that, for any ε > 0,

P(|X − μ| ≥ nε) ≤ σ²/(ε²n²)    (8.35)

since, by comparison of the left hand sides of these two equations, we obtain kσ = nε, from which k is easily determined, giving rise to the RHS of the inequality above. And now, because we are particularly concerned with the binomial random variable for which μ = np and σ² = np(1 − p), we have:

P(|X − np| ≥ nε) ≤ p(1 − p)/(nε²)    (8.36)

For every ε > 0, as n → ∞,

lim(n→∞) p(1 − p)/(nε²) = 0    (8.37)

so that

lim(n→∞) P(|X/n − p| ≥ ε) = 0    (8.38)

or, equivalently,

lim(n→∞) P(|X/n − p| < ε) = 1    (8.39)

Together, these two equations constitute one form of the Law of Large Numbers, indicating, in this particular case, that the relative frequency of success (the number of successes observed per n trials) approaches the actual probability of success, p, as n → ∞, with probability 1. Thus, for a large number of trials:

x/n ≈ p    (8.40)
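A simple Monte Carlo experiment makes Eq (8.40) concrete; in the Python sketch below (with an arbitrarily chosen p = 0.3 and a fixed seed for reproducibility), the relative frequency of success settles ever closer to p as n grows:

    # Sketch: relative frequency x/n converging to p.
    import random

    random.seed(1)
    p = 0.3
    for n in (10, 100, 10_000, 1_000_000):
        x = sum(random.random() < p for _ in range(n))
        print(n, x / n)                            # x/n approaches 0.3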
8.6 Extensions and Special Cases of the Binomial Random Variable

We now consider a series of random variables that are either direct extensions of the binomial random variable (the trinomial and general multinomial random variables), or are special cases (the negative binomial and the geometric random variables).
8.6.1 Trinomial Random Variable
Basic Characteristics
In direct analogy to the binomial random variable, the following basic phenomenon underlies the trinomial random variable:

1. Each experiment consists of n independent trials under identical conditions;
2. Each trial produces exactly three mutually exclusive outcomes, ω1, ω2, ω3 (Good, Average, Poor; A, B, C; etc.);
3. In each single trial, the probability of obtaining outcome ω1 is p1; the probability of obtaining ω2 is p2; and the probability of obtaining ω3 is p3, with p1 + p2 + p3 = 1;
4. The random variable of interest is the two-dimensional, ordered pair (X1, X2), where X1 is the number of times that outcome 1, ω1, occurs in the n trials, and X2 is the number of times that outcome 2, ω2, occurs in the n trials. (The third random variable, X3, the complementary number of times that outcome 3, ω3, occurs in the n trials, is constrained to be given by X3 = n − X1 − X2; it is not independent.)
The Model
It is easy to show, following the same arguments employed in deriving the binomial model, that the trinomial model is:

f(x1, x2) = [n!/(x1! x2! (n − x1 − x2)!)] p1^(x1) p2^(x2) p3^(n−x1−x2)    (8.41)
Some Important Results

The joint mgf, M(t1, t2) = E[e^(t1 X1 + t2 X2)], obtained by summing over all x1 = 0, 1, ..., n and x2 = 0, 1, ..., n − x1, is:

M(t1, t2) = [p1 e^(t1) + p2 e^(t2) + p3]^n    (8.43)

from which

M(t1, 0) = [p1 e^(t1) + (1 − p1)]^n    (8.44)
M(0, t2) = [p2 e^(t2) + (1 − p2)]^n    (8.45)

indicating marginal mgfs, which, when compared with the mgf obtained earlier for the binomial random variable, shows that:

1. The marginal distribution of X1 is that of the Bi(n, p1) binomial random variable;
2. The marginal distribution of X2 is that of the Bi(n, p2) binomial random variable.
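These marginal results are easy to illustrate by simulation. The sketch below assumes the numpy library is available; the parameter values are arbitrary illustrative choices:

    # Sketch: the marginal of X1 from trinomial draws behaves like Bi(n, p1).
    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 10, [0.5, 0.3, 0.2]
    draws = rng.multinomial(n, p, size=200_000)    # columns: x1, x2, x3
    x1 = draws[:, 0]
    print(x1.mean(), n * p[0])                     # ~5.0 versus np1 = 5.0
    print(x1.var(), n * p[0] * (1 - p[0]))         # ~2.5 versus np1(1-p1) = 2.5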
8.6.2 Multinomial Random Variable

The direct generalization to k mutually exclusive outcomes per trial, with respective probabilities p1, p2, ..., pk, gives the multinomial pdf:

f(x1, x2, ..., xk) = [n!/(x1! x2! ... xk!)] p1^(x1) p2^(x2) ... pk^(xk)    (8.47)

along with

Σ(i=1 to k) xi = n    (8.48)
8.6.3 Negative Binomial Random Variable
Basic Characteristics

The basic phenomenon underlying the negative binomial random variable is very similar to that for the original binomial random variable dealt with earlier:

1. Like the binomial random variable, each trial produces exactly two mutually exclusive outcomes, S (success) and F (failure); the probability of success in each trial is P(S) = p;
2. The experiment consists of as many trials as are needed to obtain k successes, with each trial considered independent, and carried out under identical conditions;
3. The random variable, X, is the number of "failures" before the kth "success." (Since the labels "S" and "F" are arbitrary, this could also be considered as the number of "successes" before the kth "failure" if it is more logical to consider the problem in this fashion, so long as we are consistent with what we refer to as a "success" and its complement that is referred to as the "failure.")
Model Development

From the definition of the random variable, X, the total number of trials, n, required to obtain exactly k successes is X + k; and mechanistically, the event X = x occurs as a combination of two independent events: (i) obtaining x failures and k − 1 successes in the first x + k − 1 trials, and (ii) obtaining a success in the (x + k)th trial. Thus:

P(X = x) = [C(x + k − 1, k − 1) p^(k−1) (1 − p)^x] p    (8.50)

Thus, the model for the negative binomial random variable NBi(k, p) is:

f(x) = C(x + k − 1, k − 1) p^k (1 − p)^x;  x = 0, 1, 2, ...    (8.51)

which is also sometimes written in the entirely equivalent form (see Exercise 8.10):

f(x) = C(x + k − 1, x) p^k (1 − p)^x;  x = 0, 1, 2, ...    (8.52)
(In some instances, the random variable is defined as the total number of trials required to obtain exactly k successes; the discussion above is easily modified for such a definition of X. See Exercise 8.10.)

In the most general sense, the parameter k of the negative binomial random variable in fact does not have to be an integer. In most engineering applications, however, k is almost always an integer. In honor of the French mathematician and philosopher, Blaise Pascal (1623–1662), in whose work one will find the earliest mention of this distribution, the negative binomial distribution with integer k is often called the Pascal distribution. When the parameter k is real-valued, the pdf is known as the Polya distribution, in honor of the Hungarian mathematician, George Pólya (1887–1985), and written as:

f(x) = [Γ(x + k)/(Γ(k) x!)] p^k (1 − p)^x;  x = 0, 1, 2, ...    (8.53)

When k is an integer, Γ(x + k) = (x + k − 1)! and Γ(k) = (k − 1)!, so that Γ(x + k)/(Γ(k) x!) = C(x + k − 1, k − 1), and the pdf in Eq (8.53) will coincide with that in Eq (8.51) or Eq (8.52).
Important Mathematical Characteristics

The following are important characteristics of the negative binomial random variable, NBi(k, p), and its pdf:

1. Characteristic parameters: k, p; respectively, the target number of successes, and the probability of success in each trial;
2. Mean: μ = E(X) = k(1 − p)/p = kq/p;
3. Variance: σ² = k(1 − p)/p² = kq/p²;
In terms of the mean, μ = kq/p, the parameter p may be expressed as:

p = k/(k + μ)    (8.58)

so that the pdf in Eq (8.52) may be written as:

f(x) = [(x + k − 1)!/((k − 1)! x!)] (μ/k)^x / (1 + μ/k)^(k+x)    (8.59)
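The pdf in Eq (8.51) and the mean kq/p may be checked against a standard implementation. The sketch below assumes the scipy library is available; scipy.stats.nbinom uses the same "failures before the k-th success" convention, and the values of k, p and x are arbitrary illustrative choices:

    # Sketch: Eq (8.51) versus scipy.stats.nbinom.
    from math import comb
    from scipy.stats import nbinom

    k, p, x = 3, 0.4, 5
    f_manual = comb(x + k - 1, k - 1) * p**k * (1 - p) ** x
    print(f_manual, nbinom.pmf(x, k, p))           # identical: 0.1045...
    print(nbinom.mean(k, p), k * (1 - p) / p)      # both 4.5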
8.6.4 Geometric Random Variable

Consider the special case of the negative binomial random variable with k = 1, where the resulting random variable X is the number of failures before the first success. It follows immediately from Eq (8.51) that the required pdf in this case is:

f(x) = pq^x;  x = 0, 1, 2, ...    (8.60)
The Model

It is more common to consider the geometric random variable as the number of trials (not failures) required to obtain the first success. It is easy to see that this definition of the geometric random variable merely requires a shift by one in the random variable discussed above, so that the pdf for the geometric random variable is given by:

f(x) = pq^(x−1);  x = 1, 2, ...    (8.61)

Important Mathematical Characteristics

The mean and variance of the geometric random variable, G(p), are, respectively (see Exercise 8.12): μ = E(X) = 1/p and σ² = q/p².
Applications

One of the most important applications of the geometric random variable is in free radical polymerization where, upon initiation, monomer units add to a growing chain, with each subsequent addition propagating the chain until a termination event stops the growth. After initiation, each trial involves either a "propagation" event (the successful addition of a monomer unit to the growing chain), or a "termination" event, where the polymer chain is "capped" to yield a "dead" polymer chain that can no longer add another monomer unit. Because the outcome of each trial (propagation or termination) is random, the resulting polymer chains are of variable length; in fact, the physical properties and performance characteristics of the polymer are related directly to the chain length distribution. It is therefore of primary interest to characterize polymer chain length distributions appropriately.

Observe that as described above, the phenomenon underlying free-radical polymerization is such that each polymer chain length is precisely the total number of monomer units added until the occurrence of the termination event. Thus, if termination is considered a "success," then the chain length is a geometric random variable. In polymer science textbooks (e.g., Williams, 1971¹), chemical kinetics arguments are often used to establish what is referred to as the "most probable chain length distribution"; the result is precisely the geometric pdf presented here. In Chapter 10, we use maximum entropy considerations to arrive at the same results.
Example 8.5: APPLICATION OF THE GEOMETRIC DISTRIBUTION MODEL
From their prior history, it is known that the probability of a building construction company recording an accident (minor or major) on any day during construction is 0.2. (a) Find the probability of going 7 days (since the last occurrence) before recording the 1st accident. (b) What is the expected number of days before recording the 1st accident?

Solution:
This problem clearly involves the geometric random variable with p = 0.2. Thus,
(a) the required P(X = 7) is obtained as:

f(7) = 0.2(0.8)^6 = 0.05    (8.62)

(b) the expected number of days before recording the 1st accident is:

E(X) = 1/p = 1/0.2 = 5 days    (8.63)

¹Williams, Polymer Science and Engineering, Prentice Hall, NJ, 1971, pp. 58-59.
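The results of Example 8.5 may be confirmed with scipy.stats.geom, whose support x = 1, 2, ... matches the convention of Eq (8.61); this sketch assumes scipy is available:

    # Sketch: geometric probabilities for Example 8.5.
    from scipy.stats import geom

    p = 0.2
    print(geom.pmf(7, p))                          # 0.0524..., i.e., 0.05 of Eq (8.62)
    print(geom.mean(p))                            # 5.0 days, as in Eq (8.63)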
8.7 The Poisson Random Variable
The Poisson random variable is encountered in many practical applications, ranging from industrial manufacturing to physics and biology, and even in such military problems as the historic study of deaths by horse-kicks in the Prussian army, and the German bombardment of London during World War II.

We present here two approaches to the development of the probability model for this important random variable: (i) as a limiting form of the binomial (and negative binomial) random variable; (ii) from first principles.
8.7.1 The Limiting Form of a Binomial Random Variable

Model Development

If, in the binomial pdf of Eq (8.25), we let n → ∞ in such a way that the mean np = λ remains constant (so that p = λ/n), then:

f(x) = lim(n→∞) [n!/(x!(n − x)!)] (λ/n)^x (1 − λ/n)^(n−x)    (8.64)

     = lim(n→∞) {[1(1 − 1/n)(1 − 2/n)...(1 − (x − 1)/n)]/x!} λ^x (1 − λ/n)^n / (1 − λ/n)^x    (8.65)

Now, because:

lim(n→∞) (1 − λ/n)^n = e^(−λ)    (8.66)
lim(n→∞) (1 − λ/n)^x = 1    (8.67)

the latter being the case because x is fixed, f(x) therefore reduces to:

f(x) = e^(−λ) λ^x / x!;  x = 0, 1, 2, ...    (8.68)

This is the pdf of the Poisson random variable, P(λ), with the parameter λ.

It is also straightforward to show (see Exercise 8.14) that the Poisson pdf arises in the limit as k → ∞ for the negative binomial random variable, but with the mean kq/p = λ remaining constant; i.e., from Eq (8.59),

f(x) = lim(k→∞) [(x + k − 1)!/((k − 1)! x!)] (λ/k)^x / (1 + λ/k)^(k+x) = e^(−λ) λ^x / x!;  x = 0, 1, 2, ...    (8.69)
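The convergence asserted in Eqs (8.64)-(8.68) can be observed numerically. The pure-Python sketch below evaluates the Bi(n, λ/n) pmf for increasing n, with λ = 2 and x = 3 as illustrative choices:

    # Sketch: binomial probabilities approaching the Poisson limit.
    from math import comb, exp, factorial

    lam, x = 2.0, 3
    for n in (10, 100, 10_000):
        p = lam / n
        print(n, comb(n, x) * p**x * (1 - p) ** (n - x))
    print("Poisson:", exp(-lam) * lam**x / factorial(x))   # 0.1804...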
8.7.2 First Principles Derivation

Basic Characteristics

Considered from first principles, the basic phenomenon underlying the Poisson random variable is as follows:

1. The experiment consists of observing the number of occurrences of a particular event (a "success") in a given fixed interval (of time, length, space), or area, or volume, etc., of size z;
2. The probability, p, of observing exactly one success in a sub-interval of size Δz units (where Δz is small), is proportional to Δz; i.e.,

p = ηΔz    (8.70)

where η is a characteristic intensity (the mean number of successes per unit interval).
Model Development

We start by defining Px(z),

Px(z) = P(X = x in an interval of size z)    (8.71)

The event that X = x in an interval of size z + Δz can occur in the following mutually exclusive ways: E1, x successes in the interval of size z and none in Δz; E2, (x − 1) successes in the interval of size z and exactly one in Δz; and E3, fewer than (x − 1) successes in z with the balance occurring in Δz. The corresponding probabilities are:

P(E1) = Px(z)(1 − ηΔz)    (8.73)
P(E2) = Px−1(z)ηΔz    (8.74)
P(E3) = O(Δz)    (8.75)

where O(Δz) vanishes faster than Δz as Δz → 0. Hence,

Px(z + Δz) = Px(z)(1 − ηΔz) + Px−1(z)ηΔz + O(Δz);  x = 1, 2, ...    (8.76)

In particular, for x = 0, we have:

P0(z + Δz) = P0(z)(1 − ηΔz)    (8.77)

Rearranging and taking limits as Δz → 0 yields the system of differential equations:

dPx(z)/dz = ηPx−1(z) − ηPx(z);  x = 1, 2, ...    (8.80)
dP0(z)/dz = −ηP0(z)    (8.81)

To solve these equations requires the following initial conditions: P0(0), the probability of finding no success in the interval of size 0 (a "certain" event), is 1; Px(0), the probability of finding x successes in the interval of size 0 (an "impossible" event), is 0. With these initial conditions, we obtain, first for P0(z), that

P0(z) = e^(−ηz)    (8.82)

which we may now introduce into Eq. (8.80) and solve recursively for x = 1, 2, ... to obtain, in general (after some "tidying up"),

Px(z) = (ηz)^x e^(−ηz) / x!    (8.83)

Thus, from first principles, the model for the Poisson random variable is given by:

f(x) = (ηz)^x e^(−ηz) / x! = λ^x e^(−λ) / x!    (8.84)

with λ = ηz.
8.7.3 Important Mathematical Characteristics

The following are important characteristics of the Poisson random variable, P(λ), and its pdf:

1. Characteristic parameter: λ;
2. Mean: μ = E(X) = λ;
3. Variance: σ² = λ (the equality of mean and variance is a distinguishing characteristic of this random variable);
4. Moment generating function: M(t) = e^(λ(e^t − 1));
5. Characteristic function: φ(t) = e^(λ(e^(jt) − 1));
6. Reproductive property: if Xi, i = 1, 2, ..., n, are n independent Poisson random variables, each with parameter λi, then Y = Σ(i=1 to n) Xi is also a Poisson random variable, with parameter λ = Σ(i=1 to n) λi. Because a sum of Poisson random variables begets another Poisson random variable, this characteristic is known as a "reproductive" property. This result is easily established using the method of characteristic functions discussed in Chapter 6. (See Exercise 8.17.)
8.7.4 Applications

For a Poisson random variable with λ = 2, for example,

f(x) = e^(−2) 2^x / x!    (8.87)

and the probability of observing more than two occurrences is

P(X > 2) = 1 − P(X ≤ 2) = 1 − [f(0) + f(1) + f(2)]    (8.88)

so that:

P(X > 2) = 1 − (0.135 + 0.271 + 0.271) = 0.323    (8.89)
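This computation may be confirmed with scipy.stats.poisson (assuming scipy is available); the survival function sf(x) returns P(X > x) directly:

    # Sketch: P(X > 2) for a Poisson random variable with lambda = 2.
    from scipy.stats import poisson

    print(poisson.sf(2, 2.0))                      # 0.3233..., matching Eq (8.89)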
TABLE 8.1: Theoretical versus empirical frequencies for inclusions data

x    Theoretical f(x)    Empirical Frequency
0    0.3679              0.367
1    0.3679              0.383
2    0.1839              0.183
3    0.0613              0.017
4    0.0153              0.033
5    0.0031              0.017
6    0.0005              0.000
a fiber optics wire is 1/1000, and that the probability of finding more than one blemish in this foot-long unit is 0, determine the probability of finding 5 blemishes in a 3,000 ft roll of wire.

Solution:
This problem involves a Poisson random variable with z = 1 foot and the intensity η = 1/1000 per foot. For z = 3,000 ft,

λ = ηz = 3.0    (8.90)

so that the required probability is:

P(X = 5) = e^(−3) 3^5 / 5! = 0.101    (8.91)
the discrepancies observed for values of x = 3, 4 and 5 are significant or not is a matter taken up in Part IV.

(2) The probability of obtaining 3 or fewer inclusions is computed as follows:

P(X ≤ 3) = F(3) = Σ(x=0 to 3) f(x) = 0.981    (8.92)

Consequently,

P(X ≥ 5) = 1 − F(4) = 1 − 0.996 = 0.004    (8.93)

indicating a very small probability that this process will produce sheets with 5 or more inclusions.
8.8 Summary and Conclusions
REVIEW QUESTIONS
1. What are the two primary benefits of the first-principles approach to probability modeling as advocated in this chapter?
2. What are the four components of the model development and analysis strategy
outlined in this chapter?
3. What are the basic characteristics of the discrete uniform random variable?
4. What is the probability model for the discrete uniform random variable?
5. What are some examples of a discrete uniform random variable?
6. What are the basic characteristics of the Bernoulli random variable?
7. What is the connection between the discrete uniform and the Bernoulli random
variables?
TABLE 8.2: Summary of important characteristics of probability models for discrete random variables

Uniform, UD(k)
    Parameters: k
    Model: f(xi) = 1/k; i = 1, 2, ..., k (or i = 0, 1, ..., k − 1)
    Mean: (k + 1)/2 (or (k − 1)/2)
    Variance: (k − 1)(k + 1)/12
    Relation to other variables: UD(2) = Bn(0.5)

Bernoulli, Bn(p)
    Parameters: p
    Model: f(x = 0) = (1 − p); f(x = 1) = p
    Mean: p
    Variance: pq (q = 1 − p)
    Relation: the sum Σ(i=1 to n) Xi of n independent Bn(p) variables is Bi(n, p)

Hypergeometric, H(n, Nd, N)
    Parameters: n, Nd, N
    Model: f(x) = C(Nd, x) C(N − Nd, n − x) / C(N, n)
    Mean: nNd/N (= np, with p = Nd/N)
    Variance: npq(N − n)/(N − 1)
    Relation: lim(N→∞) H(n, Nd, N) = Bi(n, p)

Binomial, Bi(n, p)
    Parameters: n, p
    Model: f(x) = C(n, x) p^x (1 − p)^(n−x); x = 0, 1, ..., n
    Mean: np
    Variance: np(1 − p)
    Relation: lim(n→∞) Bi(n, p) = P(λ), with np = λ

Multinomial, MN(n, pi)
    Parameters: n, pi
    Model: f(x1, x2, ..., xk) = [n!/(x1! x2! ... xk!)] p1^(x1) p2^(x2) ... pk^(xk);
           Σ xi = n; Σ pi = 1
    Mean: npi
    Variance: npi(1 − pi)
    Relation: marginal fi(xi) is Bi(n, pi)

Negative Binomial, NBi(k, p)
    Parameters: k, p
    Model: f(x) = C(x + k − 1, k − 1) p^k (1 − p)^x; x = 0, 1, 2, ...
    Mean: k(1 − p)/p
    Variance: k(1 − p)/p²
    Relation: NBi(1, p) = G(p); lim(k→∞) NBi(k, p) = P(λ), with k(1 − p)/p = λ

Geometric, G(p)
    Parameters: p
    Model: f(x) = pq^(x−1); x = 1, 2, ... (q = 1 − p)
    Mean: 1/p
    Variance: (1 − p)/p²
    Relation: G(p) = NBi(1, p)

Poisson, P(λ)
    Parameters: λ
    Model: f(x) = e^(−λ) λ^x / x!; x = 0, 1, 2, ...
    Mean: λ
    Variance: λ
    Relation: limiting form of Bi(n, p) and NBi(k, p), as noted above
29. What is the probability model for the Poisson random variable?
30. The Poisson model is most appropriate for what sort of phenomena?
31. What about the mean and the variance of the Poisson random variable is a
distinguishing characteristic of this random variable?
32. What is an overdispersed Poisson-like phenomenon? Give a few examples.
33. What probability model is more appropriate for overdispersed Poisson-like
phenomena and why?
EXERCISES
8.1 Establish the results given in the text for the variance, MGF and CF for the
Bernoulli random variable.
8.2 Given a hypergeometric H(n, Nd , N ) random variable X for which n = 5, Nd = 2
and N = 10:
(i) Determine and plot the entire pdf, f (x) for x = 0, 1, 2, . . . , 5
(ii) Determine P(X > 1) and P(X < 2)
(iii) Determine P(1 ≤ X ≤ 3)
8.3 A crate of 100 apples contains 5 that are rotten. A grocer purchasing the crate selects a sample of 10 apples at random and accepts the entire crate only if this sample contains no rotten apples. Determine the probability of accepting the crate. If the sample size is increased to 20, find the new probability of accepting the crate.
8.4 From the expression for the pdf of a binomial Bi(n, p) random variable, X, establish that E(X) = np and Var(X) = npq, where q = (1 − p).
8.5 Establish that if X1, X2, ..., Xn are n independent Bernoulli random variables, then:

X = Σ(i=1 to n) Xi

is a binomial random variable, Bi(n, p).

8.7 Obtain the recursion formula

f(x + 1) = [(n − x)/(x + 1)] [p/(1 − p)] f(x)    (8.94)

for the binomial pdf.
Use this to determine the value x* for which the pdf attains a maximum. (Keep in mind that because f(x) is not a continuous function of x, the standard calculus approach of finding optima by taking derivatives and setting to zero is invalid. Explore the finite difference f(x + 1) − f(x) instead.)
8.8 Given the joint pdf for the two-dimensional ordered pair (X1, X2) of the trinomial random variable (see Eq (8.41)), obtain the conditional pdfs f(x1|x2) and f(x2|x1).
8.9 Consider a chess player participating in a two-game pre-tournament qualification series. From past records in such games, it is known that the player has a probability pW = 0.75 of winning, a probability pD = 0.2 of drawing, and a probability pL = 0.05 of losing. If X1 is the number of wins and X2 is the number of draws, obtain the complete joint pdf f(x1, x2) for this player. From this, compute the marginal pdfs, f1(x1) and f2(x2), and finally obtain the conditional pdfs f(x1|x2) and f(x2|x1).
8.10 (i) Establish the equivalence of Eq (8.51) and Eq (8.52), and also the equivalence of Eq (8.53) and Eq (8.52) when k is a positive integer.
(ii) If the negative binomial random variable is defined as the total number of trials (not "failures") required to obtain exactly k successes, obtain the probability model in this case and compare it to the model given in Eq (8.51) or Eq (8.52).
8.11 Obtain the recursion formula

f(x + 1) = φ(k, x, p) f(x)    (8.95)

for the negative binomial pdf, showing an explicit expression for φ(k, x, p). Use this expression to determine the value x* for which the pdf attains a maximum. (See comments in Exercise 8.7.) From this expression, confirm that the geometric distribution is monotonically decreasing.
8.12 (i) Establish that E(X) for the geometric random variable is 1/p and that Var(X) = q/p², where q = 1 − p.
(ii) Given that for a certain geometric random variable, P(X = 2) = 0.0475 and P(X = 10) = 0.0315, determine P(2 ≤ X ≤ 10).
(iii) The average chain length of a polymer produced in a batch reactor is given as 200 units, where chain length itself is known to be a geometric random variable. What fraction of the polymer product is expected to have chains longer than 200 units?
8.13 The logarithmic series random variable possesses the distribution

f(x) = α p^x / x;  0 < p < 1;  x = 1, 2, ...    (8.96)

First show that, for this to be a legitimate pdf,

α = −1/ln(1 − p)    (8.97)

and then establish the following mathematical characteristics of this random variable and its pdf:
8.15 Obtain the recursion formula

f(x + 1) = φ(λ, x) f(x)    (8.98)

for the Poisson pdf, showing an explicit expression for φ(λ, x). Use this expression to confirm that for all values of 0 < λ < 1, the Poisson pdf is always monotonically decreasing. Find the value x* for which the pdf attains a maximum for λ > 1. (See comments in Exercise 8.7.)
8.16 (i) Obtain the complete pdf, f(x), for the binomial random variable with n = 10, p = 0.05, for x = 0, 1, ..., 10, and compare it to the corresponding pdf, f(x), for a Poisson random variable with λ = 0.5.
(ii) Repeat (i) for n = 20, p = 0.5 for the binomial random variable, and λ = 10 for the Poisson random variable.
8.17 Show that if Xi, i = 1, 2, ..., n, are n independent Poisson random variables, each with parameter λi, then the random variable Y defined as:

Y = Σ(i=1 to n) Xi

is also a Poisson random variable, with parameter λ = Σ(i=1 to n) λi.
8.18 The number of yarn breaks per shift in a commercial fiber spinning machine is a Poisson random variable with λ = 3. Determine the probability of not experiencing any yarn break in a particular shift. What is the probability of experiencing more than 3 breaks per shift?
8.19 The probability of finding a single "fish-eye" gel particle (a solid blemish) on a sq cm patch of a clear adhesive polymer film is 0.0002; the probability of finding more than one is essentially zero. Determine the probability of finding 3 or more such blemishes on a 1 square meter roll of film.
8.20 For a Poisson P(λ) random variable, determine P(X ≤ 2) for λ = 0.5, 1, 2, 3. Does the observed behavior of P(X ≤ 2) as λ increases make sense? Explain.
8.21 The number of eggs laid by a particular bird per mating season is a Poisson random variable X, with parameter λ. The probability that any such egg successfully develops into a hatchling is p (the probability that it does not survive is (1 − p)). Assuming mutual independence of the development of each egg, if Y is the random variable representing the number of surviving hatchlings, establish that its probability distribution function is given by:

f(y) = e^(−λp) (λp)^y / y!    (8.99)
APPLICATION PROBLEMS
8.22 (i) A batch of 15 integrated-circuit chips contains 4 of an "irregular" type. If from this batch 2 chips are selected at random, and without replacement, find the probability that: (a) both are "irregular"; (b) none is "irregular"; (c) only one of the two is "irregular".
(ii) If the random variable in problem (i) above was mistakenly taken to be a binomial random variable (which it is not), recalculate the three probabilities and compare the corresponding results.
8.23 A pump manufacturer knows from past records that, in general, the probability of a certain specialty pump working continuously for fewer than 2 years is 0.3; the probability that it will work continuously for 2 to 5 years is 0.5; and the probability that it will work for more than 5 years is 0.2. An order of 8 such pumps has just been sent out to a customer: find the probability that two will work for fewer than 2 years, five will work for 2 to 5 years, and one will work for more than 5 years.
8.24 The following strategy was adopted in an attempt to determine the size, N, of the population of an almost extinct population of rare tigers in a remote forest in southeast Asia. At the beginning of the month, 50 tigers were selected from the population, tranquilized, tagged and released; assuming that a month is sufficient time for the tagged sample to become completely integrated with the entire population, at the end of the month, a random sample of n = 10 tigers was selected, two of which were found to have tags.
(i) What does this suggest as a reasonable estimate of N? Identify two key potential sources of error with this strategy.
(ii) If X is the random variable representing the total number of tagged tigers found in the sample of n taken at the end of the month, clearly, X is a hypergeometric random variable. However, given the comparatively large size we would expect of N (the unknown tiger population size), it is entirely reasonable to approximate X as a binomial random variable with a "probability of success" parameter p. Compute, for this (approximately) binomial random variable, the various probabilities that X = 2 out of the sampled n = 10 when p = 0.1, p = 0.2 and p = 0.3. What does this indicate to you about the more likely value of p for the tiger population?
(iii) In general, for the binomial random variable X in (ii) above, given data that x = 2 "successes" were observed in n = 10 trials, show that the probability that X = 2 is maximized if p = 0.2.
8.25 The number of contaminant particles (flaws) found on each standard size silicon wafer produced at a certain manufacturing site is a random variable, X, that a quality control engineer wishes to characterize. A sample of 30 silicon wafers was selected and examined for flaws; the result (the number of flaws found on each wafer) is displayed in the table below:
1  0  4  2  0  1  3  2  1
2  3  2  1  0  2  2  3  5
4  2  3  0  1  1  1  2  1
(i) From this data set, obtain an empirical frequency distribution function, fE(x), and compute E(X), the expected value of the number of flaws per wafer.
(ii) Justifying your choice adequately but succinctly, postulate an appropriate theoretical probability model for this random variable. Using the result obtained in (i) above for E(X), rounded to the nearest integer, compute from your theoretical model the probability that X = 0, 1, 2, 3, 4, 5 and 6, and compare these theoretical probabilities to the empirical ones from (i).
(iii) Wafers with more than 2 flaws cannot be sold to customers, resulting in lost revenue; and the manufacturing process ceases to be economically viable if more than 30% of the produced wafers fall into this category. From your theoretical model, determine whether or not the particular process giving rise to this data set is still economically viable.
8.26 An ensemble of ten identical pumps arranged in parallel is used to supply water to the cooling system of a large, exothermic batch reactor. The reactor (and hence the cooling system) is operated for precisely 8 hrs every day, and the data set shown in the table below is the total number of pumps functioning properly (out of the ten) on any particular 8-hr operating-day, for the entire month of June.

Generate a frequency table for this data set and plot the corresponding histogram. Postulate an appropriate probability model. Obtain a value for the average number of pumps (out of 10) that are functioning every day; use this value to obtain an estimate of the model parameters; and from this compute a theoretical pdf. Compare the theoretical pdf with the relative frequency distribution obtained from the data and comment on the adequacy of the model.
Day          Available    Day          Available    Day          Available
(in June)    Pumps        (in June)    Pumps        (in June)    Pumps
 1            9           11            6           21            8
 2           10           12            8           22            9
 3            9           13            9           23            4
 4            8           14            9           24            9
 5            7           15            8           25            9
 6            7           16            9           26           10
 7            9           17            7           27            8
 8            7           18            7           28            5
 9            7           19            7           29            8
10            8           20            5           30            8
8.27 In a study of the failure of pumps employed in the standby cooling systems of commercial nuclear power plants, Atwood (1986)² determined as 0.16 the probability that a pump selected at random in any of these power plants will fail. Consider a
system that employs 8 of these pumps but which, for full effectiveness, really requires only 4 to be functioning at any time.
(i) Determine the probability that this particular cooling system will function with full effectiveness.
(ii) If a warning alarm is set to go off when there are five or fewer pumps functioning at any particular time, what is the number of times this alarm is expected to go off in a month of 30 days? State any assumptions you may need to make.
(iii) If the probability of failure increases to 0.2 for each pump, what is the percentage increase in the probability that four or more pumps will fail?

²Atwood, C.L. (1986). "The binomial failure rate common cause model," Technometrics, 28, 139-148.
8.28 The table below contains data from Greenwood and Yule, 1920³, showing the frequency of accidents occurring, over a five-week period, to 647 women making high explosives during World War I.
Number of    Observed
Accidents    Frequency
0            447
1            132
2             42
3             21
4              3
5+             2
(i) For this clearly Poisson-like phenomenon, let X be the random variable representing the number of accidents. Determine the mean and the variance of X. What does this indicate about the possibility that this may in fact not be a true Poisson random variable?
(ii) Use the value computed for the data average as representative of λ for the Poisson distribution and obtain the theoretical Poisson model prediction of the frequency of occurrences. Compare with the observed frequency.
(iii) Now consider representing this phenomenon as a negative binomial random variable. Determine k and p from the computed data average and variance; obtain a theoretical prediction of the frequency of occurrence based on the negative binomial model and compare with the observed frequency.
(iv) To determine, objectively, which probability model provides a better fit to this data set, let fi° represent the observed frequency associated with the ith group, and let φi represent the corresponding theoretical (expected) frequency. For each model, compute the index,

C² = Σ(i=1 to m) (fi° − φi)²/φi    (8.100)

(For reasons discussed in Chapter 17, it is recommended that group frequencies should not be smaller than 5; as such, the last two groups should be lumped into one group, x ≥ 4.) What do the results of this computation suggest about which model provides a better fit to the data? Explain.

³Greenwood, M. and Yule, G.U. (1920). "An enquiry into the nature of frequency distributions representative of multiple happenings with particular reference of multiple attacks of disease or of repeated accidents," Journal of the Royal Statistical Society, 83, 255-279.
8.29 Sickle-cell anemia, a serious condition in which the body makes sickle-shaped
red blood cells, is an inherited disease. People with the disease inherit two copies of the sickle cell gene, one from each parent. On the other hand, those who inherit only one sickle cell gene from one parent and a normal gene from the other parent have a condition called "sickle cell trait"; such people are sometimes called "carriers" because while they do not have the disease, they nevertheless "carry" one of the genes that cause it and can pass this gene to their children.

Theoretically, if two sickle-cell "carriers" marry, the probability of producing an offspring with the sickle-cell disease is 0.25, while the probability of producing offspring who are themselves carriers is 0.5; the probability of producing children with a full complement of normal genes is 0.25.
(i) If a married couple who are both carriers have four children, what is the joint probability distribution of the number of children with sickle cell anemia and the number of carriers?
(ii) From this joint probability distribution, determine the probabilities of having (a) no children with the disease and 2 carriers; (b) 1 child with the disease and 2 carriers; (c) two children with the disease and 2 carriers.
(iii) On the condition that exactly one of the 4 children is a carrier, determine the probability of having (a) no children with the disease; (b) 1 child with the disease; (c) 2 children with the disease; and (d) 3 children with the disease.
8.30 Revisit Application Problem 8.29 above and now consider that a married couple, both of whom are carriers, lives in a country where there is no health care coverage, so that each family must cover its own health care costs. The couple, not knowing that they are both carriers, proceed to have 8 children. A child with the sickle-cell disease will periodically experience episodes called "crises" that will require hospitalization and medication for the symptoms (there are no cures yet for the disease). Suppose that it costs the equivalent of US$2,000 a year in general hospital costs and medication to treat a child with the disease.
(i) What annual sickle-cell disease-related medical cost can this family expect to incur?
(ii) If these "crisis" episodes occur infrequently at an average rate of 1.5 per year in this country (3 every two years), and these occurrences are well-modeled as a Poisson-distributed random variable, what is the probability that this family will have to endure a total of 3 crisis episodes in one year? Note that only a child with the disease can have a crisis episode. (Hint: See Exercise 8.21.)
8.31 When a rare respiratory disease with a long incubation period infects a population of people, there is only a probability of 1/3 that an infected patient will show the symptoms within the first month. When five such symptomatic patients showed up in the only hospital in a small town, the astute doctor who treated these patients knew immediately that more will be coming in the next few months as the remaining infected members of the population begin to show symptoms. Assume that all symptomatic patients will eventually come to this one hospital.
(i) Postulate an appropriate probability model and use it to determine the most likely number of infected but not yet symptomatic patients, where x* is considered the "most likely" number if P(X = x*) is the highest for all possible values of x.
(ii) Because of the nature of the disease, if a total of more than 15 people are infected, the small town will have to declare a state of emergency. What is the probability of this happening?
Jan 28, 1986 incident, discuss which estimate of the probability of the catastrophe's occurrence is more believable.
8.35 A study in Kalbfleisch et al., 1991⁴, reported that the number of warranty claims for one particular system on a particular automobile model within a year of purchase is well-modeled as a Poisson distributed random variable, X, with an average rate of λ = 0.75 claims per car.
(i) Determine the probability that there are two or fewer warranty claims on this specific system for a car selected at random.
(ii) Consider a company that uses the warranty claims on the various systems of the car within the first year of purchase to rate cars for their "initial quality." This company wishes to use the Kalbfleisch et al. results to set an upper limit, x_u, on the number of warranty claims whereby a car is declared of poor initial quality if the number of claims equals or exceeds this number. Determine the value of x_u such that, given λ = 0.75, the probability of purchasing a car which, by pure chance alone, will generate more than x_u warranty claims, is 0.05 or less.
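For readers who wish to verify their hand computations, the following minimal sketch (assuming Python with scipy, which the text itself does not use) evaluates both quantities:

from scipy.stats import poisson

lam = 0.75  # average claims per car, from the problem statement

# (i) P(X <= 2): probability of two or fewer warranty claims
print(f"P(X <= 2) = {poisson.cdf(2, lam):.4f}")

# (ii) smallest x_u such that P(X > x_u) <= 0.05
x_u = 0
while poisson.sf(x_u, lam) > 0.05:  # sf(x) = P(X > x)
    x_u += 1
print(f"x_u = {x_u}, P(X > {x_u}) = {poisson.sf(x_u, lam):.4f}")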
8.36 In the 1940s, the entomologist S. Corbet catalogued the butterflies of Malaya and obtained the data summarized in the table below. The data table shows x, a count of the number of species, x = 1, 2, ..., 24, and the associated actual number of butterflies caught in light-traps in Malaya that have x number of species. For example, there were 118 single-species butterflies; 74 two-species butterflies (for a total of 148 of such butterflies), etc., and the last entry indicates that there were 3 categories of 24-species butterflies. Corbet later approached the celebrated R. A. Fisher for assistance in analyzing the data. The result, presented in Fisher et al., 1943⁵, is a record of how the logarithmic series distribution (see Exercise 8.13) was developed as a model for describing species abundance.
Given the characteristics of the logarithmic series distribution in Exercise 8.13, obtain an average for the number of species, x̄, and use this to obtain a value for the parameter p.
⁴Kalbfleisch, J.D., Lawless, J.F., and Robinson, J.A. (1991). Methods for the analysis and prediction of warranty claims. Technometrics, 33, 273-285.
⁵Fisher, R. A., S. Corbet, and C. B. Williams. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology, 1943: 42-58.
x   Observed Frequency      x   Observed Frequency      x   Observed Frequency
1   118                     9   20                     17   9
2   74                     10   15                     18   6
3   44                     11   12                     19   10
4   24                     12   14                     20   10
5   29                     13   6                      21   11
6   22                     14   12                     22   5
7   20                     15   6                      23   3
8   19                     16   9                      24   3
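As a numerical sketch of the estimation step (assuming Python with numpy/scipy, and using the standard form of the logarithmic series mean from Exercise 8.13, E(X) = −p/[(1 − p) ln(1 − p)]), one might proceed as follows:

import numpy as np
from scipy.optimize import brentq

x = np.arange(1, 25)
n_x = np.array([118, 74, 44, 24, 29, 22, 20, 19, 20, 15, 12, 14,
                6, 12, 6, 9, 9, 6, 10, 10, 11, 5, 3, 3])

xbar = (x * n_x).sum() / n_x.sum()  # sample mean, about 6.60

# mean of the logarithmic series distribution as a function of p
mean_ls = lambda p: -p / ((1 - p) * np.log(1 - p))

# solve mean_ls(p) = xbar for p on (0, 1)
p_hat = brentq(lambda p: mean_ls(p) - xbar, 1e-9, 1 - 1e-9)
print(f"xbar = {xbar:.3f}; p = {p_hat:.4f}")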
Chapter 9
Ideal Models of Continuous Random Variables
To facilitate the discussion and also promote fundamental understanding, these random variables and their pdfs will be presented in families: cohort groups that share common structural characteristics. That the starting point of the derivations for the most basic of these families of continuous random variables is a discrete random variable may be somewhat surprising at first, but this is merely indicative of the sort of intriguing connections (some obvious, others not) between these random variables, both continuous and discrete. A chart included at the end of the chapter summarizes these connections and places in context how all the random variables discussed in these two chapters are related to one another.
9.1 The Gamma Family of Random Variables

Our discussion of continuous random variables and their probability models begins with the Gamma family whose 4 distinct members, from the simplest (in terms of underlying phenomena) to the most complex, are:

1. The Exponential random variable;
2. The Gamma random variable;
3. The Chi-square random variable;
4. The Weibull random variable.
These random variables are grouped together because they share many common structural characteristics, the most basic being non-negativity: they all take values restricted to the positive real line, i.e. 0 < x < ∞. Not surprisingly, they all find application in problems involving intrinsically non-negative entities. Specifically, 3 of the 4 (Exponential, Gamma, and Weibull) frequently find application in system reliability and lifetime studies, which involve waiting times until some sort of failure. Structurally, these three are much closer together than the fourth, the Chi-square, which finds application predominantly in problems involving a different class of non-negative variables: mostly squared variables, including variances. Its membership in the family is by virtue of being a highly specialized (and somewhat unusual) case of the gamma random variable.
9.1.1 The Exponential Random Variable

Let T be the waiting time until the first occurrence of Poisson events occurring at a constant mean rate λ. Since T exceeds t if and only if no event occurs in the interval (0, t], we have:

P[T > t] = P[Y(t) < 1]   (9.1)

where Y(t), the total number of events occurring in (0, t], is a Poisson random variable with pdf

f(y) = (λt)^y e^{−λt} / y!; y = 0, 1, 2, ...,   (9.2)

and since P[Y(t) < 1] = P[Y(t) = 0], we obtain from Eq (9.2) that the expression in Eq (9.1) immediately becomes:

P[T > t] = e^{−λt}, or, in terms of the cumulative distribution function F_T(t) of T:

1 − F_T(t) = e^{−λt}   (9.3)
so that

F_T(t) = 1 − e^{−λt}   (9.4)

Upon differentiating once with respect to t, we obtain:

f(t) = λe^{−λt}; 0 < t < ∞   (9.5)

the pdf for the exponential random variable, T, the waiting time until the first occurrence of Poisson events occurring at a constant mean rate, λ. This result generalizes straightforwardly from time to spatial intervals, areas, volumes, etc.

The Model and Some Remarks
In general, the expression

f(x) = λe^{−λx}; 0 < x < ∞   (9.6)

is the pdf of an exponential random variable, E(λ); in terms of the alternative parameter β = 1/λ, it may also be written as:

f(x) = (1/β) e^{−x/β}; 0 < x < ∞   (9.7)
FIGURE 9.1: Exponential pdfs, f(x), for various values of the characteristic parameter β.

Among the important mathematical characteristics of the E(β) random variable are the moment generating and characteristic functions:

M(t) = 1/(1 − βt)   (9.8)

φ(t) = 1/(1 − jβt)   (9.9)
Applications
The exponential pdf, not surprisingly, finds application in problems involving waiting times to the occurrence of simple events. As noted earlier, it is most recognizable to chemical engineers as the theoretical residence time distribution function for ideal CSTRs; it also provides a good model for the distribution of time intervals between arrivals at a post office counter, or between phone calls at a customer service center.
Since equipment (or system) reliability and lifetimes can be regarded as waiting times until failure of some sort or another, it is also not surprising that the exponential pdf is utilized extensively in reliability and life testing studies. The exponential pdf has been used to model lifetimes of simple devices and of individual components of more complex ones. In this regard, it is important to pay attention to the last characteristic shown above: the constant hazard function, h(x). Recall from Chapter 4 that the hazard function allows one to compute the probability of surviving beyond time t, given survival up to time t. The constant hazard function indicates that for the exponential random variable, the risk of future failure is independent of current time. The exponential pdf is therefore known as a "memoryless" distribution, the only distribution with this characteristic.
Example 9.1 WAITING TIME AT SERVICE STATION
The total number of trucks arriving at an all-night service station over the 10:00 pm to 6:00 am night shift is known to be a Poisson random variable with an average hourly arrival rate of 5 trucks/hour. The waiting time between successive arrivals (idle time for the service station workers) therefore has the exponential distribution:

f(x) = (1/β) e^{−x/β}; 0 < x < ∞   (9.10)

where β = 1/5 hours. If the probability is exactly 0.5 that the waiting time between successive arrivals is less than m hours, find the value of m.
Solution:
The problem statement translates to:

P[X < m] = ∫₀^m 5e^{−5x} dx = 0.5   (9.11)

or,

[−e^{−5x}]₀^m = 0.5   (9.12)

1 − e^{−5m} = 0.5   (9.13)

so that:

m = −(ln 0.5)/5 = 0.139   (9.14)

Note, of course, that by definition, m is the median of the given pdf. The practical implication of this result is that, in half of the arrivals at the service station during this night shift, the waiting (idle) time in between arrivals (on average) will be less than m = 0.139 hours; the waiting time will be longer than m for the other half of the arrivals. Such a result can be used in practice in many different ways: for example, the owner of the service station may use this to decide when to hire extra help for the shift, say when the median idle time exceeds a predetermined threshold.
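A quick numerical check of this example (assuming Python with scipy, purely as a sketch) recovers the same median:

import numpy as np
from scipy.stats import expon

beta = 1.0 / 5.0                      # scale parameter, hours per arrival
m = expon.ppf(0.5, scale=beta)        # value of m with F(m) = 0.5
print(f"m = {m:.4f} hours")           # ln(2)/5 = 0.1386, about 8.3 minutes
print(np.isclose(m, np.log(2) / 5))   # True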
9.1.2 The Gamma Random Variable

Consider next T, the waiting time until the kth occurrence of Poisson events occurring at a constant mean rate λ. Since T exceeds t if and only if fewer than k events occur in the interval (0, t], we have:

P[T > t] = Σ_{y=0}^{k−1} (λt)^y e^{−λt}/y! = [1/(k − 1)!] ∫_{λt}^{∞} e^{−z} z^{k−1} dz   (9.15)

This result may be obtained (see Exercise 9.6) by first defining:

I(k) = ∫_{a}^{∞} e^{−z} z^{k−1} dz   (9.18)

and establishing, via repeated integration by parts, that for a = λt this integral equals (k − 1)! times the indicated sum.
And now, if F_T(t) is the cumulative distribution function of the random variable, T, then P[T > t] in Eq (9.15) may be rewritten as

P[T > t] = 1 − F_T(t)   (9.21)

so that the complete expression in Eq (9.15) becomes

1 − F_T(t) = [1/(k − 1)!] ∫_{λt}^{∞} e^{−z} z^{k−1} dz   (9.22)
Upon differentiating with respect to t, using Leibnitz's formula for differentiating under the integral sign, i.e.,

(d/dx) ∫_{A(x)}^{B(x)} f(x, r) dr = ∫_{A(x)}^{B(x)} ∂f(x, r)/∂x dr + f(x, B) dB/dx − f(x, A) dA/dx   (9.23)

we obtain:

f(t) = [1/Γ(k)] λ e^{−λt} (λt)^{k−1}   (9.24)

where Γ(k) is the Gamma function (to be defined shortly), and we have used the fact that for integer k,

(k − 1)! = Γ(k)   (9.25)

Thus,

f(t) = [λ^k/Γ(k)] e^{−λt} t^{k−1}; 0 < t < ∞   (9.26)

is the pdf for the waiting time to the kth occurrence of independent Poisson events occurring at an average rate λ; it is a particular case of the pdf for a gamma random variable, generalized as follows.
The Model and Some Remarks
The pdf for the gamma random variable, X, is given in general by

f(x) = [1/(β^α Γ(α))] e^{−x/β} x^{α−1}; 0 < x < ∞   (9.27)

where Γ(α) is the Gamma function, defined by:

Γ(α) = ∫₀^∞ e^{−y} y^{α−1} dy   (9.28)

A change of variables shows that

∫₀^∞ e^{−x/β} x^{α−1} dx = β^α Γ(α)   (9.29)

so that

∫₀^∞ f(x) dx = 1   (9.30)

indicating that the function being integrated on the RHS is a density function. Note also from Eq (9.28), that via integration by parts, one can establish the following well-known recursion property of the gamma function:

Γ(α + 1) = αΓ(α)   (9.31)
FIGURE 9.2: Gamma pdfs for various values of parameter α and β: Note how with increasing values of α the shape becomes less skewed, and how the breadth of the distribution increases with increasing values of β.

Important Mathematical Characteristics
The following are some key mathematical characteristics of the γ(α, β) random variable and its pdf:
1. Characteristic parameters: α > 0; β > 0
α, the shape parameter, determines the overall shape of the distribution (how skewed or symmetric, peaked or flat);
β, the scale parameter, determines how wide the distribution is, with larger values of β corresponding to wider distributions (See Fig 9.2).
2. Mean: μ = E(X) = αβ.
Other measures of central location: Mode = β(α − 1); α ≥ 1;
Median: No closed-form analytical expression.
3. Variance: σ²(X) = αβ²
4. Higher Moments: Coefficient of Skewness: γ₃ = 2α^{−1/2};
Coefficient of Kurtosis: γ₄ = 3 + 6/α,
implying that the distribution is positively skewed but becomes less so with increasing α, and sharply peaked, approaching the normal reference kurtosis value of 3 as α → ∞. (See Fig 9.2).
5. Moment generating and Characteristic functions:

M(t) = 1/(1 − βt)^α   (9.33)

φ(t) = 1/(1 − jβt)^α   (9.34)
6. Survival function (for integer α):

S(x) = e^{−x/β} Σ_{i=0}^{α−1} (x/β)^i / i!   (9.35)

Hazard function:

h(x) = (x/β)^{α−1} / [βΓ(α) Σ_{i=0}^{α−1} (x/β)^i / i!]   (9.36)
7. Relation to the exponential random variable: If Y_i is an exponential random variable with characteristic parameter β, i.e. Y_i ~ E(β), then the random variable X defined as follows:

X = Σ_{i=1}^{α} Y_i   (9.37)

is a gamma random variable with (integer) shape parameter α and scale parameter β. More generally, if X_i ~ γ(α_i, β) are n independent gamma random variables, then

Y = Σ_{i=1}^{n} X_i   (9.38)

is also a gamma random variable, with shape parameter α* = Σ_{i=1}^{n} α_i and scale parameter β, i.e. Y ~ γ(α*, β). Thus, a sum of gamma random variables with identical scale parameters begets another gamma random variable with the same scale parameter, hence the term "reproductive." (Recall Example 6.6 in Chapter 6.) Furthermore, the random variable Z defined as

Z = c Σ_{i=1}^{n} X_i   (9.39)

where c is a constant, is also a gamma random variable with shape parameter α* = Σ_{i=1}^{n} α_i but with scale parameter cβ, i.e. Z ~ γ(α*, cβ). (See Exercise 9.9.)
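A brief simulation sketch of this reproductive property (assuming Python with numpy/scipy; the parameter values are arbitrary illustrations) compares a sum of exponentials against the corresponding gamma distribution:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, beta = 4, 2.0

# 50,000 realizations of X = Y1 + ... + Y_alpha, with Yi ~ E(beta)
x = rng.exponential(scale=beta, size=(50_000, alpha)).sum(axis=1)

print(f"sample mean {x.mean():.2f} vs alpha*beta   = {alpha * beta}")
print(f"sample var  {x.var():.2f} vs alpha*beta^2 = {alpha * beta**2}")

# Kolmogorov-Smirnov comparison against the gamma(alpha, beta) distribution
stat, pval = stats.kstest(x, "gamma", args=(alpha, 0, beta))
print(f"KS statistic = {stat:.4f}, p-value = {pval:.3f}")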
Applications
The gamma pdf finds application in problems involving system time-to-failure when system failure occurs as a result of independent subsystem failures, each occurring at a constant rate 1/β. A standard example is the so-called standby redundant system consisting of n components where system function requires only one component, with the others as backup; when one component fails, another takes over automatically. Complete system failure therefore does not occur until all n components have failed. For similar reasons, the gamma pdf is also used to study and analyze time between maintenance operations. Because of its flexible shape, the gamma distribution is frequently considered in modeling engineering data of general non-negative phenomena.
As an extension of the application of the exponential distribution in residence time distribution studies in single ideal CSTRs, the gamma distribution may be used for the residence time distribution in several identical CSTRs in series.
The gamma pdf is also used in experimental and theoretical neurobiology, especially in studies involving action potentials, the spike trains generated by neurons as a result of nerve-cell activity. These spike trains and the dynamic processes that cause them are random; and it is known that the distribution of interspike intervals (ISI), the elapsed time between the appearance of two consecutive spikes in the spike train, encodes information about synaptic mechanisms¹. Because action potentials are generated as a result of a sequence of physiological events, the ISI distribution is often well-modeled by the gamma pdf².
Finally, the gamma distribution has been used recently to model the distribution of distances between DNA replication origins in cells. Fig 9.3, adapted from Chapter 7 of Birtwistle (2008)³, shows a γ(5.05, 8.14) distribution fit to data reported in Patel et al., (2006)⁴, on inter-origin distances in the budding yeast S. cerevisiae. Note the excellent agreement between the gamma distribution model and the experimental data.
¹Braitenberg, 1965: What can be learned from spike interval histograms about synaptic mechanism? J. Theor. Biol. 8, 419-425.
²H.C. Tuckwell, 1988, Introduction to Theoretical Neurobiology, Vol 2: Nonlinear and Stochastic Theories, Chapter 9, Cambridge University Press.
³Birtwistle, M. R. (2008). Modeling and Analysis of the ErbB Signaling Network: From Single Cells to Tumorigenesis, PhD Dissertation, University of Delaware.
⁴Patel, P. K., Arcangioli, B., Baker, S. P., Bensimon, A. and Rhind, N. (2006). DNA replication origins fire stochastically in fission yeast. Mol Biol Cell 17, 308-316.

FIGURE 9.3: Gamma distribution fit to inter-origin distance data: frequency (%) versus inter-origin distance (kb).

Example 9.2 REPLACING AUTOMOBILE TIMING BELTS
Automobile manufacturers specify in their maintenance manuals the recommended mileage at which various components are to be replaced. One such component, the timing belt, must be replaced before it breaks, since a broken timing belt renders the automobile entirely inoperative. An extensive experimental study carried out before the official launch of a new model of a certain automobile concluded that X,
the lifetimes of the automobile's timing belts (in 10,000 driven miles), is well-modeled by the following pdf:

f(x) = [1/Γ(10)] e^{−x} x⁹; 0 < x < ∞   (9.40)

a gamma pdf with α = 10 and β = 1, whose mode is located at

x* = β(α − 1) = 9   (9.41)

with the implication that the timing belt should be changed at or before 90,000 driven miles.
For this specific problem, E(X) = αβ = 10, indicating that the expected value is 100,000 miles, a value that is longer than the value at the distribution's mode. It appears therefore as if choosing x = 90,000 miles as the recommended mileage for the timing belt replacement is a safe and reasonable, if conservative choice. However, the breakage of a timing belt while the car is in operation on the road is highly undesirable.
9.1.3 The Chi-Square Random Variable

A special, and somewhat unusual, case of the gamma random variable arises when α = r/2 (for positive integer r) and β = 2:

f(x) = [1/(2^{r/2} Γ(r/2))] e^{−x/2} x^{r/2−1}; 0 < x < ∞   (9.42)

This is the pdf for a χ²(r) random variable with r degrees of freedom. In particular, when r = 1, the resulting random variable, χ²(1), has the pdf:

f(x) = [1/(2^{1/2} Γ(1/2))] e^{−x/2} x^{−1/2}; 0 < x < ∞   (9.43)

Its key characteristics include the mean, μ = r, the variance, σ² = 2r, and the moment generating and characteristic functions:

M(t) = 1/(1 − 2t)^{r/2}   (9.44)

φ(t) = 1/(1 − j2t)^{r/2}   (9.45)
9.1.4 The Weibull Random Variable

To set the stage for the definition of the Weibull random variable, and a derivation of its pdf, let us return briefly to the discussion of the exponential random variable and recall its hazard function, h(t) = λ, a constant. According to the definition given in Chapter 4, the corresponding cumulative hazard function (chf), H(t), is given by:

H(t) = λt   (9.46)

Consider now the more general cumulative hazard function obtained by raising this linear chf to an arbitrary power β:

H(t) = (λt)^β   (9.47)

For the exponential random variable (β = 1),

λ = h(t) = (d/dt)H(t) = λ(λt)^{(1−1)}   (9.48)

so that the generalization amounts to replacing the zero exponent, (1 − 1), with (β − 1) for arbitrary β > 0 in Eq (9.48). Then, either by recalling the relationship between the cumulative hazard function and the cumulative distribution function of a random variable, i.e.

F_T(t) = 1 − e^{−H(t)}   (9.49)

or else, following the derivation given above for the exponential random variable, we obtain the cumulative distribution function for this random variable, T, as:

F_T(t) = 1 − e^{−(λt)^β}   (9.50)

Upon differentiating once, we obtain:

f(t) = βλ(λt)^{β−1} e^{−(λt)^β}; 0 < t < ∞   (9.51)
(9.51)
This is the pdf of a Weibull random variable, named for the Swedish scientist,
Waloddi Weibull (1887-1979), who derived and introduced this distribution
in a 1939 publication on the analysis of the breaking strength of materials5 .
It is important now to note that there is something of an empirical feel
to the Weibull distribution in the sense that if one can conceive of any other
reasonable hazard function, h(t), with the corresponding chf, H(t), such a
random variable will have a pdf given by
f (t) = h(t)eH(t)
(9.52)
There is nothing particularly phenomenological about the specic cumulative hazard function, H(t), introduced in Eq (9.47) which eventually led to the
Weibull distribution; it merely makes the simple linear chf H(t) = t of the
exponential random variable more complex by raising it to a generic power
an additional parameter that is to be chosen to t observations. Unlike the
parameter which has a phenomenological basis, there is no such basis for .
We shall have cause to return to this point a bit later.
The Model and Some Remarks
The pdf for the Weibull random variable, X, is given in general by:
f (x) = (x)1 e(x) ; 0 < x <
(9.53)
f (x) =
1
x
e(x/) ; 0 < x <
(9.54)
(9.55)
5 An earlier independent derivation due to Fisher and Tippet (1928) was unknown to
the engineering community until long after Weibulls work had become widely known and
adopted.
FIGURE 9.4: Weibull pdfs for various values of parameter ζ and β: Note how with increasing values of β the shape becomes less skewed, and how the breadth of the distribution increases with increasing values of ζ.
where

η = (1/ζ)^β = λ^β   (9.56)

We prefer the form given in Eq (9.54). First, ζ is the more natural parameter (as we show shortly); second, its role in determining the characteristics of the pdf is distinguishable from that of the second parameter β in Eq (9.54), whereas η in Eq (9.55) is a convolution of the two parameters.
In the special case where β = 1, the Weibull pdf, not surprisingly, reduces to the exponential pdf. In general, the Weibull random variable, W(ζ, β), is related to the exponential random variable as follows: if Y ~ E(ζ^β), i.e. an exponential random variable with scale parameter ζ^β, then

X = Y^{1/β}   (9.57)

is the Weibull random variable,

X ~ W(ζ, β)   (9.58)
The following are some key mathematical characteristics of the W(ζ, β) random variable and its pdf:
1. Characteristic parameters: ζ > 0; β > 0
ζ, as with all the other distributions in this family, is the scale parameter. In this case, it is also known as the "characteristic life" for reasons discussed shortly;
β is the shape parameter. (See Fig 9.4).
2. Mean: μ = ζΓ(1 + 1/β); Mode = ζ(1 − 1/β)^{1/β}; β ≥ 1; Median = ζ(ln 2)^{1/β}.
3. Variance: σ² = ζ²Γ(1 + 2/β) − [ζΓ(1 + 1/β)]²
4. Higher Moments: Closed form expressions for the coefficients of skewness and kurtosis are very complex, as are the expressions for the MGF and the Characteristic function.
5. Survival function:

S(x) = e^{−(x/ζ)^β}, or e^{−(λx)^β}   (9.59)

Hazard function:

h(x) = (β/ζ)(x/ζ)^{β−1}   (9.60)

or, equivalently, βλ(λx)^{β−1}   (9.61)
Applications
The Weibull distribution naturally finds application predominantly in reliability and life-testing studies. It is a very versatile pdf that provides a particularly good fit to time-to-failure data when the mean failure rate is time dependent. It is therefore utilized in problems involving lifetimes of complex electronic equipment and of biological organisms, as well as in characterizing failure in mechanical systems. While this pdf is sometimes used to describe the size distribution of particles generated by grinding or crushing operations, it should be clear from the derivation given in this section that such applications are not as natural as life-testing applications. When used for particle size characterization, the distribution is sometimes known as the Rosin-Rammler distribution.
An interesting characteristic of the Weibull distribution arises from the following result: when x = ζ,

P(X ≤ ζ) = 1 − e^{−1} = 0.632   (9.62)

so that, regardless of the value of the shape parameter β, the scale parameter ζ marks the point below which 63.2% of the population lies; hence the name "characteristic life."
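A quick sketch of this property (assuming Python with scipy; scipy's weibull_min takes the shape β as its first argument and ζ as scale) shows the 63.2% figure holding for any β:

from scipy.stats import weibull_min

zeta = 3.0  # arbitrary characteristic life
for beta in (0.5, 1.0, 2.0, 5.0):
    p = weibull_min.cdf(zeta, beta, scale=zeta)  # P(X <= zeta)
    print(f"beta = {beta}: P(X <= zeta) = {p:.3f}")  # 0.632 every time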
9.1.5 The Generalized Gamma Model

All four members of the Gamma family may be represented as special cases of a single generalized pdf,

f(x) = [λ/(βΓ(α))] ((x − γ)/β)^{λα−1} exp{−((x − γ)/β)^λ}; γ < x < ∞   (9.63)

from which each member is recovered as follows:
1. Exponential: α = λ = 1; γ = 0;
2. Gamma: λ = 1; γ = 0;
3. Chi-squared, χ²(r): α = r/2, β = 2, λ = 1; γ = 0;
4. Weibull: α = 1; γ = 0.
9.1.6
x e
; x = 0, 1, 2, . . . ,
x!
(9.64)
but this time, consider that the parameter is not constant, but is itself a
random variable. This will be the case, for example, if X represents the number
of automobile accidents reported to a company that insures a population of
clients for whom the propensity for accidents varies widely. The appropriate
model for the entire population of clients will then consist of a mixture of
Poisson random variables with dierent values of for the dierent subgroups
within the population. The two most important consequences of this problem
denition are as follows:
1. Both X and are random variables, and are characterized by a joint
pdf f (x, );
2. The pdf in Eq (9.64) must now properly be considered as the conditional
277
1
e/ 1 ; 0 < <
()
(9.65)
(9.66)
from where we may now obtain the desired marginal pdf f(x) by integrating out λ, i.e.,

f(x) = ∫₀^∞ f(x|λ)f(λ) dλ = ∫₀^∞ [λ^x e^{−λ}/x!] [1/(β^α Γ(α))] e^{−λ/β} λ^{α−1} dλ   (9.67)

which is easily rearranged to yield:

f(x) = [1/(x! β^α Γ(α))] ∫₀^∞ e^{−λ/β*} λ^{α*−1} dλ   (9.68)

where

1/β* = 1 + 1/β   (9.69)

α* = x + α   (9.70)

The reason for such a parameterization is that the integral becomes easy to determine by analogy with the gamma pdf, i.e.,

∫₀^∞ e^{−λ/β*} λ^{α*−1} dλ = (β*)^{α*} Γ(α*)   (9.71)

so that:

f(x) = (β*)^{α*} Γ(α*) / (x! β^α Γ(α))   (9.72)
If we now define

1 + β = 1/p; i.e. β = (1 − p)/p   (9.73)

then β* = β/(1 + β) = 1 − p, and Eq (9.72) becomes:

f(x) = [Γ(x + α)/(Γ(α) x!)] p^α (1 − p)^x   (9.74)

which, for integer α, is:

f(x) = [(x + α − 1)!/((α − 1)! x!)] p^α (1 − p)^x   (9.75)

the pdf of a negative binomial random variable, NBi(α, p), with mean

μ = α(1 − p)/p   (9.76)
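This mixture result is easy to confirm by simulation (a sketch assuming Python with numpy/scipy; parameter values are arbitrary): draw λ from a gamma(α, β) distribution, draw X given λ from Poisson(λ), and compare the frequencies with the negative binomial pmf with p = 1/(1 + β):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, beta = 3.0, 2.0
n = 100_000

lam = rng.gamma(shape=alpha, scale=beta, size=n)  # lambda ~ gamma(alpha, beta)
x = rng.poisson(lam)                              # X | lambda ~ Poisson(lambda)

p = 1.0 / (1.0 + beta)
for k in range(5):
    emp = (x == k).mean()
    thy = stats.nbinom.pmf(k, alpha, p)
    print(f"P(X={k}): simulated {emp:.4f}, negative binomial {thy:.4f}")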
9.2 The Gaussian Family of Random Variables
The Gaussian random variable, the first, and defining, member of this family, is unquestionably the most familiar of all random variables, finding broad application in a wide variety of problems, most notably in statistical inference. Unfortunately, by the same token, it also is one of the most misused, a point we shall return to later. The second, as its name (lognormal) suggests, is derivative of the first in the sense that a logarithmic transformation converts it to the first random variable. It also finds application in statistical inference, especially with such strictly non-negative phenomena as household income, home prices, organism sizes, molecular weight of polymer molecules, particle sizes in powders, crystals, granules and other particulate material, etc.: entities whose values can vary over several orders of magnitude. The last variable, perhaps the least familiar, is ideal for representing random deviations of hits from a target on a plane. It owes its membership in the family to this very phenomenological characteristic (random fluctuations around a target) that is shared in one form or another by other family members.
Once again, our presentation here centers around derivations of each random variable's probability model, to emphasize the phenomena underlying each one. A generalized model encompassing all family members is shown at the end to tie everything together.
9.2.1 The Gaussian (Normal) Random Variable

I: Limiting Case of the Binomial Random Variable
Consider the binomial random variable, X ~ Bi(n, p), with mean μ_x = np and variance σ_x² = np(1 − p) = npq, and define the random variable

Y = X − μ_x   (9.77)

the total number of successes in excess of the theoretical mean (i.e. Y represents the deviation of X from the theoretical mean value). Observe that Y is positive when the observed number of successes exceeds the mean number of successes, in which case X will lie to the right of μ_x on the real line. Conversely, a negative Y implies that there are fewer successes than the mean number of successes (or equivalently, that there are more failures than the mean number of failures), so that X will lie to the left of μ_x. When the number of successes matches the mean value precisely, Y = 0. If this deviation variable is scaled by the standard deviation σ_x, we obtain the standardized deviation variable

Z = (X − μ_x)/σ_x   (9.78)
The characteristic function of Z is

φ_Z(t) = e^{−jtμ_xΔx} (q + p e^{jtΔx})^n   (9.81)

where Δx = 1/σ_x, so that

ln φ_Z(t) = −jtμ_xΔx + n ln(1 + θ)   (9.82)

with

θ = p(e^{jtΔx} − 1) = p[jtΔx − t²Δx²/2 + O(Δx³)]   (9.83)

where O(Δx³) is a term that goes to zero faster than Δx³ as n → ∞ (recall that Δx = 1/σ_x, with σ_x² = npq). Expanding ln(1 + θ) = θ − θ²/2 + O(θ³) and collecting terms,

n ln(1 + θ) = jtnpΔx − (np − np²)t²Δx²/2 + nO(Δx³)   (9.84)

and since μ_xΔx = npΔx, while (np − np²)Δx² = σ_x²/σ_x² = 1 and nO(Δx³) → 0 as n → ∞   (9.85)

the entire Eq (9.82) becomes:

ln φ_Z(t) = −t²/2   (9.86)

or

φ_Z(t) = e^{−t²/2}   (9.87)

This is the characteristic function of the standard normal random variable; by the inversion formula,

f(z) = (1/2π) ∫_{−∞}^{∞} e^{−jtz} e^{−t²/2} dt = (1/√(2π)) e^{−z²/2}   (9.90)

and reverting from Z to X = σ_xZ + μ_x finally yields

f(x) = [1/(σ_x√(2π))] e^{−(x−μ_x)²/(2σ_x²)}   (9.91)

This is the pdf of a Gaussian random variable with mean value μ_x and standard deviation σ_x inherited from the original binomial random variable.
II: First-Principles (Random Motion in a Line)
Consider a particle moving randomly along the x-axis with motion governed by the following rules:
1. Each move involves taking a single step of fixed length, Δx, once every time interval Δt;
2. The step can be to the right (with probability p), or to the left (with probability q = 1 − p): for simplicity, this presentation will consider equal probabilities, p = q = 1/2; the more general derivation is a bit more complicated but the final result is the same.
We are interested in the probability of finding a particle m steps to the right of the starting point after making n independent moves, in the limit as n → ∞. Before engaging in the model derivation, the following are some important points about the integer m that will be useful later:
1. m can be negative or positive and is restricted to lie between −n and n;
2. If k > 0 is the total number of steps taken to the right (so that the total number of steps taken to the left is (n − k)), for the particle to reach a point m steps to the right implies that:

m = k − (n − k) = 2k − n

so that:

k = (n + m)/2   (9.92)
The probability, P(x, t + Δt), of finding the particle at position x at time t + Δt is given by:

P(x, t + Δt) = (1/2)P(x − Δx, t) + (1/2)P(x + Δx, t)   (9.93)

To reach point x at time (t + Δt), then at time t, one of two events must happen: (i) the particle must reach point (x − Δx) (with probability P(x − Δx, t)) and then take a step to the right (with probability 1/2); or (ii) the particle must reach point (x + Δx) (with probability P(x + Δx, t)), and then take a step to the left (with probability 1/2). This is what is represented by Eq (9.93); in the limit as n → ∞, it provides an expression for the pdf we wish to derive. The associated initial conditions are:

P(0, 0) = 1
P(x, 0) = 0; x ≠ 0   (9.94)

indicating that, since we began at the origin, the probability of finding this particle at the origin, at t = 0, is 1; and also that the probability of finding the particle, at this same time t = 0, at any other point x that is not the origin, is 0.
Now, as n → ∞, for t to remain fixed, Δt must tend to zero; similarly, as n → ∞, so must m → ∞, and for x to remain fixed, Δx must tend to zero as well. However, by definition, m < n, so that as both become large, n → ∞ faster than m, so that m/n → 0; i.e.

m/n = (x/Δx)(Δt/t) = (xΔt)/(tΔx) → 0   (9.95)

implying that Δx/Δt → ∞, but in such a way that

lim Δx²/Δt = fixed   (9.96)

Subtracting P(x, t) from both sides of Eq (9.93) and expanding each term in a Taylor series, the right hand side becomes

RHS = (Δx²/2) ∂²P(x, t)/∂x² + higher-order terms

so that, dividing through by Δt and taking limits, with

lim_{Δx→0, Δt→0} Δx²/(2Δt) = D ≠ 0   (9.99)
we obtain:

∂P(x, t)/∂t = D ∂²P(x, t)/∂x²   (9.100)

To convert from the probability P(x, t) to a probability density function f(x, t), define

f(x, t) = P(x, t)/(2Δx)   (9.101)

so that probabilities are recovered as sums of the form

P = Σ_{m=i}^{k} f(mΔx, t) 2Δx   (9.102)

(keeping in mind that the sum is for every other value of m from i to k, with both indices even if n is even, and odd if n is odd). Eq (9.100) now becomes

∂f(x, t)/∂t = D ∂²f(x, t)/∂x²   (9.103)

so that, as the reader may perhaps have suspected all along, f(x, t) is the required (time dependent) probability density function for the random phenomenon in question: random motion on a line.
We are now in a position to solve Eq (9.103), but the initial conditions for P(x, t) in Eq (9.94) must now be modified appropriately for f(x, t) as follows:

∫_{−∞}^{∞} f(x, t) dx = 1   (9.104)

lim_{t→0} f(x, t) = 0; x ≠ 0   (9.105)
The first implies that, at any point in time, the particle will, with certainty, be located somewhere on the x-axis; the second is the continuous equivalent of the initial condition in Eq (9.94). The solution to Eq (9.103) subject to these conditions is

f(x, t) = [1/√(4πDt)] e^{−x²/(4Dt)}   (9.106)

which, upon defining

b² = 2Dt   (9.107)

becomes

f(x) = [1/(b√(2π))] e^{−x²/(2b²)}   (9.108)

where the time argument t has been suppressed because it is no longer explicit (having been subsumed into the dispersion parameter b). Finally, for an arbitrary starting point a ≠ 0, the required pdf becomes:

f(x) = [1/(b√(2π))] e^{−(x−a)²/(2b²)}   (9.109)
III: The Herschel/Maxwell Model (Random Deviations from a Target on a Plane)

FIGURE 9.5: A point A = (x, y), at a distance r from the target O on a plane.

Consider a point A with coordinates (x, y), representing the random deviation of a hit from a target located at the origin, O, of the plane, with r the distance from the origin. In polar coordinates,

x = r cos θ   (9.110)
y = r sin θ   (9.111)
r² = x² + y²   (9.112)

If the deviations in the x and y directions are independent, each with pdf f(·), and the probability of finding A in any given small region depends only on the distance r and not on the angle θ, then

f(x)f(y) = f(r cos θ)f(r sin θ) = g(r)   (9.113)

a function of r alone. Differentiating with respect to θ, on which the product cannot depend, gives

0 = −f′(r cos θ)f(r sin θ) r sin θ + f(r cos θ)f′(r sin θ) r cos θ   (9.114)

which rearranges to

f′(x)/(x f(x)) = f′(y)/(y f(y))   (9.115)

It is now clear that the x and y contributions are entirely separated, since the LHS is a function of x alone and the RHS is a function of y alone; furthermore, these two will then be equal only if they are both equal to a constant, say c₁, i.e.

f′(x)/f(x) = c₁x   (9.116)

which integrates to give

ln f(x) = (1/2)c₁x² + c₂; or f(x) = ke^{(1/2)c₁x²}   (9.117)

For f(x) to be a valid pdf, vanishing as |x| → ∞, the constant c₁ must be negative; writing c₁ = −1/b², we obtain

f(x) = ke^{−x²/(2b²)}   (9.118)

and, by identical arguments,

f(y) = ke^{−y²/(2b²)}   (9.119)

so that

g(r) = k²e^{−r²/(2b²)}   (9.120)

In general, if the point O is not the origin but some other arbitrary point (a_x, a_y) in the plane, then Eq (9.118) becomes:

f(x) = ke^{−(x−a_x)²/(2b²)}   (9.121)

And now, since

∫_{−∞}^{∞} e^{−(x−a_x)²/(2b²)} dx = b√(2π)   (9.122)

then it follows that k = 1/(b√(2π)), so that the required pdf is given by:

f(x) = [1/(b√(2π))] e^{−(x−a_x)²/(2b²)}   (9.123)
The three approaches all lead to the same functional form. In general, the pdf of the Gaussian random variable, X ~ N(μ, σ²), is:

f(x) = [1/(σ√(2π))] e^{−(x−μ)²/(2σ²)}; −∞ < x < ∞   (9.124)

where the parameter μ is the mean, E(X), and σ² is the variance.

FIGURE 9.6: Gaussian pdfs for various values of parameter μ and σ: Note the symmetric shapes, how the center of the distribution is determined by μ, and how the shape becomes broader with increasing values of σ.

The moment generating and characteristic functions are:

M(t) = exp{μt + (1/2)σ²t²}   (9.126)

φ(t) = exp{jμt − (1/2)σ²t²}   (9.127)
Applications
The Gaussian random variable plays an important role in statistical inference where its applications are many and varied. While these applications
are discussed more fully in Part IV, we note here that they all involve computing probabilities using f (x), or else, given specied tail area probabilities,
using f (x) in reverse to nd the corresponding x values. Nowadays, the tasks
of carrying out such computations have almost completely been delegated to
computer programs; traditionally, practical applications required the use of
pre-computed tables of normal cumulative probability values. Because it is
impossible (and unrealistic) to generate tables for all conceivable values of
and , the traditional normal probability tables are based on the standard
normal random variable, Z, which, as we now discuss, makes it possible to
apply these tables for all possible values of and .
9.2.2 The Standard Normal Random Variable

If the random variable X possesses a N(μ, σ²) distribution, then the random variable defined as:

Z = (X − μ)/σ   (9.128)

has the pdf

f(z) = [1/√(2π)] e^{−z²/2}   (9.129)

Z is called a standard normal random variable; its mean value is 0, its standard deviation is 1, i.e. it has a N(0, 1) distribution. A special case of the general Gaussian random variable, its traditional utility derives from the fact that for any general Gaussian random variable X ~ N(μ, σ²),

P(a₁ < X < a₂) = P((a₁ − μ)/σ < Z < (a₂ − μ)/σ)   (9.130)

so that tables of N(0, 1) probability values for various values of z can be used to compute probabilities for any and all general N(μ, σ²) random variables.
The z-score of any particular value x_i of the general Gaussian random variable X ~ N(μ, σ²) is defined as

z_i = (x_i − μ)/σ   (9.131)
By the symmetry of f(z) about z = 0,

F_Z(−a) = 1 − F_Z(a)   (9.132)

where F_Z(a) is the cumulative probability defined in the usual manner as:

F_Z(a) = ∫_{−∞}^{a} f(z) dz   (9.133)

FIGURE 9.7: Symmetric tail area probabilities for the standard normal random variable with z = ±1.96 and F_Z(−1.96) = 0.025 = 1 − F_Z(1.96).

Figure 9.7 shows this for the specific case where a = 1.96, for which the tail areas are each 0.025.
This result has the implication that tables of tail area probabilities need only be made available for positive values of Z. The following example illustrates this point.
Example 9.3 POST-SECONDARY EXAMINATION TEST SCORES
A collection of all the test scores for a standardized, post-secondary examination administered in the 1970s across countries along the West African coast, is well-modeled as a random variable X ~ N(μ, σ²) with μ = 270 and σ = 26. If a score of 300 or higher is required for a pass-with-distinction grade, and a score between 260 and 300 is required for a merit-pass grade, what percentage of students will receive the distinction grade and what percentage will receive the merit grade?
Solution:
The problem requires computing the following probabilities: P(X ≥ 300) and P(260 < X < 300).

P(X ≥ 300) = 1 − P(X < 300) = 1 − P(Z < (300 − 270)/26) = 1 − F_Z(1.154)   (9.134)

indicating that the z-score for x = 300 is 1.154. From tables of cumulative probabilities for the standard normal random variable, we obtain F_Z(1.154) = 0.875 so that the required probability is given by

P(X ≥ 300) = 0.125   (9.135)

implying that 12.5% of the students will receive the distinction grade.
The second probability is obtained as:

P(260 < X < 300) = P((260 − 270)/26 < Z < (300 − 270)/26) = F_Z(30/26) − F_Z(−10/26)   (9.136)

And now, by symmetry, F_Z(−10/26) = 1 − F_Z(10/26) so that, from the cumulative probability tables, we now obtain:

P(260 < X < 300) = 0.875 − (1 − 0.650) = 0.525   (9.137)

with the implication that 52.5% of the students will receive the merit grade.
Of course, with the availability of such computer programs as MINITAB, it is possible to obtain the required probabilities directly without recourse to the Z probability tables. In this case, one simply obtains F_X(300) = 0.875 and F_X(260) = 0.35 from which the required probabilities and percentages are obtained straightforwardly.
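The same computation in a few lines (a sketch assuming Python with scipy rather than MINITAB):

from scipy.stats import norm

mu, sigma = 270, 26
p_distinction = norm.sf(300, loc=mu, scale=sigma)  # P(X >= 300), about 0.124
p_merit = norm.cdf(300, mu, sigma) - norm.cdf(260, mu, sigma)
print(f"distinction: {100 * p_distinction:.1f}%")  # the table value 0.875 rounds this to 12.5%
print(f"merit:       {100 * p_merit:.1f}%")        # about 52.5%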
A most important relationship connects the standard normal and chi-square random variables: if Z ~ N(0, 1), then Z² possesses a χ²(1) distribution; more generally, given n independent N(0, 1) random variables Z₁, Z₂, ..., Zₙ, the sum Σ_{i=1}^{n} Z_i² possesses a χ²(n) distribution (See Exercise 9.20). These results find extensive application in statistical inference.
9.2.3 The Lognormal Random Variable

Consider a random variable X that arises as the product of n independent random variables X_i, i.e.

X = Π_{i=1}^{n} X_i   (9.139)

Then

Y = ln X = Σ_{i=1}^{n} ln X_i = Σ_{i=1}^{n} Y_i   (9.140)

is a sum of random variables which, by the foregoing arguments, tends to a Gaussian random variable, Y ~ N(α, β²), with pdf

g(y) = [1/(β√(2π))] exp{−(y − α)²/(2β²)}   (9.141)

Using techniques discussed in Chapter 6, it can be shown (see Exercise 9.21) that from the variable transformation and its inverse,

Y = ln X   (9.142)
X = e^Y   (9.143)

with

dy/dx = 1/x   (9.144)

the pdf of X, the lognormal random variable L(α, β), is, in terms of the median m = e^α:

f(x) = [1/(xβ√(2π))] exp{−(ln(x/m))²/(2β²)}; 0 < x < ∞   (9.145)
The following are some key mathematical characteristics of the L(α, β) random variable and its pdf:
1. Characteristic parameters: α, the scale parameter, and β, the shape parameter; in terms of α = ln m, the pdf is

f(x) = [1/(xβ√(2π))] exp{−(ln x − α)²/(2β²)}; 0 < x < ∞   (9.146)

2. Mean: E(X) = exp(α + β²/2) = me^{β²/2} = m√w;
Mode = m/w = e^{(α−β²)};
Median = m = e^α;
where w = e^{β²}.
3. Variance: Var(X) = e^{(2α+β²)}(e^{β²} − 1) = m²w(w − 1)
Note that Var(X) = [E(X)]²(w − 1)
4. Higher Moments: Coefficient of Skewness: γ₃ = (w + 2)√(w − 1);
Coefficient of Kurtosis: γ₄ = w⁴ + 2w³ + 3w² − 3.
5. Moment generating and Characteristic functions: Even though all moments exist for the lognormal distribution, the MGF does not exist. The characteristic function exists but is quite complicated.
A plot of the lognormal distribution for various values of the shape parameter β, with the scale parameter fixed at α = 0, is shown in Fig 9.8. On the other hand, a plot for various values of α, with the shape parameter fixed at β = 1, is shown in Fig 9.9.
An important point that must not be missed here is as follows: whereas for the Gaussian distribution μ is a location parameter responsible for shifting the distribution, the corresponding parameter for the lognormal distribution, α, does not shift the distribution's location but rather scales its magnitude. This is consistent with the fact that the additive characteristics underlying the Gaussian distribution correspond to multiplicative characteristics in the lognormal distribution. Thus, while a change in μ shifts the location of the Gaussian distribution, a change in α scales (by multiplication) the lognormal distribution. The parameter α is a location parameter only for the distribution of ln X (which is Gaussian), not for the distribution of X.
A final point to note: while the most popular measure of central location, the mean E(X), depends on both α and β for the lognormal distribution, the median, on the other hand, m = e^α, depends only on α. This suggests that the median is a more natural indicator of central location for the lognormal random variable.
295
1.8
1.6
E
1.5
1
0.5
0.25
E
1.4
f(x)
1.2
1.0
0.8
E
E
0.6
0.4
0.2
0.0
E
0
10
FIGURE 9.8: Lognormal pdfs for scale parameter = 0 and various values of the
shape parameter . Note how the shape changes, becoming less skewed as becomes
smaller.
0.4
D
0.5
1
2
D
f(x)
0.3
0.2
D
0.1
D
0.0
10
20
30
x
40
50
60
FIGURE 9.9: Lognormal pdfs for shape parameter = 1 and various values of the
scale parameter . Note how the shape remains unchanged while the entire distribution
is scaled appropriately depending on the value of .
By the same token, a more natural measure of dispersion for this random variable is Var(X)/[E(X)]², the variance scaled by the square of the mean value, or, equivalently, the square of C_v, the coefficient of variation: from the expression given above for the variance, this quantity is (w − 1), depending only on w = e^{β²}.
Applications
From the preceding considerations regarding the genesis of the lognormal distribution, it is not surprising that the following phenomena generate random variables that are well-modeled with the lognormal pdf:
1. Size of particles obtained by breakage (grinding) or granulation of fine particles;
2. Size of living organisms, especially where growth depends on numerous factors proportional to instantaneous organism size;
3. Personal income, or net worth, or other such quantities for which the current observation is a random proportion of the previous value (e.g., closing price of stocks, or index options).
The lognormal distribution is therefore used for describing particle size distributions in mining and granulation processes as well as in atmospheric studies; for molecular weight distributions of polymers arising from complex reaction kinetics; and for distributions of incomes in a free market economy.
Because it is a skewed distribution just like the gamma distribution, the lognormal distribution is sometimes used to describe such phenomena as latent periods of infectious diseases, or age at the onset of such diseases as Alzheimer's or arthritis: phenomena that are more naturally described by the gamma density since they involve the time to the occurrence of events driven by multiple cumulative effectors.
Finally, in statistical inference applications, probabilities are traditionally computed for lognormal random variables using Normal probability tables. Observe that if X ~ L(α, β²), then:

P(a₁ < X < a₂) = P(ln a₁ < Y < ln a₂)   (9.147)

where Y ~ N(α, β²). Thus, using tables of standard normal cumulative probabilities, one is able to obtain from Eq (9.147) that:

P(a₁ < X < a₂) = P((ln a₁ − α)/β < Z < (ln a₂ − α)/β) = F_Z((ln a₂ − α)/β) − F_Z((ln a₁ − α)/β)   (9.148)
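The two routes agree, as this brief sketch shows (assuming Python with scipy; note that scipy's lognorm takes s = β as its shape and scale = e^α; the numbers here are arbitrary illustrations):

import numpy as np
from scipy.stats import lognorm, norm

alpha, beta = 2.0, 0.5
a1, a2 = 5.0, 15.0

# direct lognormal computation
direct = (lognorm.cdf(a2, s=beta, scale=np.exp(alpha))
          - lognorm.cdf(a1, s=beta, scale=np.exp(alpha)))

# via the standard normal, as in Eq (9.148)
via_z = (norm.cdf((np.log(a2) - alpha) / beta)
         - norm.cdf((np.log(a1) - alpha) / beta))

print(direct, via_z)  # identical values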
FIGURE 9.10: Particle size distribution for the granulation process product: a lognormal distribution with α = 6.8, β = 0.5. The shaded area corresponds to product meeting quality specifications, 350 < X < 1650 microns.

9.2.4 The Rayleigh Random Variable

The phenomenological description is exactly as in the Herschel/Maxwell model presented earlier, except that this time the random variable of interest is the radial distance itself,

X = r   (9.152)

Thus, from Eq (9.120), we obtain immediately that,

f(x) = (x/b²) e^{−x²/(2b²)}; x > 0; b > 0   (9.153)

This is the pdf for the Rayleigh random variable R(b²). It can be shown via methods presented in Chapter 6 that if Y₁ ~ N(0, b²) and Y₂ ~ N(0, b²), then

X = √(Y₁² + Y₂²)   (9.154)

is a Rayleigh random variable, R(b²).
Its key characteristics include: Mean: μ = b√(π/2); Mode = b; Variance: σ² = (2 − π/2)b²   (9.155)
The phenomenon of random deviations from a target on a plane is precisely the conceptual model used to derive the Rayleigh pdf; it shows how β = 2 in the Weibull distribution is structurally compatible with the phenomenon underlying the Rayleigh random variable.

Applications
The Rayleigh distribution finds application in military studies of battle damage assessment, especially in analyzing the distance of bomb hits from desired targets (not surprising given the discussion above). The distribution is also used in communication theory for modeling communication channels, and for characterizing satellite Synthetic Aperture Radar (SAR) data.
9.2.5 A Generalized Model

To conclude, we note that all three random variables discussed above can be represented as special cases of the random variable X with the following pdf:

f(x) = C₁(x) exp{−(C₂(x) − C₃)²/(2C₄)}   (9.156)

For the Gaussian random variable, C₁(x) = 1/(σ√(2π)), C₂(x) = x, C₃ = μ, and C₄ = σ²; for the lognormal, C₁(x) = 1/(xβ√(2π)), C₂(x) = ln x, C₃ = α, and C₄ = β²; and for the Rayleigh, C₁(x) = x/b², C₂(x) = x, C₃ = 0, and C₄ = b².
9.3 The Ratio Family of Random Variables

This grouping of random variables is far more diverse than either of the earlier two groupings. From the first two that are defined only on bounded regions of finite size on the real line, to the last two that are always symmetric and are defined on the entire real line, and the third one that is defined only on the semi-infinite positive real line, these random variables appear to have nothing in common. Nevertheless, all members of this group do in fact share a very important common characteristic: as the family name implies, they all arise as ratios composed of other (previously encountered) random variables.
Some of the most important random variables in statistical inference belong to this family; and one of the benefits of the upcoming discussion is that the ratios from which they arise provide immediate indications (and justification) of the role each random variable plays in statistical inference. In the interest of limiting the length of a discussion that is already quite long, we will simply state the results, suppressing the derivation details entirely, or else referring the reader to appropriate places in Chapter 6 where we had earlier provided such derivations in anticipation of these current discussions.
9.3.1 The Beta Random Variable

Consider two independent gamma random variables, Y₁ ~ γ(α, 1) and Y₂ ~ γ(β, 1), and the random variable defined as the proportion

X = Y₁/(Y₁ + Y₂)   (9.157)

Since the pdfs of Y₁ and Y₂ are, respectively,

f(y₁) = [1/Γ(α)] y₁^{α−1} e^{−y₁}; 0 < y₁ < ∞   (9.158)

f(y₂) = [1/Γ(β)] y₂^{β−1} e^{−y₂}; 0 < y₂ < ∞   (9.159)

it can be shown that the pdf of X is

f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1}(1 − x)^{β−1}; 0 < x < 1   (9.160)

This is the pdf of the Beta random variable, B(α, β):

f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1}(1 − x)^{β−1}; 0 < x < 1; α > 0; β > 0   (9.161)
The name arises from the relationship between the pdf above and the Beta function defined by:

Beta(α, β) = ∫₀¹ u^{α−1}(1 − u)^{β−1} du = Γ(α)Γ(β)/Γ(α + β)   (9.162)

This pdf in Eq (9.161) is the first continuous model to be restricted to a finite interval, in this case, x ∈ [0, 1]. This pdf is defined on this unit interval, but as we show later, it is possible to generalize it to a pdf on an arbitrary finite interval [θ₀, θ₁].
Important Mathematical Characteristics
The following are some key mathematical characteristics of the B(α, β) random variable and its pdf:
1. Characteristic parameters: α > 0 and β > 0 are both shape parameters. The pdf takes on a wide variety of shapes depending on the values of these two parameters. (See below.)
2. Mean: E(X) = α/(α + β);
Mode = (α − 1)/(α + β − 2) for α > 1, β > 1; otherwise no mode exists.
There is no closed form expression for the Median.
3. Variance: Var(X) = αβ/[(α + β)²(α + β + 1)]
4. Higher Moments:
Coefficient of Skewness:

γ₃ = [2(β − α)/(α + β + 2)] √(1/α + 1/β + 1/(αβ))

Coefficient of Kurtosis:

γ₄ = 3 + 6[(α − β)²(α + β + 1) − αβ(α + β + 2)] / [αβ(α + β + 2)(α + β + 3)]
FIGURE 9.11: Unimodal Beta pdfs when α > 1; β > 1 (e.g., combinations of α, β ∈ {1.5, 5}): Note the symmetric shape when α = β, and the skewness determined by the value of α relative to β.

FIGURE 9.12: U-shaped Beta pdfs when α ≤ 1; β ≤ 1 (e.g., combinations of α, β ∈ {0.2, 0.5, 0.8}).

FIGURE 9.13: Other shapes of the Beta pdfs (e.g., α, β ∈ {0.5, 1, 2}): It is J-shaped when (α − 1)(β − 1) < 0 and a straight line when α = 2; β = 1.
Applications
The Beta distribution naturally provides a good model for many random phenomena involving proportions. For example, it is used in Bayesian analysis (see later) for describing a-priori knowledge about p, the Binomial probability of success. Another practical application (mostly in quality assurance) arises from the following result:
Given n independent random observations, X₁, X₂, ..., Xₙ, from a phenomenon possessing an arbitrary pdf, rank the observations from the smallest to the largest as y₁, y₂, ..., yₙ (i.e. with y₁ as the smallest and yₙ as the largest). If y_r is the rth-smallest and y_{n−s+1} is the sth-largest, then regardless of the underlying pdf of the variable, X, the proportion of the population between y_r and y_{n−s+1} possesses a B(α, β) distribution with α = (n − s + 1) − r, and β = r + s, i.e.

f(x; n, r, s) = [Γ(n + 1)/(Γ(n − r − s + 1)Γ(r + s))] x^{n−r−s}(1 − x)^{r+s−1}   (9.163)

This important result frees one from making any assumptions about the underlying pdf of the population from which the original quality data X₁, X₂, ..., Xₙ came.
The example below illustrates yet another application of the Beta distribution, in functional genomics studies.
Example 9.5 DIFFERENTIAL GENE EXPRESSION STATUS FROM MICROARRAYS
In functional genomics, one of the objectives is to provide a quantitative (as opposed to qualitative) understanding of the functions of genes and how they regulate the function of complex living organisms. The advent of the high-throughput microarray technology has made it possible to collect expression data on every gene in a cell simultaneously. Such microarray data, usually presented in the form of fluorescence signal intensities measured from each spot i on a microarray, yield an ordered pair (y_{i1}, y_{i0}), with y_{i1} coming from the gene in question under test conditions (e.g. from a cancer cell), and y_{i0} under control conditions (e.g. normal cell). In theory, if the gene is up-regulated under test conditions, FC = y_{i1}/y_{i0} > 1, FC < 1 if down-regulated, and FC = 1 if non-differentially regulated. This quantity, FC, is the so-called fold-change associated with this gene under the test conditions. However, because of measurement noise and other myriad technical considerations, this ratio is difficult to characterize statistically and not terribly reliable by itself.
It has been suggested in Gelmi⁶ to use the fractional intensity x_i defined by:

x_i = y_{i1}/(y_{i1} + y_{i0})   (9.164)

⁶Gelmi, C. A. (2006). A novel probabilistic framework for microarray data analysis: From fundamental probability models to experimental validation, PhD Thesis, University of Delaware.
FIGURE 9.14: Theoretical distribution for characterizing fractional microarray intensities for the example gene: The shaded area corresponds to the probability that the gene in question is upregulated.
9.3.2 Extensions and Special Cases of the Beta Random Variable

Generalized Beta Random Variable
The Beta pdf may be generalized from the unit interval to an arbitrary finite interval [θ₀, θ₁] as follows:

f(x) = [Γ(α + β)/(Γ(α)Γ(β))] [1/(θ₁ − θ₀)] ((x − θ₀)/(θ₁ − θ₀))^{α−1} (1 − (x − θ₀)/(θ₁ − θ₀))^{β−1}; θ₀ < x < θ₁; α > 0; β > 0   (9.165)

Inverted Beta Random Variable
If Y₁ ~ γ(α, 1) and Y₂ ~ γ(β, 1) are two independent gamma random variables, then the ratio

X = Y₁/Y₂   (9.167)

is an inverted Beta random variable, with pdf

f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1}/(1 + x)^{α+β}; x > 0; α > 0; β > 0   (9.168)
9.3.3 The (Continuous) Uniform Random Variable

The pdf

f(x) = 1; 0 < x < 1   (9.169)

is a pdf of a random variable that is uniform on the interval (0, 1). The general uniform random variable, X, defined on the interval (a, b), has the pdf:

f(x) = 1/(b − a); a < x < b   (9.170)

It is called a uniform random variable, U(a, b), for the obvious reason that, unlike all the other distributions discussed thus far, the probability description for this random variable is completely uniform, favoring no particular value in the specified range. The uniform random variable on the unit interval (0, 1) is known as a standard uniform random variable.
Important Mathematical Characteristics
The following are some key mathematical characteristics of the U(a, b) random variable and its pdf:
1. Characteristic parameters: a, b jointly form the range, with a as the location parameter (see Fig 9.15). Narrower distributions are taller than wider ones because the total probability area must equal 1 in every case.
2. Mean: E(X) = (a + b)/2;
Median = Mean;
Mode: non-unique; all values in interval (a, b);
3. Variance: Var(X) = (b − a)²/12
4. Moment generating and Characteristic functions:

M(t) = (e^{bt} − e^{at})/(t(b − a))

φ(t) = (e^{jbt} − e^{jat})/(jt(b − a))
309
a b
0 1
2 10
1.0
0.8
U(0,1)
f(x)
0.6
0.4
0.2
0.0
U(2,10)
10
FIGURE 9.15: Two uniform distributions over dierent ranges (0,1) and (2,10). Since
the total area under the pdf must be 1, the narrower pdf is proportionately longer than
the wider one.
Of the direct relationships between the U(a, b) random variable and other random variables, the most important are: (i) If X is a U(0, 1) random variable, then Y = −β ln X is an exponential random variable, E(β). This result was established in Example 6.2 in Chapter 6. (ii) If X ~ U(0, 1), then Y = 1 − X^{1/β} is a Beta random variable, B(1, β).

Applications
The uniform pdf is the obvious choice for describing equiprobable events on bounded regions of the real line. (The discrete version is used for equiprobable discrete events in a sample space.) But perhaps its most significant application is for generating random numbers for other distributions.
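Relation (i) above, read in reverse, is the basis of inverse-transform sampling; a minimal sketch (assuming Python with numpy) generates E(β) samples from U(0, 1) draws:

import numpy as np

rng = np.random.default_rng(42)
beta = 2.0

u = rng.uniform(0.0, 1.0, size=100_000)  # X ~ U(0, 1)
y = -beta * np.log(u)                    # Y = -beta ln X ~ E(beta)

print(f"sample mean {y.mean():.3f} (theory: {beta})")
print(f"sample var  {y.var():.3f} (theory: {beta**2})")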
9.3.4 Fisher's F Random Variable

If Y₁ ~ χ²(ν₁) and Y₂ ~ χ²(ν₂) are two independent chi-square random variables, then the ratio of the two, each scaled by its respective degrees of freedom,

X = (Y₁/ν₁)/(Y₂/ν₂)

is a Fisher F random variable, F(ν₁, ν₂), with pdf

f(x) = [Γ((ν₁ + ν₂)/2)/(Γ(ν₁/2)Γ(ν₂/2))] (ν₁/ν₂)^{ν₁/2} x^{(ν₁−2)/2} [1 + (ν₁/ν₂)x]^{−(ν₁+ν₂)/2}; 0 < x < ∞   (9.172)

Important Mathematical Characteristics
1. Characteristic parameters: ν₁ and ν₂, the degrees of freedom.
2. Mean: E(X) = ν₂/(ν₂ − 2); ν₂ > 2
Mode = [ν₂(ν₁ − 2)]/[ν₁(ν₂ + 2)]; ν₁ > 2
3. Variance:

Var(X) = σ² = 2ν₂²(ν₁ + ν₂ − 2)/[ν₁(ν₂ − 2)²(ν₂ − 4)]; ν₂ > 4   (9.173)

Expressions for Skewness and Kurtosis are a bit complicated; the MGF does not exist and the expression for the CF is complicated.
Figure 9.16 shows two F distributions for the same value of ν₂ = 15 but different values of ν₁.
The F distribution is related to two additional pdfs as follows: If X has an F distribution with ν₁, ν₂ degrees of freedom, then

Y = (ν₁/ν₂)X / [1 + (ν₁/ν₂)X]   (9.174)

possesses a Beta B(ν₁/2, ν₂/2) distribution.
FIGURE 9.16: Two F distribution plots for different values of ν₁, the first degree of freedom (ν₁ = 5, 10), but the same value ν₂ = 15. Note how the mode shifts to the right as ν₁ increases.
Applications
The F distribution is used extensively in statistical inference to make probability statements about the ratio of variances of random variables, providing the basis for the F-test. It is the theoretical probability tool for ANOVA (Analysis of Variance). Values of P(X ≤ x) are traditionally tabulated for various values of ν₁, ν₂ and selected values of x, and referred to as F-tables. Once again, computer programs have made such tables somewhat obsolete.
As shown in Part IV, the F distribution is one of the central "quartet" of pdfs at the core of statistical inference, the other three being the Gaussian (Normal) N(μ, σ²), the Chi-square χ²(r), and the Student t-distribution discussed next.
9.3.5 Student's t Random Variable

If Z ~ N(0, 1) is a standard normal random variable, and Y ~ χ²(ν) an independent chi-square random variable with ν degrees of freedom, then the ratio

X = Z/√(Y/ν)   (9.176)

is a Student's t random variable, t(ν). It can be shown that its pdf is:

f(x) = [Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))] (1 + x²/ν)^{−(ν+1)/2}; −∞ < x < ∞   (9.177)

FIGURE 9.17: Three t-distribution plots for degrees of freedom values ν = 5, 10, 100. Note the symmetrical shape and the heavier tail for smaller values of ν.
FIGURE 9.18: A comparison of the t-distribution with ν = 5 with the standard normal N(0, 1) distribution. Note the similarity as well as the t-distribution's comparatively heavier tail.

FIGURE 9.19: A comparison of the t-distribution with ν = 50 with the standard normal N(0, 1) distribution; the two are now practically indistinguishable.
9.3.6 The Cauchy Random Variable

If Y₁ and Y₂ are two independent standard normal N(0, 1) random variables, then the ratio

X = Y₁/Y₂   (9.178)

is a standard Cauchy random variable, C(0, 1), with pdf

f(x) = (1/π) · 1/(1 + x²); −∞ < x < ∞   (9.180)
FIGURE 9.20: A comparison of the standard Cauchy distribution with the standard normal N(0, 1) distribution. Note the general similarities as well as the Cauchy distribution's substantially heavier tail.

This is the expression for the standard Cauchy distribution, an expression that was derived in Example 6.9 in Chapter 6.
The pdf for the general Cauchy random variable, C(μ, σ²), is:

f(x) = [1/(πσ)] · 1/[1 + ((x − μ)/σ)²]; −∞ < x < ∞; σ > 0   (9.181)

In a manner somewhat reminiscent of the t-distribution, the Cauchy distribution is also symmetric (about μ in the general case, about 0 in the standard case), but with much heavier tails than the normal distribution. (See Fig 9.20.)
Of the relationships between the Cauchy random variable and other random variables, the most notable are: (i) By its very composition as a ratio of zero mean Gaussian random variables, if X is a Cauchy random variable, its reciprocal 1/X is also a Cauchy random variable; (ii) The standard Cauchy random variable C(0, 1) is a special (pathological) case of the t distribution with ν = 1 degrees of freedom. When we discuss the statistical implication of degrees of freedom in Part IV, it will become clearer why ν = 1 is a pathological case.
Applications
The Cauchy distribution is used mostly to represent otherwise symmetric phenomena where the occurrences of extreme values that are significantly far from the central values are not so rare. The most common application is in modeling high-resolution rates of price fluctuations in financial markets. Such data tend to have heavier tails and are hence poorly modeled by Gaussian distributions.
It is not difficult to see, from the genesis of the Cauchy random variable as a ratio of two Gaussians, why such applications are structurally reasonable. Price fluctuation rates are approximated as a ratio of ΔP, the change in the price of a unit of goods, and Δt, the time interval over which the price change has been computed. Both are independent random variables (prices may remain steady for a long time and then change rapidly over short periods of time) that tend, under normal elastic market conditions, to fluctuate around some mean value. Hence, ΔP/Δt will appear as a ratio of two Gaussian random variables and will naturally follow a Cauchy distribution.
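The heavy tails have a striking practical consequence: the Cauchy distribution has no mean, so sample averages never settle down. A brief simulation sketch (assuming Python with numpy) makes the point:

import numpy as np

rng = np.random.default_rng(7)
n = 100_000
x = rng.standard_normal(n) / rng.standard_normal(n)  # Y1/Y2 ~ C(0, 1)

# running sample means fail to converge, unlike for Gaussian samples
running_mean = np.cumsum(x) / np.arange(1, n + 1)
for k in (100, 1_000, 10_000, 100_000):
    print(f"mean of first {k:>6} samples: {running_mean[k - 1]: .3f}")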
9.4 Summary and Conclusions

Using precisely the same techniques and principles employed in the previous chapter, we have turned our attention in this chapter to the complementary task of model development for continuous random variables, with the discrete Poisson random variable serving as the connecting bridge between the two chapters. Our emphasis has been on the fundamentals of how each probability model arises for these continuous random variables, with the derivation details presented explicitly in many cases, or else left as exercises.
By design, we have encountered these continuous random variables and their models in family groups whose members share certain common structural traits: first, the Gamma family of strictly positive random variables, typically used to represent phenomena involving intervals of time and space (length, area, volume), or, as with the Chi-square random variable, squared and other variance-related phenomena; next, the Gaussian family, with functional forms indicative of squared deviations from a target; and finally, the Ratio family, whose members arise as ratios of other random variables.
REVIEW QUESTIONS
1. What are the four members of the Gamma family of random variables discussed in this chapter?
2. What are the common structural characteristics shared by the members of the Gamma family of random variables?
TABLE 9.1: Summary of the continuous probability models of Chapter 9

Exponential, E(β). Range: (0, ∞). Model: f(x) = (1/β)e^{−x/β}. Mean: β. Variance: β². Relations: inter-Poisson-event intervals; X_i ~ E(β) implies Σ_{i=1}^{α} X_i ~ γ(α, β).

Gamma (Erlang), γ(α, β). Range: (0, ∞). Model: f(x) = [1/(β^αΓ(α))] e^{−x/β} x^{α−1}. Mean: αβ. Variance: αβ². Relations: χ²(r) = γ(r/2, 2).

Chi-square, χ²(r). Range: (0, ∞). Model: f(x) = [1/(2^{r/2}Γ(r/2))] e^{−x/2} x^{r/2−1}. Mean: r. Variance: 2r. Relations: X_i ~ N(0, 1) implies Σ_{i=1}^{n} X_i² ~ χ²(n).

Weibull, W(ζ, β). Range: (0, ∞). Model: f(x) = (β/ζ)(x/ζ)^{β−1} e^{−(x/ζ)^β}. Mean: ζΓ(1 + 1/β). Variance: ζ²Γ(1 + 2/β) − μ². Relations: Y ~ E(ζ^β) implies Y^{1/β} ~ W(ζ, β).

Gaussian (Normal), N(μ, σ²). Range: (−∞, ∞). Model: f(x) = [1/(σ√(2π))] exp{−(x − μ)²/(2σ²)}. Mean: μ. Variance: σ². Relations: lim_{n→∞} Bi(n, p); lim_{ν→∞} t(ν) = N(0, 1).

Lognormal, L(α, β). Range: (0, ∞). Model: f(x) = [1/(xβ√(2π))] exp{−(ln x − α)²/(2β²)}. Mean: e^{α+β²/2}. Variance: e^{2α+β²}(e^{β²} − 1). Relations: Y ~ N(α, β²) implies X = e^Y ~ L(α, β).

Rayleigh, R(b²). Range: (0, ∞). Model: f(x) = (x/b²) e^{−x²/(2b²)}. Mean: b√(π/2). Variance: (2 − π/2)b². Relations: Y₁, Y₂ ~ N(0, b²) implies √(Y₁² + Y₂²) ~ R(b²).

Beta, B(α, β). Range: (0, 1). Model: f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1}(1 − x)^{β−1}. Mean: α/(α + β). Variance: αβ/[(α + β)²(α + β + 1)]. Relations: X₁ ~ γ(α, 1), X₂ ~ γ(β, 1) implies X₁/(X₁ + X₂) ~ B(α, β).

Uniform, U(a, b). Range: (a, b). Model: f(x) = 1/(b − a). Mean: (a + b)/2. Variance: (b − a)²/12. Relations: B(α = β = 1) = U(0, 1).

Fisher, F(ν₁, ν₂). Range: (0, ∞). Model: see Eq (9.172). Mean: ν₂/(ν₂ − 2). Variance: see Eq (9.173). Relations: Y₁ ~ χ²(ν₁), Y₂ ~ χ²(ν₂) implies (Y₁/ν₁)/(Y₂/ν₂) ~ F(ν₁, ν₂).

Student's t, t(ν). Range: (−∞, ∞). Model: see Eq (9.177). Mean: 0 (Median = 0). Variance: ν/(ν − 2). Relations: Z ~ N(0, 1), Y ~ χ²(ν) implies Z/√(Y/ν) ~ t(ν); lim_{ν→∞} t(ν) = N(0, 1).

Cauchy, C(0, 1). Range: (−∞, ∞). Model: f(x) = 1/[π(1 + x²)]. Mean: N/A. Variance: N/A. Relations: Y₁, Y₂ ~ N(0, 1) implies Y₁/Y₂ ~ C(0, 1).
[Chart: Connections among the random variables of Chapters 8 and 9. DISCRETE: Uniform (Discrete), Hypergeometric, Bernoulli, Binomial [Multinomial], Negative Binomial (Pascal), Geometric, Poisson. CONTINUOUS GAMMA FAMILY: Exponential, Gamma (Erlang), Chi-Sq(1), Chi-Sq(r), Weibull. CONTINUOUS GAUSSIAN FAMILY: Gaussian (Normal), Lognormal, Rayleigh. CONTINUOUS RATIO FAMILY: Beta, Uniform (Continuous), Fisher's F, Student's t, Cauchy.]
22. What are the basic characteristics of the Weibull random variable?
23. What is the probability model for the Weibull random variable?
24. Why is the Weibull pdf parameter ζ known as the "characteristic life"?
25. The Weibull pdf finds application in what class of problems?
26. What mixture pdf arises from a Poisson pdf whose parameter λ is gamma distributed?
27. What are the three members of the Gaussian family of random variables discussed in this chapter?
28. What are the common structural characteristics shared by the members of the
Gaussian family of random variables?
29. What are the three approaches used in this chapter to derive the probability
model for the Gaussian random variable?
30. What are the basic characteristics of the Gaussian random variable?
31. What is the probability model for the Gaussian random variable and what do
the parameters represent?
32. In what broad area does the Gaussian pdf play an important role?
33. What is the probability model for the standard normal random variable? How
is the standard normal random variable related to the Gaussian random variable?
34. What is the z-score of any particular value xᵢ of the general Gaussian random variable with mean μ and variance σ²? How is it useful for computing probabilities for general Gaussian distributions?
35. What are the basic characteristics of the lognormal random variable?
36. How is the lognormal random variable related to the Gaussian (normal) random
variable?
37. What is the probability model for the lognormal random variable?
38. What trap is to be avoided in interpreting what the parameters of the lognormal
distribution represent?
39. What is the difference between the σ parameter for the normal distribution and the corresponding β parameter for the lognormal distribution in terms of how changes in each parameter modify the distribution it characterizes?
60. What are the four central pdfs used most extensively in statistical inference?
61. Student's t random variable is composed as a ratio of which random variables?
62. What is the relationship between Student's t distribution and the standard normal distribution?
63. What is the t distribution used for most extensively?
64. The Cauchy random variable is composed as a ratio of which random variables?
65. What is the probability model for the Cauchy random variable?
66. How many moments exist for the Cauchy distribution?
67. The Cauchy distribution is used mostly for what?
EXERCISES
Section 9.1
9.1 (i) On the same graph, plot the pdf for the discrete geometric G(0.25) and the continuous exponential E(4) distributions. Repeat this for the following additional pairs of distributions: G(0.8) and E(1.25); and G(0.5) and E(2).
(ii) These plots show specific cases in which the pdf of the geometric random variable G(p) is seen to be a discretized version of the continuous pdf for the exponential random variable E(1/p), and vice-versa: that the E(β) distribution is a continuous version of the discrete G(1/β) distribution. First, show that for the geometric random
variable, the following relationship holds:
    [f(x + 1) − f(x)]/f(x) = −p

which is a finite difference discretization of the expression,

    df(x)/f(x) = −p dx.

From here, establish the general result that the pdf of a geometric random variable G(p) is the discretized version of the pdf of the exponential random variable, E(1/p).
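A brief computed picture of part (i) may help; the sketch below is illustrative (it assumes the parameterizations G(p) and E(β) with β = 1/p used in the exercise):

import numpy as np
from scipy import stats

p = 0.25
x = np.arange(1, 25)
pmf = stats.geom.pmf(x, p)               # G(0.25)
pdf = stats.expon.pdf(x, scale=1.0 / p)  # E(4)

# Side by side, the discrete pmf tracks the continuous pdf closely.
for xi, g, e in zip(x[:6], pmf[:6], pdf[:6]):
    print(f"x={xi}: geometric={g:.4f}, exponential={e:.4f}")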
9.2 Establish that the median of the exponential random variable, E(β), is β ln 2, and that its hazard function is the constant

    h(t) = 1/β = η
9.3 Given two independent random variables, X₁ and X₂, with identical exponential E(β) distributions, show that the pdf of their difference,

    Y = X₁ − X₂    (9.182)

is given by:

    f(y) = (η/2) e^(−η|y|); −∞ < y < ∞    (9.183)

where η = 1/β.
9.4 Revisit Exercise 9.3. Directly from the pdf in Eq (9.183), and the formal definitions of moments of a random variable, obtain the mean and variance of Y. Next, obtain the mean and variance from Eq (9.182) by using the result that because the two random variables are independent,

    E(Y) = E(X₁) − E(X₂)
    Var(Y) = Var(X₁) + Var(X₂)
9.5 Given that X ~ E(1), i.e., an exponentially distributed random variable with parameter β = 1, determine the following probabilities:
(i) P(X − μ_X ≥ 3σ_X), where μ_X is the mean value, and σ_X is the standard deviation, the positive square root of the variance, σ²_X.
(ii) P(μ_X − 2σ_X < X < μ_X + 2σ_X)
9.6 (i) Establish the result in Eq (9.17).
(ii) From the definition of the gamma function:

    Γ(α) = ∫_0^∞ e^(−y) y^(α−1) dy    (9.184)

show that the gamma pdf,

    f(x) = (1/(β^α Γ(α))) e^(−x/β) x^(α−1); x > 0; α, β > 0

is a valid pdf in the sense that it integrates to 1.
9.9 If X₁, X₂, . . . , Xₙ are n independent gamma random variables, with Xᵢ ~ γ(αᵢ, β), show that Y = Σ_{i=1}^n Xᵢ is also a gamma random variable, with shape parameter α* = Σ_{i=1}^n αᵢ and scale parameter β; show further that Z = cY is a gamma random variable with the same shape parameter α* = Σ_{i=1}^n αᵢ but with scale parameter cβ, i.e., Z ~ γ(α*, cβ).
9.10 The distribution of residence times in a standard size continuous stirred tank reactor (CSTR) is known to be exponential, E(1). If X is the residence time for a reactor that is five times the standard size, then its distribution is also known to be E(0.2). On the other hand, Y, the residence time in an ensemble of five identical, standard size CSTRs in series, is known to be gamma distributed with α = 5; β = 1.
(i) Plot the pdf f(x) for the single large CSTR's residence time distribution and the pdf f(y) for the ensemble of five identical small reactors in series. Determine the mean residence time in each case.
(ii) Compute P(Y ≤ 5) and compare with P(X ≤ 5).
9.11 Given that X ~ γ(α, β), show that the pdf for Y, the Inverse Gamma, IG, random variable defined by Y = 1/X is given by:

    f(y) = ((1/β)^α / Γ(α)) e^(−(1/β)/y) y^(−(α+1)); 0 < y < ∞    (9.185)

Determine the mean, mode and variance for this random variable.
9.12 Establish the following results: (i) if Y ~ E(β), then

    X = Y^(1/ζ)

is a W(ζ, β^(1/ζ)) random variable; and (ii) conversely, that if X ~ W(ζ, β) then

    Y = X^ζ

is an E(β^ζ) random variable.
9.13 A fluidized bed reactor through which chlorine gas flows has a temperature probe that fails periodically due to corrosion. The length of time (in days) during which the temperature probe functions is known to be a Weibull distributed random variable X, with parameters β = 10; ζ = 2.
(i) Determine the number of days each probe is expected to last.
(ii) If the reactor operator wishes to run a product campaign that lasts continuously
for 20 days, determine the probability of running this campaign without having to
replace the temperature probe.
(iii) What is the probability that the probe will function for anywhere between 10
and 15 days?
(iv) What is the probability that the probe will fail on or before the 10th day?
9.14 Suppose that the time-to-failure (in minutes) of certain electronic device components, when subjected to continuous vibrations, may be considered as a random variable having a Weibull (ζ, β) distribution with ζ = 1/2 and β = 1/10: first find how long we may expect such a component to last, and then find the probability that such a component will fail in less than 5 hours.
Section 9.2
9.15 Given two random variables X and Z related according to Eq (9.78), i.e.,

    Z = (X − μ_x)/σ_x,

determine the pdf for the random variable X and hence confirm Eq (9.91).
9.16 Given a Gaussian distributed random variable, X, with μ = 100; σ = 10, determine the following probabilities:
(i) P(μ − 1.96σ < X < μ + 1.96σ) and P(μ − 3σ < X < μ + 3σ)
(ii) P(X > 123) and P(74.2 < X < 126)
9.17 Given Z, a standard normal random variable, determine the specific variate z₀ that satisfies each of the following probability statements:
(i) P(Z ≥ z₀) = 0.05; P(Z ≥ z₀) = 0.025
(ii) P(Z ≤ z₀) = 0.025; P(Z ≥ z₀) = 0.10; P(Z ≤ z₀) = 0.10
(iii) P(|Z| ≥ z₀) = 0.00135
9.18 Given Z, a standard normal random variable, determine the following probabilities:
(i) P(−1.96 < Z < 1.96) and P(−1.645 < Z < 1.645)
(ii) P(−2 < Z < 2) and P(−3 < Z < 3)
(iii) P(|Z| ≤ 1)
9.19 Consider the random variable X with the following pdf:

    f(x) = e^(−x)/(1 + e^(−x))²    (9.186)
9.20 Show that if X₁, X₂, . . . , Xₙ are n independent standard normal random variables, then

    Y = Σ_{i=1}^n Xᵢ²

possesses a χ²(n) distribution.
9.21 Show that if the random variable Y has a normal N(μ, σ²) distribution, then the random variable X defined as

    X = e^Y

has a lognormal distribution, with parameters α and β, as shown in Eq (9.143). Obtain an explicit relationship between (α, β) and (μ, σ²).
9.22 Revisit Exercise 9.21 and establish the reciprocal result that if the random variable X has a lognormal L(α, β) distribution, then the random variable Y defined as

    Y = ln X

has a normal N(μ, σ²) distribution.
9.23 Given a lognormal distributed random variable X with parameters α = 0; β = 0.2, determine its mean, μ_X, and variance, σ²_X; on the same graph, plot the pdf, f(x), and that for the Gaussian random variable with the same mean and variance as X. Compare the two plots.
9.24 Revisit Exercise 9.23. Compute P(μ_X − 1.96σ_X < X < μ_X + 1.96σ_X) from the lognormal distribution. Had this random variable been mistakenly assumed to be Gaussian with the same mean and variance, compute the same probability and compare the results.
9.25 Show that if Y₁ ~ N(0, b²) and Y₂ ~ N(0, b²) are independent, then

    X = √(Y₁² + Y₂²)    (9.187)

is a Rayleigh random variable, R(b²).
9.27 Confirm that if a random variable X has a Beta B(α, β) distribution, the mode of the pdf occurs at:

    x* = (α − 1)/(α + β − 2)    (9.188)

and hence deduce that (a) no mode exists when 0 < α < 1 and α + β > 2; and (b) that when a mode exists, this mode and the mean will coincide if, and only if, α = β. (You may simply recall the expression for the mean given in the text; you need not rederive it.)
9.28 The Beta-Binomial mixture distribution arises from a Binomial Bi(n, p) random variable, X, whose parameter p, rather than being constant, has a Beta distribution, i.e., it consists of a conditional distribution,

    f(x|p) = (n choose x) p^x (1 − p)^(n−x)

in conjunction with the marginal distribution for p,

    f(p) = (Γ(α + β)/(Γ(α)Γ(β))) p^(α−1) (1 − p)^(β−1); 0 < p < 1; α > 0; β > 0

Obtain f(x), the resulting Beta-Binomial mixture distribution.
9.30 For a random variable X that is uniformly distributed over the interval (a, b):
(i) Determine P(X > [αa + (1 − α)b]); 0 < α < 1;
(ii) For the specific case where a = 1, b = 3, determine P(μ_X − 2σ_X < X < μ_X + 2σ_X) where μ_X is the mean of the random variable, and σ_X is the positive square root of its variance.
(iii) Again for the specific case where a = 1, b = 3, find the symmetric interval (ξ₁, ξ₂) around the mean, μ_X, such that P(ξ₁ < X < ξ₂) = 0.95
9.31 Consider the random variable, X, with pdf:

    f(x) = (ζ − 1) x^(−ζ); x ≥ 1; ζ > 2

known as a Pareto random variable.
(i) Show that for this random variable,

    E(X) = (ζ − 1)/(ζ − 2); ζ > 2    (9.189)
APPLICATION PROBLEMS
9.37 The waiting time in days between the arrival of tornadoes in a county in south
central United States is known to be an exponentially distributed random variable
whose mean value remains constant throughout the year. Given that the probability
is 0.6 that more than 30 days will elapse between tornadoes, determine the expected
number of tornadoes in the next 90 days.
9.38 The time-to-failure, T, of an electronic component is known to be an exponentially distributed random variable with pdf

    f(t) = η e^(−ηt); 0 < t < ∞
    f(t) = 0; elsewhere    (9.190)

where the failure rate, η = 0.075 per 100 hours of operation.
(i) If the component reliability function Ri(t) is defined as
Ri (t) = P (T > t)
(9.191)
i.e., the probability that the component functions at least up until time t, obtain an
explicit expression for Ri (t) for this electronic component.
(ii) A system consisting of two such components in parallel functions if at least one of them functions. Again assuming that both components are identical, find the system reliability Rp(t) and compute Rp(1000), the probability that the system survives at least 1,000 hours of operation.
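A possible computational sketch for this problem is shown below (it assumes, as the problem implies, that the two components fail independently, so that Rp(t) = 1 − [1 − Ri(t)]²):

import math

eta = 0.075 / 100.0   # failure rate per hour (0.075 per 100 hours)
Ri = lambda t: math.exp(-eta * t)          # single-component reliability
Rp = lambda t: 1.0 - (1.0 - Ri(t)) ** 2    # parallel pair: at least one survives

print(Ri(1000.0))   # single-component survival past 1,000 hours (~0.47)
print(Rp(1000.0))   # parallel-system survival past 1,000 hours (~0.72)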
9.39 Life-testing results on a first generation microprocessor-based (computer-controlled) toaster indicate that X, the life-span (in years) of the central control chip, is a random variable that is reasonably well-modeled by the exponential pdf:

    f(x) = η e^(−ηx); x > 0    (9.192)
Relative frequency, f_r(x): 0.00, 0.02, 0.20, 0.32, 0.16, 0.08, 0.11, 0.03, 0.02, 0.01, 0.00, 0.01
(i) Determine the mean (average) and variance of the CHO cells' inter-origin distance.
(ii) If this is a gamma distributed random variable, use the results in (i) to provide reasonable values for the gamma distribution parameters. On the same graph, plot the frequency data and the gamma model fit. Comment on the model fit to the data.
7 Li, F., Chen, J., Solessio, E. and Gilbert, D. M. (2003). Spatial distribution and specification of mammalian replication origins during G1 phase. J Cell Biol 161, 257-66.
8 M. R. Birtwistle, (2008). Modeling and Analysis of the ErbB Signaling Network: From
Single Cells to Tumorigenesis, PhD Dissertation, University of Delaware.
(iii) It is known that DNA synthesis is initiated at replication origins, which are distributed non-uniformly throughout the genome, at an average rate of r origins per kb. However, in some mammalian cells, because there is a non-zero probability that any particular replication origin will not fire, some potential origins are skipped over so that in effect, k of such skips must take place (on average) before DNA synthesis can begin. What do the values estimated for the gamma distribution imply about the physiological parameters r and k?
9.41 The storage time (in months) until a collection of long-life Li/SO₄ batteries become unusable was modeled in Morris (1987)⁹ as a Weibull distributed random variable with ζ = 2 and β = 10. Let us refer to this variable as the battery's maximum storage life, MSL.
(i) What is the most likely value of the MSL? (By definition, the "most likely" value is that value of the random variable for which the pdf attains a maximum.)
(ii) What is the median MSL? By how much does it differ from the expected MSL?
(iii) What is the probability that a battery has an MSL value exceeding 18 months?
9.42 A brilliant paediatrician has such excellent diagnostic skills that without resorting to expensive and sometimes painful tests, she rarely misdiagnoses what ails her patients. Her overall average misdiagnosis rate of 1 per 1,000 consultations is all the more remarkable given that many of her patients are often too young to describe their symptoms adequately, when they can describe them at all; the paediatrician must therefore often depend on indirect information extracted from the parents and guardians during clinic visits. Because of her other responsibilities in the clinic, she must limit her patient load to precisely 10 per day for 325 days a year. While the total number of her misdiagnoses is clearly a Poisson random variable, the Poisson parameter λ = ηt is not constant because of the variability in her patient population, both in age and in the ability of parents and guardians to communicate effectively on behalf of their non-verbal charges. If λ has a gamma distribution with α = 13, β = 0.25,
(i) Determine the probability that the paediatrician records exactly 3 misdiagnoses in a year; determine also the probability of recording 3 or fewer misdiagnoses.
(ii) Compare these probabilities with the corresponding ones computed using a standard Poisson model with a constant parameter.
9.43 The year-end bonuses of cash, stock and stock options (in thousands of US dollars) given to senior technical researchers in a leading chemical manufacturing company, each year over a five-year period from 1990-1994, has a lognormal distribution with scale parameter α = 3 and shape parameter β = 0.5.
(i) Determine the probability that someone selected at random from this group received a bonus of $20,000 or higher.
(ii) If a bonus of $100,000 or higher is considered a "Jumbo" bonus, what percentage of senior technical researchers received such bonuses during this period?
(iii) If a bonus in the range $10,000-$30,000 is considered more "typical," what percentage received this typical bonus?
9 Morris, M.D. (1987). A sequential experimental design for estimating a scale parameter
from quantal life testing data. Technometrics, 29, 173-181
9.44 The home prices (in thousands of dollars) in a county located in the upper mid-Atlantic region of the United States is a lognormal random variable with a median of 403 and a mode of 245.
(i) What percentage of the homes in this region cost more than $500,000?
(ii) If a home is considered "affordable" in this region if it costs between $150,000 and $300,000, what percentage of homes fall into this category?
(iii) Plot the pdf for this random variable. Compute its mean and indicate its value on the plot along with the value given for the median. Which seems to be more representative of the central location of the distribution, the mean or the median?
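A brief computational sketch follows (it assumes the lognormal conventions used in the text: median = e^α and mode = e^(α−β²), so that β² = ln(median/mode)):

import numpy as np
from scipy import stats

median, mode = 403.0, 245.0
alpha = np.log(median)
beta = np.sqrt(np.log(median / mode))
X = stats.lognorm(s=beta, scale=np.exp(alpha))   # lognormal with log-mean alpha

print("beta =", beta, " mean =", X.mean())
print("P(X > 500) =", X.sf(500.0))                        # part (i)
print("P(150 < X < 300) =", X.cdf(300.0) - X.cdf(150.0))  # part (ii)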
9.45 If the proportion of students who obtain failing grades on a foreign University's highly competitive annual entrance examination can be considered as a Beta B(2, 3) random variable,
(i) What is the mode of this distribution, and what percentage of students can be
expected to fail annually?
(ii) Determine the probability that over 90% of the students will pass this examination in any given year.
(iii) The proportion of students from an elite college preparatory school (located
in this same foreign country) who fail this same entrance examination has a Beta
B(1, 7) distribution. Determine the percentage of this select group of students who
can be expected to fail; also, determine the probability that over 90% of these elite
students will pass this examination in any given year.
(iv) Do these results mean that the elite college preparatory school does better in
getting its students admitted into this highly selective foreign University?
9.46 The place kicker on a team in the American National Football League (NFL) has an all-time success rate (total number of field goals made divided by total number of field goals attempted) of 0.82 on field goal attempts of 55 yards or shorter. An attempt to quantify his performance with a probability model resulted in a Beta B(4.5, 1) distribution.
(i) Is this model consistent with the computed all-time success rate?
(ii) To be considered an "elite" place kicker, the success rate from this distance (D ≤ 55) needs to improve to at least 0.9. Determine the probability that this particular place kicker achieves elite status in any season, assuming that he maintains his current performance level.
(iii) It is known that the computed probability of attaining elite status is sensitive to the model parameters, especially α. For the same fixed value β = 1, compute the probability of attaining elite status for the values α = 3.5, 4.0, 4.5, 5.0, 5.5. Plot these probabilities as a function of α.
9.47 If the fluorescence signals obtained from a test spot and the reference spot on a microarray (a device used to quantify changes in gene expression) are represented as random variables X₁ and X₂ respectively, it is possible to show that if these variables can be assumed to be independent, then they are reasonably represented by gamma distributions. In this case, the "fold change" ratio

    Y = X₁/X₂    (9.193)

indicative of the fold increase (or decrease) in the signal intensity between test and reference conditions, has the "inverted Beta" distribution, with the pdf

    f(y) = [Γ(α + β)/(Γ(α)Γ(β))] y^(α−1)/(1 + y)^(α+β); y > 0; α > 0; β > 0    (9.194)

The theoretical distribution with parameters α = 4.8; β = 2.1 fit the fold change ratio for a particular set of data. Because of the detection threshold of the measurement technology, the genes in question are declared to be overexpressed only if Y ≥ 2; if 0.5 ≤ Y < 2, the conclusion is that there is insufficient evidence of differential expression.
(i) Determine the expected fold change ratio.
(ii) Determine the probability that a gene selected at random from this population will be identified as overexpressed.
(iii) Determine the probability that there will be insufficient evidence to conclude that there is differential expression. (Hint: it may be easier to invoke the result that the variable Z, defined as Z = X₁/(X₁ + X₂), has a Beta distribution with the same values of α and β given for the inverted Beta distribution. In this case, the probabilities can be computed in terms of Z rather than Y.)
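Following the hint, a short illustrative sketch is given below; it relies on the equivalence Y ≥ c ⟺ Z ≥ c/(1 + c) with Z ~ B(α, β):

from scipy import stats

alpha, beta = 4.8, 2.1
Z = stats.beta(alpha, beta)

print("E(Y) =", alpha / (beta - 1.0))   # part (i): mean of the inverted Beta
print("P(Y >= 2) =", Z.sf(2.0 / 3.0))   # part (ii): declared overexpressed
print("P(0.5 <= Y < 2) =", Z.cdf(2.0 / 3.0) - Z.cdf(1.0 / 3.0))  # part (iii)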
9.48 The sample variance for the yield data presented in Chapter 1 may be determined as s²_A = 2.05 for process A, and s²_B = 7.62 for process B. If, but for random variability in the data, these variances are the same, then the ratio

    x_AB = s²_A / s²_B

is a sample from an appropriate F distribution. Determine the values f₁ and f₂ of this F-distributed random variable X for which

    P(X ≤ f₁) = 0.025; P(X ≥ f₂) = 0.025

(iii) What do these results imply about the plausibility of the conjecture that but for random variability, the variances of the data obtained from the two processes are in fact the same?
Chapter 10
Information, Entropy and Probability
Models
The defining characteristic of the random variable is that uncertainty in individual outcomes co-exists with regularity in the aggregate ensemble. This aggregate ensemble is conveniently characterized by the pdf, f(x); and, as shown in the preceding chapters, given all there is to know about the phenomenon behind the random variable, X, one can derive expressions for f(x) from first principles. There are many practical cases, however, where the available information is insufficient to specify the full pdf. One can still obtain reasonable probability models from such incomplete information, but this will require an alternate view of the outcomes of random experiments in terms of the information each conveys, from which derives the concept of the "entropy" of a random variable. This chapter is concerned first with introducing the concept of entropy as a means of quantifying uncertainty in terms of the amount of information conveyed by the observation of a random variable's outcome. We then subsequently present a procedure that utilizes entropy to specify full pdfs in the face of incomplete information. The chapter casts most of the results of the two previous chapters in a different context that will be of interest to engineers and scientists.
10.1
10.1.1 Basic Concepts
Let us illustrate with the following example. Case 1 involves a bag containing exactly two balls, 1 red, 1 blue. For Case 2, we add to this bag 10 green,
10 black, 6 white, 6 yellow, 3 purple and 3 orange balls to bring the total to
40 balls. The experiment is to draw a ball from this bag and to consider for
each case, the event that the drawn ball is red. For Case 1, the probability
of drawing a red ball, P1 (Red), is 1/2; for Case 2 the probability P2 (Red) is
1/40. Drawing a red ball is therefore considered more informative in Case 2
than in Case 1.
Another perspective of what makes the drawing of a red ball more informative in Case 2 is that P2 (Red)=1/40 indicates that the presence of a red
ball in the Case 2 bag is a fact that will take a lot of trials on average to
ascertain. On the other hand, it requires two trials, on average, to ascertain
this fact in the Case 1 bag.
To summarize:
1. The information content of (or uncertainty associated with) the statement X = xi increases as P (X = xi ) decreases;
2. The greater the dispersion of the distribution of possible values of X, the
greater the uncertainty associated with the specic result that X = xi
and the lower the P (X = xi ).
We now formalize this qualitative conceptual discussion.
10.1.2 Quantifying Information
The information content of the statement X = xᵢ, with P(X = xᵢ) = pᵢ, is quantified as:

    I(xᵢ) = log₂(1/pᵢ)    (10.2)
          = −log₂ pᵢ    (10.3)

Note that this function satisfies all three conditions stated above.
10.2 Entropy
10.2.1
For the discrete random variable, X, and not just for a specific outcome X = xᵢ, Shannon¹ suggests E[−log₂ f(x)], the average or mean information content, as a suitable measure of the information content in the pdf f(x). This quantity, known as the entropy of the random variable, is defined by:

    H(X) = E[−log₂ f(x)] = Σ_{i=1}^n [−log₂ f(xᵢ)] f(xᵢ)    (10.4)
Expressed in this form, H(X) has units of "bits" (for binary digits), a term that harks back to the original application for which the concepts were developed: a problem involving the characterization of the average minimum binary codeword length required to encode the output of an information source.
The expression for entropy is also sometimes written in terms of natural logarithms (with the matching, if whimsical, unit of "nats") as:

    H(X) = E[−ln f(x)] = Σ_{i=1}^n [−ln f(xᵢ)] f(xᵢ)    (10.5)

One form differs from the other by only a multiplicative constant (specifically ln 2).
Example 10.1 ENTROPY OF A DETERMINISTIC VARIABLE
Compute the entropy of a variable X that takes on the value x₀ with probability 1.
1 Shannon, C.E. (1948). A mathematical theory of communication. Bell System Tech. J., 27, 379-423 and 623-656.
Solution:
Since f(x₀) = 1,

    H(X) = −1 × log₂ 1 = 0    (10.6)

A variable with no uncertainty therefore has zero entropy.

Example 10.2 ENTROPY OF A DISCRETE UNIFORM RANDOM VARIABLE
Compute the entropy of a discrete uniform random variable X that takes on any of k values x₁, x₂, . . . , x_k, with pdf:

    f(xᵢ) = 1/k; i = 1, 2, . . . , k.    (10.7)

Solution:
In this case,

    H(X) = −Σ_{i=1}^k (1/k) log₂(1/k)    (10.8)
         = k × [(1/k) log₂ k] = log₂ k    (10.9)
Example 10.3 ENTROPY OF A BERNOULLI RANDOM VARIABLE
Compute the entropy of the Bernoulli random variable X with the pdf:

    f(x) = 1 − p, x = 0;
           p, x = 1;
           0, elsewhere.    (10.10)

Solution:
By definition, the entropy for this random variable is

    H(X) = −(1 − p) log₂(1 − p) − p log₂ p    (10.11)
[Figure: H(X) versus p for the Bernoulli random variable of Example 10.3; the entropy attains its maximum value of 1 bit at p = 0.5.]
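A one-line numerical check (illustrative) confirms what the figure shows, and what Exercise 10.1 asks the reader to establish by calculus:

import numpy as np

p = np.linspace(0.001, 0.999, 999)
H = -(1 - p) * np.log2(1 - p) - p * np.log2(p)   # Eq (10.11)
print(p[np.argmax(H)], H.max())                  # maximum of ~1 bit at p ~ 0.5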
10.2.2
For a continuous random variable X with pdf f(x), the probability of the event A = {x : a < x < b} is given by:

    P(A) = ∫_a^b f(x) dx    (10.12)
Let us now consider the case in which the interval [a, b] is divided into n subintervals of equal length Δx, so that the ith subinterval Aᵢ = {x : xᵢ < x < xᵢ + Δx}. Then, we recall that, for sufficiently small Δx, the probability that X lies in Aᵢ is:

    P(Aᵢ) ≈ f(xᵢ)Δx    (10.13)
And since A is the union of n disjoint sets Aᵢ, it follows from Eq (10.13) that:

    P(A) = Σ_{i=1}^n P(Aᵢ) ≈ Σ_{i=1}^n f(xᵢ)Δx    (10.14)
from where we obtain the familiar result that in the limit as the quantization interval length Δx → 0, the sum in (10.14) approaches the Riemann integral in (10.12), and in addition, the approximation error vanishes. But as it stands, the expression in (10.13) is a statement of the "differential" probability that X takes on a value in the differential interval between xᵢ and xᵢ + Δx for small, but non-zero, Δx.
Now, let Q(X) be a quantization function that places the continuous random variable anywhere in any one of the n quantized subintervals such that by Q = xᵢ we mean that xᵢ < X < xᵢ + Δx, or X ∈ Aᵢ. Then P(Q = xᵢ) is given by:

    P(Q = xᵢ) = P(xᵢ < X < xᵢ + Δx) ≈ f(xᵢ)Δx    (10.15)
as in Eq (10.13). Since Q is discrete, we may compute its entropy as:

    H(Q) = −Σ_{i=1}^n log₂ P(Q = xᵢ) P(Q = xᵢ)    (10.16)
         ≈ −Σ_{i=1}^n [log₂ f(xᵢ)] f(xᵢ)Δx − log₂(Δx) Σ_{i=1}^n f(xᵢ)Δx    (10.17)

In the limit as Δx → 0, the first term converges to the finite integral

    H(X) = −∫ [log₂ f(x)] f(x) dx    (10.18)

but this is not enough to prevent the entropy H(Q) from increasing without limit because of the −log₂(Δx) term. Thus, not surprisingly, the exact (not approximate) entropy of a continuous random variable again turns out to be infinite, as noted earlier.
If, however, we define the quantity h(Q) by:

    H(Q) = log₂ h(Q)    (10.19)

then it follows from Eq (10.17) that for the quantized random variable,

    log₂ h(Q) ≈ −log₂(Δx) − Σ_{i=1}^n [log₂ f(xᵢ)] f(xᵢ)Δx    (10.20)

so that:

    log₂[h(Q)Δx] ≈ −Σ_{i=1}^n [log₂ f(xᵢ)] f(xᵢ)Δx    (10.21)

a quantity that remains finite in the limit as Δx → 0:

    H(X) = lim_{Δx→0} log₂[h(Q)Δx]    (10.22)

This limiting quantity,

    H(X) = ∫ [−log₂ f(x)] f(x) dx    (10.23)

is the differential entropy of the continuous random variable X.
For example, for the uniform random variable U(a, b), with f(x) = 1/(b − a),

    H(X) = ∫_a^b [−log₂ (1/(b − a))] (1/(b − a)) dx    (10.25)
         = log₂(b − a)    (10.26)

which should be compared with the entropy obtained for the discrete uniform random variable in Example 10.2.
We now discuss how these concepts can be used to obtain useful probability
models in the face of incomplete knowledge.
10.3
When only limited information (perhaps in the form of its range, mean, μ, variance, σ², or more generally, the expectation of some function G(X)) is all that is available about a random variable X, clearly it is not possible to specify the full pdf f(x) uniquely, because many pdfs exist that have the same range, or mean or variance, or whatever partial information has been supplied. The required full but unknown pdf contains additional information over and above what is legitimately known. To postulate a full pdf from such partial information therefore requires that we incorporate extra information to fill in what is missing. The problem at hand may be stated as follows:

How should we choose an appropriate f(x) to use in representing the random variable X given only a few of its characteristic parameters (μ, σ², range, . . .) and nothing else?

The maximum entropy principle states that the f(x) that adds the least amount of extra information (i.e., the one with maximum entropy) should be chosen. The resulting pdf, f(x), is then referred to as the "maximally unpresumptive" distribution because, of all the possible pdfs with the same characteristic parameters as those specified in the problem, f(x) is the least presumptive. Such a pdf is also called a maximum entropy model, and the most common ones will now be derived.
10.4
The procedure for obtaining maximum entropy models involves posing the optimization problem:

    max_{f(x)} {H(X) = E[−ln f(x)]}    (10.27)

(where we have chosen the entropy function representation in nats for convenience) and solving it subject to the known information as constraints, as we now illustrate for several cases.
10.4.1
We begin with a random variable X for which the only known fact is that it can take on any of k discrete values x₁, x₂, . . . , x_k. What is an appropriate probability model for such a random variable?
Problem statement: Obtain f(xᵢ), i = 1, 2, . . . , k, given only that

    Σ_{i=1}^k f(xᵢ) = 1.    (10.28)
The problem is to maximize the entropy

    H(X) = Σ_{i=1}^k [−ln f(xᵢ)] f(xᵢ)    (10.29)

subject to this constraint, for which the Lagrangian is:

    Λ(f) = Σ_{i=1}^k [−ln f(xᵢ)] f(xᵢ) − λ (Σ_{i=1}^k f(xᵢ) − 1)    (10.30)

where λ is a Lagrange multiplier. The optimum f and the optimum value for λ are obtained from the Euler equations²:

    ∂Λ/∂f = 0    (10.31)
    ∂Λ/∂λ = 0    (10.32)
where the second equation merely recovers the constraint. In the particular case at hand, we have

    ∂Λ/∂f = −[f(xᵢ)·(1/f(xᵢ)) + ln f(xᵢ)] − λ = 0    (10.33)

which yields, upon simplification,

    f(xᵢ) = e^(−(1+λ)) = C    (10.34)

2 Weinstock, R., Calculus of Variations, Dover.
Introducing this into the constraint in Eq (10.28) gives:

    Σ_{i=1}^k f(xᵢ) = Σ_{i=1}^k C = 1    (10.35)

so that

    C = 1/k    (10.36)

with the final result that the maximum entropy distribution for this discrete random variable X is:

    f(xᵢ) = 1/k; i = 1, 2, 3, . . . , k    (10.37)

Thus: the maximum entropy principle assigns equal probabilities to each of the k outcomes of a discrete random variable when nothing else is known about the variable.
This is a result in perfect keeping with intuitive common sense; in fact
we have made use of it several times in our previous discussions. Note also
that in Example 10.3, we saw that the Bernoulli random variable attains its
maximum entropy for equiprobable outcomes.
10.4.2
Next, consider a discrete random variable X, taking the values x = 1, 2, 3, . . ., for which the mean, μ, is known and nothing else, so that:

    Σ_{i=1}^∞ f(xᵢ) = 1    (10.38)
    Σ_{i=1}^∞ xᵢ f(xᵢ) = μ    (10.39)

In this case, the Lagrangian is:

    Λ(f) = Σ_{i=1}^∞ [−ln f(xᵢ)] f(xᵢ) − λ₁ (Σ_{i=1}^∞ f(xᵢ) − 1) − λ₂ (Σ_{i=1}^∞ xᵢ f(xᵢ) − μ)    (10.40)
and the resulting Euler equations are:

    ∂Λ/∂f = −(ln f(xᵢ) + 1) − λ₁ − λ₂xᵢ = 0    (10.41)

along with the other partial derivatives with respect to the Lagrange multipliers λ₁, λ₂ that simply recover the two constraints. From Eq (10.41), we obtain:

    f(xᵢ) = e^(−(1+λ₁)) e^(−λ₂xᵢ) = a b^(xᵢ)    (10.42)
Introducing this into the first constraint gives:

    Σ_{x=1}^∞ a bˣ = 1    (10.43)

which, upon evaluating the infinite sum, yields:

    a b/(1 − b) = 1    (10.44)

Similarly,

    Σ_{x=1}^∞ x a bˣ = μ    (10.45)

yields:

    a b/(1 − b)² = μ    (10.46)

Solving Eqs (10.44) and (10.46) simultaneously gives:

    b = 1 − 1/μ    (10.47)
    a = 1/(μ − 1)    (10.48)

so that, upon defining p = 1/μ, we obtain:

    b = (1 − p)    (10.49)
    a = p/(1 − p)    (10.50)

with the final result that the pdf we seek is given by:

    f(x) = p(1 − p)^(x−1)    (10.51)

the pdf of a geometric random variable with parameter p = 1/μ.
10.4.3
Consider now a continuous random variable X for which the only known fact is its range, a < x < b, so that:

    ∫_a^b f(x) dx = 1    (10.52)

The entropy to be maximized is:

    H(X) = ∫_a^b [−ln f(x)] f(x) dx    (10.53)

and the Lagrangian is:

    Λ(f) = ∫_a^b [−ln f(x)] f(x) dx − λ (∫_a^b f(x) dx − 1)    (10.54)
The Euler equations are:

    ∂Λ/∂f = −(ln f(x) + 1) − λ = 0    (10.55)
    ∂Λ/∂λ = 0    (10.56)

again with the latter recovering the constraint. From (10.55) we obtain:

    f(x) = e^(−(1+λ)) = c    (10.57)

which, introduced into the constraint in Eq (10.52), gives:

    ∫_a^b c dx = c[b − a] = 1    (10.58)

or,

    c = 1/(b − a)    (10.59)

Hence, the prescribed pdf is:

    f(x) = 1/(b − a); a < x < b    (10.60)

the pdf of the uniform random variable U(a, b).
10.4.4
Next, consider a continuous, positive random variable X for which only the mean, μ, is known, so that:

    ∫_0^∞ f(x) dx = 1    (10.61)
    ∫_0^∞ x f(x) dx = μ    (10.62)

The Lagrangian for maximizing the differential entropy in this case is:

    Λ(f) = ∫_0^∞ [−ln f(x)] f(x) dx − λ₁ (∫_0^∞ f(x) dx − 1) − λ₂ (∫_0^∞ x f(x) dx − μ)    (10.63)
The Euler equation,

    ∂Λ/∂f = −(ln f(x) + 1) − λ₁ − λ₂x = 0    (10.64)

applies along with the constraints in (10.61) and (10.62). From (10.64) we obtain:

    f(x) = e^(−(1+λ₁)) e^(−λ₂x) = C₁ e^(−λ₂x)    (10.65)

and introducing this into the two constraints gives:

    ∫_0^∞ C₁ e^(−λ₂x) dx = 1    (10.66)
    ∫_0^∞ C₁ x e^(−λ₂x) dx = μ    (10.67)

from which C₁ = 1/μ and λ₂ = 1/μ, so that:

    f(x) = (1/μ) e^(−x/μ), x ≥ 0
           0, otherwise    (10.68)
This is recognizable as the pdf of an exponential random variable (the continuous version of the result obtained earlier in Section 10.4.2). Thus: the maximum entropy principle prescribes an exponential pdf for the continuous random variable for which only the mean is known.
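One way to make this result concrete is to compare differential entropies of several same-mean candidates on (0, ∞); in the sketch below (the rival distributions are arbitrary illustrative choices), the exponential value 1 + ln μ (see Exercise 10.4) comes out largest:

from scipy import stats

mu = 3.0
candidates = {
    "exponential E(mu)": stats.expon(scale=mu),
    "gamma(2, mu/2)": stats.gamma(a=2, scale=mu / 2),
    "uniform U(0, 2mu)": stats.uniform(loc=0, scale=2 * mu),
}
# All three have mean mu; the exponential has the largest differential entropy.
for name, dist in candidates.items():
    print(f"{name}: mean = {dist.mean():.1f}, entropy = {dist.entropy():.4f}")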
10.4.5
Finally, consider a continuous random variable X for which both the mean, μ, and the variance, σ², are known, so that:

    ∫_{−∞}^∞ f(x) dx = 1    (10.69)
    ∫_{−∞}^∞ x f(x) dx = μ    (10.70)
    ∫_{−∞}^∞ (x − μ)² f(x) dx = σ²    (10.71)

The Lagrangian is:

    Λ(f) = ∫[−ln f(x)] f(x) dx − λ₁(∫ f(x) dx − 1) − λ₂(∫ x f(x) dx − μ) − λ₃(∫ (x − μ)² f(x) dx − σ²)    (10.72)
The Euler equation,

    ∂Λ/∂f = −ln f(x) − 1 − λ₁ − λ₂x − λ₃(x − μ)² = 0    (10.73)

applies along with the three constraints in (10.69)-(10.71). Solving (10.73) gives:

    f(x) = C₁ e^(−λ₂x) e^(−λ₃(x−μ)²)    (10.74)

Substituting this back into the constraints and using the result:

    ∫_{−∞}^∞ e^(−au²) du = √(π/a)    (10.75)

gives:

    C₁ = 1/(σ√(2π))    (10.76)
    λ₂ = 0    (10.77)
    λ₃ = 1/(2σ²)    (10.78)

and therefore:

    f(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²))    (10.79)
the familiar Gaussian pdf. Thus, when only the mean, μ, and the variance, σ², are all that we legitimately know about a continuous random variable, the maximally unpresumptive distribution f(x) is the Gaussian pdf.
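A companion numerical check (again with arbitrarily chosen rivals) shows that, for a fixed mean and variance, the Gaussian entropy ln(σ√(2πe)) of Exercise 10.6 dominates:

import numpy as np
from scipy import stats

mu, sigma = 0.0, 2.0
same_moments = {
    "normal": stats.norm(mu, sigma),
    "laplace": stats.laplace(mu, sigma / np.sqrt(2)),   # variance 2b^2 = sigma^2
    "uniform": stats.uniform(mu - sigma * np.sqrt(3), 2 * sigma * np.sqrt(3)),
}
# All three have the same mean and variance; the normal entropy is largest.
for name, dist in same_moments.items():
    print(f"{name}: var = {dist.var():.2f}, entropy = {dist.entropy():.4f}")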
10.4.6
When, in addition to the mean, μ, and the variance, σ², the random variable is also known to be restricted to the range 0 ≤ x ≤ 1, a similar (if more laborious) argument leads to the Beta B(α, β) pdf:

    f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^(α−1) (1 − x)^(β−1)

(see Table 10.1).

10.5
10.5.1
Consider the case where the only information available about the random variable X is the expectation, γ, of some function G(X), i.e.:

    Σ_{i=1}^n G(xᵢ) f(xᵢ) = γ    (10.80)

for discrete X, or

    ∫ G(x) f(x) dx = γ    (10.82)
for continuous X. Given this single piece of information, expressions for the
resulting maximum entropy models for any G(X) will now be derived.
Discrete Case
The Lagrangian in the discrete case is given by:

    Λ(f) = Σ_{i=1}^n [−ln f(xᵢ)] f(xᵢ) − λ₁ (Σ_{i=1}^n f(xᵢ) − 1) − λ₂ (Σ_{i=1}^n f(xᵢ) G(xᵢ) − γ)    (10.83)

which may be rearranged (to within terms that do not depend on f) into:

    Λ(f) = −Σ_{i=1}^n f(xᵢ) ln [f(xᵢ)/(C e^(−λG(xᵢ)))]    (10.84)

an expression that is maximized when:

    f(xᵢ) = C e^(−λG(xᵢ))    (10.85)
This, then, is the desired maximum entropy model for any function G(X) of the discrete random variable, X; the constants C and λ are determined such that f(xᵢ) satisfies the constraint:

    Σ_{i=1}^n f(xᵢ) = 1    (10.86)
Continuous Case
Similarly, in the continuous case the Lagrangian may be rearranged into:

    Λ(f) = −∫ f(x) ln [f(x)/(C e^(−λG(x)))] dx    (10.88)

which is maximized when:

    f(x) = C e^(−λG(x))    (10.89)

Again, this represents the maximum entropy model for any function G(X) of the continuous random variable X, with the indicated constants to be determined to satisfy the constraint:

    ∫ f(x) dx = 1    (10.90)

along with the given expectation:

    ∫ f(x) G(x) dx = γ    (10.91)
10.5.2 Multiple Expectations
If what is known about the random variable X is the set of expectations of m functions Gⱼ(X), i.e.:

    Σ_{i=1}^n Gⱼ(xᵢ) f(xᵢ) = γⱼ; j = 1, 2, . . . , m;    (10.92)

for discrete X, or

    ∫ Gⱼ(x) f(x) dx = γⱼ; j = 1, 2, . . . , m;    (10.93)

for continuous X, so that in each case, γⱼ are known constants, then the Lagrangians are obtained as

    Λ(f) = Σ_{i=1}^n [−ln f(xᵢ)] f(xᵢ) − Σ_{j=1}^m λⱼ (Σ_{i=1}^n f(xᵢ) Gⱼ(xᵢ) − γⱼ)    (10.94)

and

    Λ(f) = ∫ [−ln f(x)] f(x) dx − Σ_{j=1}^m λⱼ (∫ f(x) Gⱼ(x) dx − γⱼ)    (10.95)
It can be shown that these Lagrangians are maximized for pdfs given by:

    f(xᵢ) = C e^(−Σ_{j=1}^m λⱼ Gⱼ(xᵢ))    (10.96)

for discrete X, and

    f(x) = C e^(−Σ_{j=1}^m λⱼ Gⱼ(x))    (10.97)

for continuous X, generalizing the results in Eqs (10.85) and (10.89). These results are from a theorem by Boltzmann (1844-1906), the Austrian theoretical physicist credited with inventing statistical thermodynamics and statistical mechanics. The constant C in each case is the normalizing constant determined such that ∫f(x)dx and Σᵢ f(xᵢ) equal 1; the m Lagrange multipliers λ₁, λ₂, . . . , λ_m are obtained from solving simultaneously the m equations representing the known expectations in Eqs (10.92) and (10.93).
The following are two applications of this set of results. Consider the case where, for a continuous random variable X,

    G₁(X) = X    (10.98)

and

    G₂(X) = ln X    (10.99)

with

    E[G₁(X)] = αβ    (10.100)
    E[G₂(X)] = Γ′(α)/Γ(α) + ln β    (10.101)

Then the maximum entropy pdf prescribed by Eq (10.97) takes the form

    f(x) = C e^(−[λ₁x + λ₂ ln x])    (10.102)

and determining the constants (see Exercise 10.10) establishes that this is the gamma γ(α, β) pdf:

    f(x) = (1/(β^α Γ(α))) e^(−x/β) x^(α−1)    (10.103)
Similarly, for a continuous random variable X restricted to the range (0, 1), consider:

    G₁(X) = ln X    (10.104)
    G₂(X) = ln(1 − X)    (10.105)

with

    E[G₁(X)] = Γ′(α)/Γ(α) − Γ′(α + β)/Γ(α + β)    (10.106)
    E[G₂(X)] = Γ′(β)/Γ(β) − Γ′(α + β)/Γ(α + β)    (10.107)

In this case, the maximum entropy pdf prescribed by Eq (10.97),

    f(x) = C e^(−[λ₁ ln x + λ₂ ln(1−x)])    (10.108)

may be shown to be the Beta B(α, β) pdf:

    f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^(α−1) (1 − x)^(β−1)    (10.109)
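The Boltzmann recipe can also be exercised numerically: pick target expectations, then solve for the Lagrange multipliers in Eq (10.97). The sketch below (illustrative; the initial guess is chosen to keep the integrals convergent) does this for the gamma case of Eqs (10.98)-(10.103), recovering λ₁ = 1/β and λ₂ = 1 − α:

import numpy as np
from scipy.integrate import quad
from scipy.optimize import fsolve
from scipy.special import digamma

a, b = 2.0, 1.5                                      # target gamma(alpha, beta)
targets = np.array([a * b, digamma(a) + np.log(b)])  # E[X] and E[ln X]

def gap(lams):
    l1, l2 = lams
    w = lambda x: np.exp(-l1 * x - l2 * np.log(x))   # un-normalized maxent pdf
    Z = quad(w, 0, np.inf)[0]
    EX = quad(lambda x: x * w(x), 0, np.inf)[0] / Z
    ElnX = quad(lambda x: np.log(x) * w(x), 0, np.inf)[0] / Z
    return np.array([EX, ElnX]) - targets

l1, l2 = fsolve(gap, x0=[0.8, -0.8])
print(l1, 1.0 / b)   # lambda_1 recovers 1/beta
print(l2, 1.0 - a)   # lambda_2 recovers 1 - alpha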
10.6 Summary and Conclusions
This chapter has been concerned with the problem of how to determine appropriate (and complete) probability models when only partial information is available about the random variable in question. The first-principles approach to probability model development discussed in the earlier chapters (Chapter 8 for discrete random variables and Chapter 9 for the continuous type) is predicated on the availability of complete phenomenological information about the random variable of interest. When this is not the case, and only partial information is available, model development must be approached differently. This chapter has offered such an alternative approach, one based on the "maximum entropy principle." The essence of this principle is that of all the several pdfs whose characteristics are consistent with the available partial information, the one that adds the least amount of extraneous information, the least presumptive, should be chosen as an appropriate model. To realize this intuitively appealing concept fully in practice, of course, requires advanced optimization theory, much of which is not expected to be familiar to the average reader. Still, enough of the derivation details have been presented to allow the reader to appreciate how the results came to be.
It is interesting to note now in retrospect that all the results presented in this chapter have involved familiar pdfs encountered previously in earlier discussions. This should not give the impression that these are the only useful maximum entropy distributions; neither should this be construed as implying that all pdfs encountered previously have maximum entropy interpretations. The scope of coverage was designed first to demonstrate to (and inspire confidence in) the reader that this approach, even though somewhat esoteric, does in fact lead to results that make sense. Secondly, this coverage was also designed to offer a different perspective of some of these familiar models. For instance, as a model for residence time distribution in chemical reaction engineering, we have seen the exponential distribution arise from chemical engineering arguments (Chapter 2), probability arguments (Chapter 9), and now from maximum entropy considerations. The same is true for the geometric distribution as a model for polymer chain length distribution (see Application Problem 10.12). But the application of this principle in practice extends well beyond the catalog of familiar results shown here; for example, see Phillips et al., (2004)³ for an application to the problem of modeling geographic distributions of species, a critical problem in conservation biology.
With the discussion in this chapter behind us, we have now completed our study of probability models and their development. The discussion in the next chapter is a case study illustrating how probability models are developed, validated and applied in solving the complex and important practical problem of optimizing the effectiveness of in-vitro fertilization.
The main points and results of this chapter are summarized in Table 10.1.
REVIEW QUESTIONS
1. What are the three axioms employed in quantifying the information content of
the statement P (X = xi ) = pi ?
2. In what ways are the axioms of information content akin to the axioms of
probability encountered in Chapter 4?
3. What is the entropy of a discrete random variable X with a pdf f (x)?
3 S. J. Phillips, M. Dudík and R. E. Schapire, (2004) A Maximum Entropy Approach to Species Distribution Modeling, Proc. Twenty-First International Conference on Machine Learning, 655-662.
TABLE 10.1: Summary of maximum entropy distributions

Known Random Variable Characteristics → Maximum Entropy Distribution f(x)

- Discrete; binary (0, 1): f(0) = f(1) = 1/2
- Discrete; range: i = 1, 2, . . . , k: f(xᵢ) = 1/k
- Discrete; mean, μ: f(x) = p(1 − p)^(x−1); p = 1/μ
- Continuous; range: a ≤ x ≤ b: f(x) = 1/(b − a)
- Continuous; mean, μ: f(x) = (1/μ) e^(−x/μ)
- Continuous; mean, μ; variance, σ²: f(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²))
- Continuous; mean, μ; variance, σ²; range: 0 ≤ x ≤ 1: Beta B(α, β): f(x) = (Γ(α+β)/(Γ(α)Γ(β))) x^(α−1)(1−x)^(β−1)
- G(X) = 0; a ≤ x ≤ b; E[G(X)] = 0: f(x) = 1/(b − a)
- G₁(X) = X; G₂(X) = ln X; E[G₁(X)] = αβ; E[G₂(X)] = Γ′(α)/Γ(α) + ln β: f(x) = (1/(β^α Γ(α))) e^(−x/β) x^(α−1)
- G₁(X) = ln X; G₂(X) = ln(1 − X); E[G₁(X)] = Γ′(α)/Γ(α) − Γ′(α+β)/Γ(α+β); E[G₂(X)] = Γ′(β)/Γ(β) − Γ′(α+β)/Γ(α+β): Beta B(α, β)
4. Why is entropy as defined for discrete random variables not very useful for continuous random variables?
5. What is the corresponding entropy concept for continuous random variables?
6. What is "quantization" and why must there always be a trade-off between quantization accuracy and entropy of a continuous random variable?
7. What is the differential entropy of a continuous random variable?
8. What is the entropy of a random variable about which nothing is known?
9. What is the entropy of a variable with no uncertainty?
10. What effect does any additional information about a random variable have on its entropy?
11. Provide a succinct statement of the primary problem of this chapter.
12. What is the maximum entropy principle for determining full pdfs when only partial information is available?
13. What is the maximum entropy distribution for a discrete random variable X for which the only known fact is that it can take on any of k discrete values x₁, x₂, . . . , x_k?
14. What is the maximum entropy distribution for a discrete random variable X for which the mean is known and nothing else?
15. What is the maximum entropy distribution for a continuous random variable X for which the only known fact is its range, V_X = {x : a < x < b}?
16. What is the maximum entropy distribution for a continuous random variable X for which the mean is known and nothing else?
17. What is the maximum entropy distribution for a continuous random variable X for which the mean, μ, and variance, σ², are known and nothing else?
18. What is the maximum entropy distribution for a continuous random variable X for which the range, (0,1), mean, μ, and variance, σ², are known and nothing else?
19. Which two equations arise from a theorem of Boltzmann and how are they used to obtain maximum entropy distributions?
EXERCISES
10.1 Using the principles of differential calculus, establish that the entropy of the Bernoulli random variable, shown in Eq (10.11), i.e.,

    H(X) = −(1 − p) log₂(1 − p) − p log₂ p

is maximized when p = 0.5.
10.2 Determine the maximum entropy distribution for the Binomial random variable X, the total number of "successes" in n Bernoulli trials, when all that is known is that with each trial, there are exactly only two outcomes. (Hint: X = Σ_{i=1}^n Xᵢ, where Xᵢ is a Bernoulli random variable.)
10.3 Determine the entropy for the geometric random variable, G(p), with the pdf

    f(x) = p q^(x−1); x = 1, 2, . . .

and compare it to the entropy obtained for the Bernoulli random variable in Example 10.3 in the text.
10.4 Show that the entropy for the exponential random variable, E(β), with pdf

    f(x) = (1/β) e^(−x/β); 0 < x < ∞

is given by:

    H(X) = 1 + ln β    (10.110)
10.5 Show that the entropy for the Gamma random variable, γ(α, β), with pdf

    f(x) = (1/(β^α Γ(α))) e^(−x/β) x^(α−1); 0 < x < ∞

is given by:

    H(X) = α + ln β + ln Γ(α) + (1 − α) Γ′(α)/Γ(α)    (10.111)

Directly from this result, write an expression for the entropy of the χ²(r) random variable.
10.6 Show that the entropy for the Gaussian N(μ, σ²) random variable with pdf

    f(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²))

is given by:

    H(X) = ln(σ√(2πe))    (10.112)

and hence establish that the entropy of a Gaussian random variable depends only on σ and not μ. Why does this observation make sense? In the limit as σ → ∞, what happens to the entropy, H(X)?
10.7 Show that the entropy for the Lognormal L(α, β) random variable with pdf

    f(x) = (1/(xβ√(2π))) exp[−(ln x − α)²/(2β²)]; 0 < x < ∞

is given by:

    H(X) = ln(β√(2πe)) + α    (10.113)

Compare this with the expression for the entropy of the Gaussian random variable in Exercise 10.6, Eq (10.112). Why does the entropy of the Lognormal random variable depend linearly on α while the entropy of the Gaussian random variable does not depend on the corresponding parameter, μ, at all?
10.8 The maximum entropy distribution for a random variable X for which G(X) and its expectation, E[G(X)], are specified, was given in Eq (10.85) for discrete X, and Eq (10.89) for continuous X, i.e.,

    f(x) = C e^(−λG(xᵢ)); for discrete X
    f(x) = C e^(−λG(x)); for continuous X

Determine f(x) completely (i.e., determine C and λ explicitly) under the following conditions:
(i) G(Xᵢ) = 0; i = 1, 2, . . . , k, a discrete random variable for which nothing is known except its range;
(ii) G(X) = 0; a < X < b;
(iii) G(Xᵢ) = Xᵢ; i = 1, 2, . . .; E[G(X)] = μ
(iv) G(X) = X; 0 < x < ∞; E[G(X)] = μ
10.9 For the continuous random variable X for which

    G(X) = (X − μ)²

is specified along with its expectation,

    E[G(X)] = E[(X − μ)²] = σ²,

the maximum entropy distribution was given in Eq (10.89) as:

    f(x) = C e^(−λG(x))

Show that the constants in this pdf are given by:

    C = 1/(σ√(2π))    (10.114)
    λ = 1/(2σ²)    (10.115)

and hence that:

    f(x) = (1/(σ√(2π))) exp[−(x − μ)²/(2σ²)]    (10.117)

You may find the following identity useful:

    ∫_{−∞}^∞ e^(−au²) du = √(π/a)    (10.116)

10.10 For the gamma random variable of Section 10.5.2, with G₁(X) = X and G₂(X) = ln X and the expectations given in Eqs (10.100) and (10.101),
it was stated in the text that the maximum entropy pdf prescribed by Eq (10.97) is:
    f(x) = C e^(−[λ₁x + λ₂ ln x]) = C x^(−λ₂) e^(−λ₁x)    (10.118)
Determine the constants C, λ₁ and λ₂ and hence establish the result given in Eq
(10.103).
10.11 Revisit Exercise 10.10 for the case where the information available about the random variable X is:

    G₁(X) = X/β; and G₂(X) = ln(X/β)

along with

    E[G₁(X)] = α; α > 0; and E[G₂(X)] = Γ′(α)/Γ(α)

obtain an explicit expression for the maximum entropy pdf in this case.
APPLICATION PROBLEMS
10.12 In certain polymerization processes, the polymer product is made by the
sequential addition of monomer units to a growing chain. At each step, after the
chain has been initiated, a new monomer may be added, propagating the chain,
or a termination event can occur, stopping the growth; whether the growing chain
propagates or terminates is random. The random nature of the propagation and
termination events is responsible for polymer products having chains of variable
lengths. As such, because it is a count of the number of monomer units in the chain,
X, the length of a particular polymer chain, is a discrete random variable.
Now consider the case where the only information available about a particular
process is the kinetic rate of the termination reaction, given as RT per min, which
can be interpreted as implying that an average of RT chain terminations occur per
min. By considering the reciprocal of RT , i.e.,
    p = 1/R_T
as the probability that a termination reaction will occur, obtain f (x), a maximum
entropy distribution for the polymer chain length, in terms of p. This distribution
is often known as the most probable chain length distribution.
10.13 As introduced very briey in Chapter 2, the continuous stirred tank reactor
(CSTR), a ubiquitous equipment used in the chemical industry to carry out a wide
variety of chemical reactions, consists of a tank of volume V liters, through which
the reactant stream ows continuously at a rate of F liters/sec; the content is vigorously stirred to ensure uniform mixing, and the product is continuously withdrawn
from the outlet at the same rate, F liters/sec. Because of the vigorous mixing, the
amount of time any particular uid element spends in the reactorthe reactor residence timevaries randomly, so that there is in fact not a single value for residence
time, X, but a distribution of values. Clearly, the residence time aects the productivity of the reactor, and characterizing it is a central concern in chemical reaction
engineering.
Now, a stream continuously fed at a rate F liters/sec through a reactor of volume
V liters implies an average residence time, in secs, given by
=
V
F
Given only this information, obtain a maximum entropy distribution for the residence time in the CSTR, and compare it with the result in Section 2.1.2 of Chapter 2.
10.14 Integrins are transmembrane receptors that link the actin cytoskeleton of a cell to the extracellular matrix (ECM). This connection, which constantly and dynamically reorganizes in response to mechanical, chemical, and other environmental cues around the cell, leads to lateral assembly of integrins into small stationary focal complexes or clusters of integrins. Integrin clustering, an extremely important process in cell attachment and migration, is a stochastic process that results in heterogeneous populations of clusters that are best characterized with distributions. One of the many characteristics of an integrin cluster is its shape. Because integrin clusters grow or shrink in different directions depending on the orientation and tension of the actin cytoskeleton, the shape of an integrin cluster provides useful information concerning the forces acting on a particular adhesive structure.
The shape of integrin clusters is often idealized as an ellipse and quantified by its eccentricity, ε, the ratio of the distance between the foci of the representative ellipse to the length of its major axis. This quantity has the following properties:
1. It is scaled between 0 and 1, i.e., 0 ≤ ε ≤ 1;
2. ε → 1 for elongated clusters; for circular clusters, ε → 0;
3. Physiologically, integrin clusters in adherent cells tend to be more elongated than circular; non-adherent cells tend to have more circular integrin clusters.
Given that the average and variance of cluster eccentricity is often known for a particular cell (and, in any event, these can be determined experimentally), obtain a maximum entropy distribution to use in representing this aspect of integrin clustering.
Data obtained by Welf (2009)⁴ from Chinese Hamster Ovary (CHO) cells stably expressing the integrin αIIbβ3, indicated an average eccentricity of 0.92 and
4 Welf, E. S. (2009). Integrative Modeling of Cell Adhesion Processes, PhD Dissertation,
University of Delaware.
variance, σ² = 0.003 for the particular collection of integrin clusters studied. From this information, obtain a specific theoretical pdf that characterizes the size distribution of this experimental population of integrin clusters and determine the mode of the distribution. Plot the pdf; and from the shape of this distribution, comment on whether you expect the specific clusters under study to belong to adherent or non-adherent cells.
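A possible computational sketch for this problem (using the Beta prescription of Table 10.1 for a variable on (0, 1) with known mean and variance, fitted by simple moment matching; variable names are illustrative):

from scipy import stats

m, v = 0.92, 0.003                 # reported mean and variance of eccentricity
k = m * (1 - m) / v - 1.0          # method-of-moments factor
alpha, beta = m * k, (1 - m) * k   # Beta parameters matching (m, v)

print("alpha =", alpha, " beta =", beta)
print("mode =", (alpha - 1) / (alpha + beta - 2))   # near 1: elongated clusters
# stats.beta(alpha, beta).pdf(...) can then be plotted over (0, 1).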
10.15 Mee (1990)5 presented the following data on the wall thickness (in ins) of
cast aluminum cylinder heads used in aircraft engine cooling jackets. The mean and
variance of the wall thickness are therefore considered as known. If a full pdf is to
be prescribed to characterize this important property of the manufactured cylinder
heads, use the maximum entropy principle to postulate one. Even though there are
only 18 data points, plot the theoretical pdf versus a histogram of the data and
comment on the model t.
0.223 0.201 0.228 0.223 0.214 0.224
0.193 0.231 0.223 0.237 0.213 0.217
0.218 0.204 0.215 0.226 0.233 0.219
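A brief sketch of the prescribed approach (the Gaussian maximum entropy model of Section 10.4.5, fitted by the sample mean and variance) might look like this:

import numpy as np
from scipy import stats

thickness = np.array([0.223, 0.201, 0.228, 0.223, 0.214, 0.224,
                      0.193, 0.231, 0.223, 0.237, 0.213, 0.217,
                      0.218, 0.204, 0.215, 0.226, 0.233, 0.219])
mu, sigma = thickness.mean(), thickness.std(ddof=1)
model = stats.norm(mu, sigma)   # maximum entropy pdf given mean and variance

print(mu, sigma)
# Overlay model.pdf(x) on a histogram of the 18 points to judge the fit.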
10.16 The total number of occurrences of a rare event in an interval of time (0, T), when the event occurs at a mean rate, η, per unit time, is known to be a Poisson random variable. However, given that an event has occurred in this interval, and without knowing exactly when the event occurred, determine a maximum entropy distribution for the time of occurrence of this lone event within this interval.
Now let X be the time of occurrence in the normalized unit time interval, (0,1). Using the just-obtained maximum entropy distribution, derive the distribution for the log-transformed variable,

    Y = ln(1/X)    (10.119)

Interpret your result in terms of what you know about the Poisson random variable and the inter-arrival times of Poisson events.
5 Mee, R. W., (1990). An improved procedure for screening based on a correlated, normally distributed variable, Technometrics, 32, 331-337.
Chapter 11
Application Case Studies II: In-Vitro
Fertilization
11.1 Introduction
11.2 In-Vitro Fertilization and Multiple Births
11.2.1 Background and Problem Definition
11.2.2 Clinical Studies and Recommended Guidelines
    Factors affecting live-birth and multiple-birth rates
    Prescriptive Studies
    Determining Implantation Potential
    Qualitative Optimization Studies
11.3 Probability Modeling and Analysis
11.3.1 Model Postulate
11.3.2 Prediction
11.3.3 Estimation
11.4 Binomial Model Validation
11.4.1 Overview and Study Characteristics
11.4.2 Binomial Model versus Clinical Data
11.5 Problem Solution: Model-based IVF Optimization and Analysis
11.5.1 Optimization
11.5.2 Model-based Analysis
11.5.3 Patient Categorization and Theoretical Analysis of Treatment Outcomes
11.6 Sensitivity Analysis
11.6.1 General Discussion
11.6.2 Theoretical Sensitivity Analysis
11.7 Summary and Conclusions
11.7.1 Final Wrap-up
11.7.2 Conclusions and Perspectives on Previous Studies and Guidelines
References
PROJECT ASSIGNMENT
(IVF), an iconic 20th century creature of whose future existence the designers of the garments of probability distributions (specifically, the binomial distribution) were utterly oblivious a few centuries ago. Our surprise and delight do not end upon discovering just how well the binomial pdf model fits, as if custom-made for IVF analysis; there is the additional and completely unexpected bonus discovery that, with no modification or additional embellishments, this model also is perfectly suited to solving the vexing problem of maximizing the chances of success while simultaneously minimizing the risk of multiple births and total failure. This chapter is the second in the series of case studies designed to illustrate how the probabilistic framework can be used effectively to solve complex, real-life problems involving randomly varying phenomena.
11.1 Introduction
When Theresa Anderson learned that she was pregnant with quintuplets, she was dumbfounded. She had signed up to be a surrogate mother for an infertile couple and during an in vitro fertilization procedure doctors introduced five embryos into her womb. "They told me there was a one in 30 chance that one would take," she says. Instead, all five took. The 26-year-old mother of two endured a difficult pregnancy and delivered the boys in April [2005] at a Phoenix hospital. The multiple births made headlines across the country as a feel-good tale. But they also underscore a reality of the fertility business: Many clinics are implanting more than the recommended number of embryos in their patients, raising the risks for women.
So began an article by Sylvia Pagán Westphal that appeared in the Wall Street Journal (WSJ) on October 7, 2005. In-vitro fertilization (IVF), the very first of a class of procedures collectively known as Assisted Reproductive Technology (ART), was originally developed specifically to treat infertility caused by blocked or damaged fallopian tubes; it is now used to treat a variety of infertility problems, with impressive success. With IVF, eggs and sperm are combined in a laboratory to fertilize in vitro (literally "in glass"). The fertilized eggs are later transferred into the woman's uterus, where, in the successful cases, implantation, embryo development, and ultimately, a live birth, will occur as with all other normal pregnancies. Since 1978 when the first so-called "test-tube baby" was born, IVF has enabled an increasing number of otherwise infertile couples to experience the joy of having children. So successful in fact has IVF been that its success rates now compare favorably to natural pregnancy rates in any given month, especially when the woman is under 40 years of age and there are no sperm problems.
With the advent of oocyte donation (Yaron, et al., 1997; Reynolds, et al., 2001), where the eggs used for IVF have been donated typically by younger women, the once formidable barriers to success due to age or ovarian status are no longer as serious. Today, ART with oocyte donation is one of the most successful treatment programs for infertility. Recent studies have reported pregnancy rates as high as 48% per retrieval: out of a total of 6,936 IVF procedures using donated oocytes carried out in 1996 and 1997, 3,320 resulted in pregnancies and 2,761 live-birth deliveries, an astonishing success rate (in terms of deliveries per retrieval) of 39.8% (Reynolds, et al., 2001).
However, as indicated by the WSJ article, and a wide variety of other clinical studies, including the Reynolds, et al., 2001, study noted above, IVF patients are more likely to have multiple-infant births than women who conceive naturally; furthermore, these multiple pregnancies are known to increase the risks for a broad spectrum of problems, ranging from premature delivery and low birth weight, to such long-term disabilities as cerebral palsy, among surviving babies. For example, Patterson et al., 1993, report that the chance of a twin pregnancy resulting in a baby with cerebral palsy is 8 times that of a singleton birth.
The vast majority of pregnancies with three or more babies are due to IVF and other such assisted reproductive technologies (fewer than 20% arise from natural conception); and such multiple births contribute disproportionately to infant and maternal morbidity and mortality rates, with corresponding increased contributions to health care costs. Consequently, many national and professional organizations in the U.S., Canada and other western countries have provided guidelines on the number of embryos to transfer, in an attempt to balance the desire for success against the risk of multiple births.
The primary objective of this chapter is to examine the fundamental problem of multiple births in IVF from a probabilistic perspective. In what follows, first we review a few representative clinical studies and recommended IVF practice guidelines, and then develop a probability model for IVF and validate it against clinical data. Finally, with the probability model as the basis, we pose and then solve the optimization problem of maximizing the chances for success while simultaneously reducing the risk of multiple births. The various consensus qualitative recommendations and guidelines are then interpreted in the context of the probability model and the optimal solution obtained from it.
11.2
11.2.1
be made to minimize the risk of multifetal gestation while maintaining a high probability of healthy live birth.¹
The extent of the results of these studies (reviewed shortly) is a collection of sound recommendations, based on careful analyses of specific clinical data sets to be sure, but no explicit solution to the optimization problem. To the best of our knowledge, to date, there is in fact no systematic, explicit, quantitative solution to the IVF optimization problem whereby the optimum number of embryos to transfer can be prescribed concretely for each individual patient. Such a solution is developed in this chapter.
11.2.2
The literature on the topic of Assisted Reproduction, even when restricted to papers that focus explicitly on the issue of IVF and multiple births,
is quite extensive. This fact alone makes an exhaustive review next to impossible within the context of this chapter. Nevertheless, it is possible to discuss a
few key papers that are particularly relevant to the objectives of this chapter
(the application of probability models to problems of practical importance).
Factors affecting live-birth and multiple-birth rates
The first group of papers, exemplified by Schieve et al., 1999; Engmann, et al., 2001; Reynolds et al., 2001; Jansen, 2003; and Vahratian, et al., 2003, use retrospective analyses of various types of clinical IVF data to determine what factors influence live-birth rates and the risk of multiple births. The main conclusions in each of these studies were all consistent and may be summarized as follows:

1. Patient age and the number of embryos transferred independently affected the chances for live birth and multiple birth.

2. In general, live-birth rates increased if more than 2 embryos were transferred.

3. The number of embryos needed to achieve maximum live-birth rates varied with age. For younger women (age < 35 years), maximum live-birth rates were achieved with only two embryos transferred; for women age > 35 years, live-birth rates were lower in general, increasing if more than 2 embryos were transferred.

4. Multiple-birth rates generally increased with increased number of embryos transferred, but in an age-dependent manner, with younger women (age < 35 years) generally showing higher multiple-birth risks than older women.
¹Guidelines for the Number of Embryos to Transfer Following In Vitro Fertilization, J Obstet Gynaecol Can 2006;28(9):799-813.
5. Special Cases: For IVF treatments using donor eggs, the age of the donor rather than maternal age was more important as a determinant of the risk of multiple birth (Reynolds, et al., 2001). Also, success rates are lower in general with thawed embryos than with fresh ones (Vahratian, et al., 2003).
These conclusions, supported concretely by clinical data and rigorous statistical analyses, are, of course, all perfectly in line with common sense.
Prescriptive Studies
The next group, exemplified by Austin, et al., 1996; Templeton and Morris, 1998; Strandel et al., 2000; and Thurin et al., 2004, are more prescriptive in that each in its own way sought to provide explicit guidance, also from clinical data, on the number of embryos to transfer in order to limit multiple births. A systematic review in Pandian et al., 2004, specifically compares the effectiveness of elective two-embryo transfer with single-embryo transfer and transfers involving more than 2 embryos. The Thurin et al., 2004, study is unique, being representative of a handful of randomized prospective studies in which, rather than analyze clinical data 'after the fact' (as with other studies), they collected their own data (in real-time) after assigning the treatment applied to each patient (single-embryo transfer versus double-embryo transfer) in randomized trials. Nevertheless, even though these studies were all based on different data sets from different clinics, utilized different designs, and employed different methods of analysis, the conclusions were all remarkably consistent:
1. The risk of multiple births increased with increasing number of (good quality) embryos transferred, with patients younger than 40 at higher risk;

2. The rate of multiple births can be reduced significantly by transferring no more than two embryos;

3. By performing single-embryo transfers (in selected cases), the rate of multiple births can be further reduced, although at the expense of a reduced rate of live births;

4. Consecutive single-embryo transfers (one fresh embryo transfer followed, in the event of a failure to achieve term pregnancy, by one additional frozen-and-thawed embryo transfer) achieves the same significant reduction possible with single-embryo transfer without lowering the rate of live births substantially below that achievable with a one-time double-embryo transfer.
IVF success rates without unduly increasing the risk of multiple births. The key conclusions of the study are:

1. Ideally, a single embryo transfer would be optimal if the implantation rates (per embryo) were as high as 50%;

2. Embryos should be screened, and only the few with high potential implantation rates should be selected for transfer.

3. No more than two embryos should be transferred per attempt. To offset the potential lowering of the IVF success rate, the rest should be cryopreserved for subsequent frozen-thaw embryo transfer.

Because it is comprehensive and detailed, but especially because its presentation is particularly appropriate, the results from this study are used to validate the model presented in the next section.
The final category to discuss consists of the guidelines and policy recommendations developed by professional organizations and various governmental agencies in western nations, particularly the US, Canada, the United Kingdom, and Sweden. For example, in 1991, the British Human Fertilisation and Embryology Authority (HFEA) imposed a legal restriction on the number of allowable embryos transferred to a maximum of 3; Sweden, in 1993, recommended a further (voluntary) reduction in the number of embryos transferred from 3 to 2. The American Society of Reproductive Medicine recommended in 1999² that no more than two embryos should be transferred for women under the age of 35 who produce healthy embryos; three for those producing poor embryos. A further tightening of these (voluntary) recommendations in 2004 now suggests that women younger than 35 years old with good prognoses consider single-embryo transfers, with no more than two embryos transferred except under extraordinary circumstances. For women aged 35-37 years the recommendation is two embryos for those with good prognoses and no more than 3 for those with poorer prognoses. The Canadian guidelines issued in 2006, referred to earlier in Section 11.2.1, are similar, but more detailed and specific. Because they are consistent with, and essentially capture and consolidate, all the results of the previously highlighted studies into a single set of cohesive points, the key aspects are presented below:
1. Individual IVF-ET (embryo transfer) programs should evaluate their own data to identify patient-specific, embryo-specific, and cycle-specific determinants of implantation and live birth in order to develop embryo transfer policies that minimize the occurrence of multifetal gestation while maintaining acceptable overall pregnancy and live birth rates.

2. In women under the age of 35 years, no more than two embryos should be transferred.

²American Society of Reproductive Medicine. Guidelines on number of embryos transferred. Birmingham, Alabama, 1999.
11.3
11.3.1
improving the chances that at least one embryo will implant and lead to a live birth;

How many (and which ones) of the n transferred embryos will ultimately lead to live births is also uncertain.
If the transfer of n embryos can be considered as n independent (Bernoulli) trials under identical conditions; and if the overall effect of the collection of factors that influence the ultimate outcome of each single trial (the transfer of a single embryo) is captured in the parameter p, representing the probability that a particular single embryo will lead to a successful pregnancy; then observe that X, the number of live births in a delivered pregnancy following an IVF treatment cycle involving the transfer of n embryos, is a binomial random variable whose pdf is as given in Chapter 8, i.e.:

$$f(x) = \binom{n}{x} p^x (1-p)^{n-x} \qquad (11.1)$$
an expression of the probability of obtaining x live-born babies from n embryos. When x = 1, the live birth is said to result in a singleton, the most desirable outcome; a multiple birth is said to occur when x = 2 (fraternal twins), or 3 (triplets), or 4 (quadruplets), etc., up to and including n. How this postulated model matches up with real clinical data is examined shortly.
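For readers who wish to compute with Eq (11.1) directly, the following is a minimal Python sketch; the function name ivf_pmf is ours, introduced only for illustration:

```python
# A minimal sketch of Eq (11.1): the probability of x live births
# from n transferred embryos, each with SEPS parameter p.
from math import comb

def ivf_pmf(x: int, n: int, p: float) -> float:
    """Binomial pdf of Eq (11.1): P(X = x) for n embryos transferred."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# For example, the probability of twins (x = 2) from n = 3 embryos
# when p = 0.2:
print(ivf_pmf(2, 3, 0.2))  # 3 * 0.2**2 * 0.8 = 0.096
```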
The characteristics of the binomial random variable and its model have been discussed in Chapter 8 and the reader may wish to pause at this point to review these. Within the context of IVF, the parameter p has a very specific physiological interpretation: it is what is referred to in Jansen, 2003, as 'a woman's total chance for a live birth from one retrieval.' We will refer to it in the rest of this chapter as the single embryo probability of success (or SEPS) parameter. It is sometimes referred to as the 'embryo implantation potential' in the ART literature, indicative of its characteristic as a composite of both embryo and uterine properties. If this parameter is known, even approximately (see the discussion to follow about the sensitivity of the model results to the degree of accuracy to which p is determined), then the mathematical model in Eqn (11.1) allows us to carry out a wide variety of theoretical analyses regarding IVF, including outcome prediction, estimation (patient characterization), and optimization.
11.3.2 Prediction
Consider, for example, a case where the combined patient/embryo conditions are characterized by the SEPS parameter p = 0.2 (indicating a 20% chance of success for each embryo). The binomial model allows us to say the following about the transfer of n = 5 embryos, for instance:

1. Because E(X) = np for the binomial random variable X, in this particular case, E(X) = 1, implying that the expected outcome of this IVF treatment cycle is 1 live birth;
TABLE 11.1: Theoretical distribution of IVF outcomes for n = 5 embryos transferred, with p = 0.2

  x (live births)   f(x)     Patients per 1,000
  0                 0.328    328
  1                 0.410    410
  2                 0.205    205
  3                 0.051    51
  4                 0.006    6
  5                 0.000    0
2. Since the theoretical variance is σ² = np(1-p) = 0.8 (so that the standard deviation is σ = 0.89), the general implication is that there is a fair amount of variability associated with the expected outcomes in this specific treatment scenario. In fact,
3. The full probability distribution can be computed as shown in Table 11.1, indicating a 32.8% chance that the IVF treatment will not succeed in producing a child, but a somewhat higher 41.0% chance of a singleton; a 20.5% chance of twins, a 5.1% chance of triplets, and less than 1% chance of quadruplets or quintuplets. A common alternative interpretation of the indicated probability distribution is shown in the last column: in a population of 1,000 'identical' patients undergoing the same treatment, under essentially identical conditions, as a result of the transfer of 5 embryos to each patient, 328 patients will produce no live births, 410 will have singletons, 205 will have twins, 51 will have triplets, 6 will have quadruplets, and none will have quintuplets (see the computational sketch following this list).
4. From this table we see that there is more than a 99% chance that 0 ≤ x ≤ 3, with the following implication: while the expected outcome is a singleton, the actual outcome is virtually guaranteed to be anything from a complete failure to a triplet and everything in between; it is highly unlikely to observe any other outcome.
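The entries of Table 11.1, along with the mean and standard deviation in items 1 and 2 above, are easily reproduced numerically; this is a small sketch of ours, not code from the text:

```python
from math import comb

n, p = 5, 0.2
for x in range(n + 1):
    f = comb(n, x) * p**x * (1 - p)**(n - x)
    print(f"x = {x}: f(x) = {f:.4f}  (about {round(1000 * f)} per 1,000 patients)")

mean = n * p                    # E(X) = np = 1.0
sd = (n * p * (1 - p))**0.5     # sigma = sqrt(np(1-p)) = 0.894...
print(mean, sd)
```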
11.3.3 Estimation
The practical utility of the binomial model for IVF clearly depends on
knowing the lone model parameter p that characterizes the probability of a
single embryo transfer leading to a successful live birth. In the absence of
reliable technology for determining an appropriate value directly from physiological measurements, this parameter value must then be determined from
clinical data, with best results when the data sets are generated from carefully
designed experiments.
Consider, for example, the following statement taken from the WSJ article
mentioned earlier:
In 1999, based on results from over 35,000 IVF treatments, the Centers
for Disease Control and Prevention reported that between 10% and 13%
of women under 35 who had three embryos introduced got pregnant with
triplets.
This statement translates as follows: for the women in this study, n = 3 and 0.1 < P(X = 3) < 0.13. From the binomial model in Eqn (11.1), with an unknown p and n = 3, we know that

$$P(X = 3) = f(3) = p^3 \qquad (11.2)$$

and upon substituting the limiting values of 0.1 and 0.13 for the probability of obtaining triplets, we immediately obtain

$$p = [0.46, 0.51] \qquad (11.3)$$

More generally, consider a cohort of N_n patients, each of whom received n embryos, with ν_n(x) of them recording the pregnancy outcome x (x = 0, 1, ..., n). Because E(X) = np for the binomial random variable, a natural estimate of p is the average outcome per patient divided by n:

$$\hat{p} = \frac{\bar{x}}{n} \qquad (11.4)$$

$$= \frac{\sum_{x=0}^{n} x\,\nu_n(x)}{n N_n} \qquad (11.5)$$
Thus, for example, one of the entries in the data set found in Table I of Elsner, et al., 1997, indicates that 661 patients each received 3 embryos, resulting in 164 singletons, 74 twins, and 10 triplets, with no higher order births. In our notation, n = 3, N₃ = 661, ν₃(1) = 164, ν₃(2) = 74, and ν₃(3) = 10, so that the estimate of p for this cohort is given by:

$$\hat{p} = \frac{164 + 2 \times 74 + 3 \times 10}{3 \times 661} = 0.172 \qquad (11.6)$$
Note the very important assumption that all 661 patients in this cohort group have 'identical' (or at least essentially similar) characteristics. We shall have cause to revisit this data set and these implied assumptions in the next section.
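The computation in Eq (11.6) generalizes directly; here is a hedged sketch (the function name is ours, not from the text):

```python
# Estimate the SEPS parameter from cohort data, per Eqs (11.4)-(11.6):
# p-hat = (total live births) / (total embryos transferred).
def estimate_p(nu: dict, n: int, N: int) -> float:
    """nu maps outcome x to the number of patients with that outcome;
    n is embryos per patient; N is the number of patients in the cohort."""
    return sum(x * count for x, count in nu.items()) / (n * N)

# Elsner, et al., 1997, n = 3 cohort: 661 patients; 164 singletons,
# 74 twins, 10 triplets:
print(estimate_p({1: 164, 2: 74, 3: 10}, n=3, N=661))  # about 0.172
```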
11.4
Before proceeding to use the binomial model for IVF optimization, we wish first to validate the model against clinical data available in the literature. Primarily because of how they are reported, the data sets presented in the Elsner, et al., 1997, study (briefly referred to in the previous subsection) are structurally well-suited to the binomial model validation exercise. But this was not a study designed for model validation, otherwise the design would have required more control for extraneous sources of variability within each cohort group. Nevertheless, one can still put these otherwise rich data sets to the best use possible, as we now show.
11.4.1
The data sets in question are from a retrospective study of 2,173 patients on which fresh and frozen-thawed embryo transfers were performed in the authors' own clinic over a 42-month period from September 1991 to March 1995. A total number of 6,601 embryos were transferred, ranging from 1 to 6 embryos per transfer. Most importantly, the data are available for cohort groups of N_n patients receiving n = 1, 2, ..., 6 embryos, and on ν_n(x), the number of patients with pregnancy outcome x (x = 1 for singletons, 2 for twins, 3 for triplets, etc.), presented separately for each cohort group, making them structurally ideal for testing the validity of the binomial model. Table 11.2 shows the relevant data arranged appropriately for our purposes (by cohort groups according to embryos received, from 1 through 6).
For each cohort group n = 1, 2, 3, 4, 5, 6, the estimates of the probability of success are obtained from the data as p̂_1 = 0.097; p̂_2 = 0.163; p̂_3 = 0.172; p̂_4 = 0.149; p̂_5 = 0.111; p̂_6 = 0.125, for an overall probability of success for the entire study group of p̂ = 0.154. Some important points to note:
• These values are the same as the 'embryo implant' value computed by Elsner, et al.;

• Although the overall group average is 0.154, the values for each cohort group range from a low of 0.097 for those receiving a single embryo to a high of 0.172 for those receiving 3 embryos.
TABLE 11.2: Elsner, et al., 1997, clinical data: for each cohort group receiving n = 1, 2, ..., 6 embryos, the number of patients with delivered pregnancy outcome x = 0, 1, 2, 3, 4, 5, and the totals.
• As noted in the paper, the value 0.097 is significantly lower than the numbers computed for the 2-, 3-, and 4-embryo cohort groups (which also means that it is significantly lower than the overall group average of 0.154). The implication of this last point therefore is that one cannot assume a uniform value of p for the entire study involving 6,601 embryos; it also raises the question of whether even the computed p̂_i for each cohort group can be assumed to be uniform for the entire group (especially groups with large numbers of embryos involved, such as the 3- and 4-embryo cohort groups). This issue is addressed directly later.
11.4.2
On the basis of the estimated group probabilities, p̂_n, the binomial probability model for each group is obtained as in Eq (11.1):

$$f_n(x) = \binom{n}{x}\,\hat{p}_n^{\,x}\,(1-\hat{p}_n)^{n-x} \qquad (11.7)$$
providing the probability of obtaining pregnancy outcome x = 0, 1, 2, ..., 6 for each cohort group receiving n embryos. Now, given N_n, the number of patients in each cohort group (referred to as the number of 'cycles' in the original paper), we can use the model to predict ν̂_n(x), the expected number of patients receiving n embryos that eventually have x = 0, 1, 2, ..., 6 as the delivered pregnancy outcome, as follows:

$$\hat{\nu}_n(x) = f_n(x)\,N_n \qquad (11.8)$$
The result is shown in Table 11.3, with a graph comparing the model prediction to the data shown in Fig 11.1.
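The predictions of Eq (11.8) are straightforward to compute; as an illustrative sketch (our own code), for the n = 3 cohort:

```python
from math import comb

def predicted_counts(n: int, N: int, p: float):
    """nu-hat_n(x) = f_n(x) * N_n, Eq (11.8), for x = 0, 1, ..., n."""
    return [N * comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

# n = 3 cohort: N_3 = 661 patients, p-hat_3 = 0.172
print([round(v) for v in predicted_counts(3, 661, 0.172)])
# [375, 234, 49, 3] -- versus the observed 164 singletons, 74 twins
# and 10 triplets: more singletons, fewer multiples than the data show.
```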
While the model prediction shows reasonable agreement with the overall data, there are noticeable discrepancies, most notably the over-estimation of the number of singletons and the consistent underestimation of the number of multiple births, especially for the two largest cohort groups, those receiving 3 and 4 embryos. The primary source of these discrepancies is the questionable assumption of uniform p for each cohort group. Is it really realistic, for example, to expect all 832 patients in the cohort group that received 4 embryos (and the 832 × 4 total embryos transferred in this group) to have similar values of p? In fact, this question was actually (unintentionally) answered in the study (recall that the objective of the study was really not the determination of implantation potential for cohort groups).
When the data is segregated by age, coarsely, into just two sets, the 'younger' set for patients ≤ 36 years old, and the 'older' set for patients ≥ 37 years old (as was done in Tables II and III of the original paper and summarized here in Table 11.4 for convenience), the wide variability in the values for the single embryo probability of success parameter, p, is evident.
TABLE 11.3: Binomial model prediction, ν̂_n(x), of the number of patients in each cohort group with delivered pregnancy outcome x = 0, 1, ..., 5, with totals.
FIGURE 11.1: Complete Elsner data versus (base) binomial model prediction of the number of patients for each pregnancy outcome, x.
TABLE 11.4: Elsner data stratified by age ('younger' ≤ 36 years; 'older' ≥ 37 years): estimates of p for each cohort sub-group.
There are several important points to note here. First, observe the less obvious fact that for each cohort group, n = 1, 2, ..., 6, the overall p̂ estimate is naturally a weighted average of the values estimated for each sub-group ('younger' and 'older'); and as one would naturally expect, the weight in each case is the fractional contribution from each sub-group to the total number. Second, and more obvious, is how widely variable the estimates of p are across each cohort group: for example, for the group receiving n = 2 embryos, 0.087 < p̂ < 0.211, with the combined group value of 0.163 almost twice the value estimated for the 'older' sub-group. This latter observation underscores a very important point regarding the use of this particular data set for our model validation exercise: within the context of IVF, the binomial model is an individual patient model that predicts the probabilities of various pregnancy outcomes for a specific patient given her characteristic parameter, p. However, such a parameter, at least in light of currently available technology, can only be estimated from clinical data collected from many patients. Obtaining reasonable estimates therefore requires carefully designed studies involving only patients with a reasonable expectation of having similar characteristics. Unfortunately, even though comprehensive and with just the right kind of detail required for our purposes here, the Elsner data sets come from a retrospective study; it is therefore not surprising if many patients in the same cohort group do in fact have different implantation potential characteristics.
One way to account for such non-uniform within-group characteristics is, of course, to repeat the modeling exercise for each data set separately, using age-appropriate estimates of p for each cohort sub-group. The results of such an exercise are shown in Figs 11.2 and 11.3. While Fig 11.3 shows a marked improvement in the agreement between the model and the 'older' sub-group data, the similarity of the model-data fit in Fig 11.2 to that in Fig 11.1 indicates that even after such stratification by age, significant non-uniformities still exist.
There are many valid reasons to expect significant non-uniformities to persist in the 'younger' sub-group: (i) virtually all clinical studies on the effect of age on IVF outcomes (e.g., Schieve et al., 1999; Jansen, 2003; and Vahratian, et al., 2003) recognize the age group < 29 years to be different in characteristics from the 29-35 years age group; (ii) even for the 'older' sub-group, it is customary to treat the 40-44 years group differently. The data set could thus use a further stratification to improve sub-group uniformity. Unfortunately, only the broad binary 'younger'/'older' stratification is available in the Elsner et al. data set. Nevertheless, to illustrate the effect of just one more level of stratification, consider the following postulates:
1. The n = 3 cohort group of 661 patients, already stratified into the 'younger' 432 (p̂ = 0.184) and 'older' 229 (p̂ = 0.150), is further stratified as follows: the 'younger' 432 separated into 288 with p = 0.100 and 144 with p = 0.352 (maintaining the original weighted average value of p̂ = 0.184); and the 'older' 229 divided into 153 with p = 0.100 and the remaining 76 with p = 0.25 (also maintaining the same original weighted average value of p̂ = 0.150);
FIGURE 11.2: Elsner data (Younger set) versus binomial model prediction
2. The n = 4 cohort group of 832 patients, already stratified into the 'younger' 522 (p̂ = 0.160) and 'older' 310 (p̂ = 0.128), is further stratified as follows: the 'younger' 522 separated into 348 with p = 0.08 and 174 with p = 0.320; and the 'older' 310 into 207 with p = 0.08 and the remaining 103 with p = 0.224 (in each case maintaining the original respective weighted average values);

3. The n = 5 cohort group of 47 patients, with only the 'younger' 26 (p̂ = 0.131) group separated into 17 with p = 0.06 and the remaining 9 with p = 0.265 (again maintaining the original weighted average value of p̂ = 0.131).
Upon using this simple stratification of the Elsner data, the results of the stratified model compared with the data are shown first in tabular form in Table 11.5, then in Figs 11.4 and 11.5, respectively, for the stratified 'younger' and 'older' data, and in Fig 11.6 for the consolidated data. The agreement between the (stratified) model and the data is quite remarkable, especially in light of all the possible sources of deviation of the clinical data from the ideal binomial random variable characteristics.
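Computationally, the stratified prediction is just a finite mixture of binomial predictions, one per sub-group; the following sketch (ours, using postulate 1's splits for the n = 3 cohort) shows the idea:

```python
from math import comb

def f(x: int, n: int, p: float) -> float:
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Postulate 1: the n = 3 cohort of 661 patients split into four
# sub-groups, each with its own SEPS parameter p:
subgroups = [(288, 0.100), (144, 0.352), (153, 0.100), (76, 0.250)]
pred = [sum(N * f(x, 3, p) for N, p in subgroups) for x in range(4)]
print([round(v) for v in pred])  # predicted counts for x = 0, 1, 2, 3
```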
The final conclusion therefore is as follows: given appropriate parameter estimates for the clinical patient population (even very approximate estimates obtained from non-homogeneous subgroups), the binomial model matched the clinical data quite well. Of course, as indicated by the parameter estimates in Table 11.4, the value of p for the patient population in the study is not constant but is itself a random variable. This introduces an additional component to the issue of model validation.
FIGURE 11.3: Elsner data (Older set) versus binomial model prediction
TABLE 11.5: Stratified binomial model prediction versus the Elsner data: number of patients with delivered pregnancy outcome x = 0, 1, ..., 5, with totals.
The strategy of data stratification by p that we have employed here really constitutes a manual (and ad-hoc) attempt at dealing with this additional component indirectly. We have opted for this approach here primarily for the sake of simplicity. A more direct (and more advanced) approach to the data analysis will involve postulating an additional probability model for p itself, which, when combined with the individual patient binomial model, will yield a mixture distribution model (as illustrated in Section 9.1.6 of Chapter 9). In this case, the appropriate model for p is the Beta distribution; and the resulting mixture model will be the Beta-Binomial model (see Exercise 9.28 at the end of Chapter 9). Such a Beta-Binomial model analysis of the Elsner data is offered as a Project Assignment at the end of the chapter.
FIGURE 11.4: Elsner data (Younger set) versus stratified binomial model prediction
FIGURE 11.5: Elsner data (Older set) versus stratified binomial model prediction
FIGURE 11.6: Complete Elsner data versus stratified binomial model prediction
Finally, it is important to note that none of this invalidates the binomial model; on the contrary, it reinforces the fact that the binomial model is a single patient model, so that for the mixed population involved in the Elsner et al. clinical study, the value of p is better modeled with a pdf of its own, to capture explicitly how p itself is distributed in the population.

We will now proceed to use the binomial model for analysis and optimization.
11.5
the second. From the binomial model, these probabilities are given explicitly as:

$$P_0 = (1-p)^n \qquad (11.9)$$

$$P_1 = np(1-p)^{n-1} \qquad (11.10)$$

$$P_{MB} = P(X > 1) = 1 - (P_0 + P_1) \qquad (11.11)$$

Note that these three probabilities are constrained to satisfy the expression:

$$1 = P_0 + P_1 + P_{MB} \qquad (11.12)$$
with the all-important implication that any one of these probabilities increases
(or decreases) at the expense of the others. Still, each probability varies with
n in a distinctive manner that can be exploited for IVF optimization, as we
now show.
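A direct numerical rendering of Eqs (11.9)-(11.11), as a sketch with our own function name:

```python
def outcome_probs(n: int, p: float):
    """P0, P1, PMB of Eqs (11.9)-(11.11); by Eq (11.12) they sum to 1."""
    P0 = (1 - p)**n                  # no live birth
    P1 = n * p * (1 - p)**(n - 1)    # singleton
    return P0, P1, 1.0 - (P0 + P1)   # multiple birth

# e.g., for p = 0.3: P1 peaks at n = 3, while P0 falls and PMB rises
for n in range(1, 6):
    print(n, [round(v, 3) for v in outcome_probs(n, 0.3)])
```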
11.5.1 Optimization
In the most general sense, the optimum number of embryos to transfer in any IVF cycle is that number n which simultaneously minimizes P0, maximizes P1, and also minimizes PMB. From the model equations, however, observe that (a) P0 is a monotonically decreasing function of n, with no minimum for finite n; (b) although not as obvious, PMB has no minimum because it is a monotonically increasing function of n; but fortunately (c) P1 does in fact have a maximum. However, the most important characteristic of these probabilities is the following: by virtue of the constraint in Eq (11.12), maximizing P1 also simultaneously minimizes the combined sum of the undesirable probabilities (P0 + PMB)!

We are therefore faced with the fortunate circumstance that the IVF optimization problem can be stated mathematically simply as:

$$n^* = \arg\max_n \; np(1-p)^{n-1} \qquad (11.13)$$

and that the resulting solution, the optimum number of embryos to transfer which maximizes the probability of obtaining a singleton, also simultaneously minimizes the combined probability of undesirable side effects.
The closed-form solution to this problem is obtained via the usual methods of differential calculus as follows: since whatever maximizes

$$f_n(1) = np(1-p)^{n-1} \qquad (11.14)$$

also maximizes its logarithm,

$$\ln f_n(1) = \ln n + \ln p + (n-1)\ln(1-p) \qquad (11.15)$$

differentiating Eq (11.15) with respect to n and setting the result to zero yields the optimum:

$$n^* = \frac{1}{\ln\left(\frac{1}{1-p}\right)} \qquad (11.16)$$

FIGURE 11.7: The optimum number of embryos to transfer, n* (and its integer-rounded value, n*_quant), as a function of the SEPS parameter, p
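Numerically, Eq (11.16) and its integer rounding amount to a one-line function; a brief sketch:

```python
from math import log

def n_star(p: float) -> float:
    """Optimum number of embryos, Eq (11.16): n* = -1/ln(1-p)."""
    return -1.0 / log(1.0 - p)

for p in (0.5, 0.3, 0.2, 0.18, 0.15):
    print(p, round(n_star(p), 2), max(1, round(n_star(p))))
# 0.5 -> ~1.44 (1); 0.3 -> ~2.8 (3); 0.2 -> ~4.48 (4);
# 0.18 -> ~5.04 (5); 0.15 -> ~6.15 (6)
```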
11.5.2 Model-based Analysis
The general binomial pdf in Eq (11.1), or the more specific expressions derived from it for the probabilities of direct interest to IVF treatment, shown in Eqs (11.9), (11.10) and (11.11), provide some insight into the probabilistic characteristics of IVF treatment. For the most desirable pregnancy outcome, x = 1, the singleton live birth, Fig 11.8 shows the complete surface plot of

$$f_n(1) = np(1-p)^{n-1} \qquad (11.17)$$

as a function of n and p. Note the general nature of the surface but in particular the distinctive ridge formed by the maxima of this function. The following are some important characteristics of IVF this figure reveals:
As indicated by the lines sweeping from left to right in the figure, for any given n embryos transferred, there is a corresponding patient SEPS parameter p for which the probability of obtaining a singleton is maximized. Furthermore, as n increases (from back to front), the appropriate maximal p is seen to decrease, indicating that transferring small numbers of embryos works best only for patients with high probabilities of success, while transferring large numbers of embryos works best for patients for whom the probability of success is relatively low. It also shows that for those patients with relatively high probabilities of success (for example, young patients under 35 years of age), transferring large numbers of embryos is counterproductive: the probability of obtaining singletons in these cases is remarkably low across the board (see the flat surface in the bottom right-hand corner of the figure) because the conditions overwhelmingly favor multiple births over singletons. All this, of course, is in perfect keeping with current thought and practice; but what is provided by Eq (11.17) and Fig 11.8 is quantitative.
The complementary observation from the lines sweeping from back to front
is that for a given single embryo probability of success, p, there is a corresponding number of embryos to transfer that will maximize the probability of
obtaining a singleton. Also, as p increases (from left to right), this optimum
number is seen to decrease. This, of course, is what was shown earlier quite
precisely in Fig 11.7.
Finally, when the optimum number of embryos, n*, is transferred for each appropriate value of the SEPS parameter, p (the mathematical equivalent of walking along the ridge of the mountain-like surface in the figure), the corresponding theoretical maximum probability of obtaining a singleton increases with p in the manner shown explicitly in Fig 11.9. The indicated 'elbow' discontinuity occurring at p = 0.5 is due to the fact that for p < 0.5, n* > 1 and f_{n*}(1) involves integer powers of p; but for p ≥ 0.5, n* = 1, so that f_{n*}(1) = p, a straight line with slope 1.
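This kink is easy to reproduce numerically with the integer-rounded n*; a minimal sketch (function name ours):

```python
from math import log

def f_star(p: float) -> float:
    """Maximum singleton probability using the integer-rounded n*."""
    n = max(1, round(-1.0 / log(1.0 - p)))
    return n * p * (1 - p)**(n - 1)

# curved branch below p = 0.5; straight line f*(1) = p at and above it
print(f_star(0.3), f_star(0.5), f_star(0.6))  # 0.441, 0.5, 0.6
```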
For the sake of completeness, Figures 11.10 and 11.11 show the surface plots, respectively, for

$$f_n(0) = (1-p)^n \qquad (11.18)$$

and

$$f_n(m) = 1 - (1-p)^n - np(1-p)^{n-1} \qquad (11.19)$$
FIGURE 11.8: Surface plot of the probability of a singleton as a function of p and the
number of embryos transferred, n
FIGURE 11.9: The theoretical maximum probability of obtaining a singleton, f*(1), as a function of p
FIGURE 11.10: Surface plot of the probability of no live birth as a function of p and
the number of embryos transferred, n
FIGURE 11.11: Surface plot of the probability of multiple births as a function of p and the number of embryos transferred, n
11.5.3
We now return to Fig 11.7 and note that it allows us to categorize IVF patients on the basis of p (and, by extension, the optimum prescribed number of embryos to transfer) as follows:

1. Good prognosis patients: p ≥ 0.5.
For this category of patients, n* = 1, with the probability of obtaining a singleton, f*(1) ≥ 0.5.

2. Medium prognosis patients: 0.25 ≤ p < 0.5.
For this category of patients, n* = 2 or 3, with the probability of obtaining a singleton, 0.42 < f*(1) < 0.5.

3. Poor prognosis patients: 0.15 ≤ p < 0.25.
For this category of patients, 4 ≤ n* ≤ 6, with the probability of obtaining a singleton, 0.40 < f*(1) < 0.42.

4. Exceptionally poor prognosis patients: p < 0.15.
For this category of patients, n* > 6, but even then the probability of obtaining a singleton, f*(1) ≤ 0.40.
Let us now use the probability model to examine, for each patient category, how the number of embryos transferred influences the potential treatment outcomes.

Beginning with a representative value of p = 0.5 for the 'good prognosis' category (e.g. women under age 35, as was the case in the study quoted in the WSJ article that we used to illustrate the estimation of p in Eqs (11.2) and (11.3)), Fig 11.12 shows a plot of the probabilities of the outcomes of interest, P0, P1, and PMB, as a function of n.
A few points are worth noting in this figure: first, P1, the probability of obtaining a singleton, is maximized for n = 1, as noted previously; however, this figure also shows that the same probability, P1 = 0.5, is obtained for n = 2. Why then is n* = 1 recommended and not n* = 2? Because with n = 1, there is absolutely no risk whatsoever of multiple births, whereas with n = 2, the probability of multiple births (only twins in this case) is no longer zero, but a hard-to-ignore 0.25.
Note also that transferring more than 1 or 2 embryos for this class of patients actually results in a reduction in the probability of a singleton outcome, at the expense of rapidly increasing the probability of multiple births. (Recall the constraint relationship between the probabilities of pregnancy outcomes shown in Eq (11.12).) When 5 embryos are transferred, there is an astonishing 81% chance of multiple births and, specifically, a 1 in 32 chance of quintuplets (details not shown, but easily computed from Eq (11.1)).
FIGURE 11.12: IVF treatment outcome probabilities for good prognosis patients
with p = 0.5, as a function of n, the number of embryos transferred
Thus, going back to the WSJ story of Theresa Anderson with which we opened this chapter, it is now clear that perhaps what her doctor meant to say was that there was a one chance in 30 that all five embryos would 'take,' rather than that only one would take. With the binomial probability model, one could have predicted the distinct possibility of this particular patient delivering quintuplets, because she belongs to the category of patients with good prognosis, for which p ≥ 0.5.
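The 'details not shown' above are immediate from Eqs (11.1) and (11.9)-(11.11); as a quick check:

```python
p, n = 0.5, 5
print(p**n)                        # P(quintuplets) = 1/32, the 'one in 30'
P0 = (1 - p)**n
P1 = n * p * (1 - p)**(n - 1)
print(1 - P0 - P1)                 # P(multiple birth) = 0.8125
```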
Next, consider a representative value of p = 0.3 for 'medium prognosis' patients, which yields the plots shown in Fig 11.13 for the probabilities P0, P1, and PMB as a function of n. Observe that, as noted previously, the optimum n* corresponding to this specific value of p = 0.3 is clearly 3. Transferring fewer embryos is characterized by much higher values for P0, the probability of producing no live birth; transferring 4 embryos increases the probability of multiple births more than is offset by the simultaneous reduction in P0; and with 5 embryos, the probability of multiple births dominates all other outcome probabilities.
Finally, when a representative value of p = 0.18 is selected for 'poor prognosis' patients, the resulting outcome probability plots are shown in Fig 11.14. First note that for this value of p, the optimum n* is 5. Recall that the Combelles et al., 2005, study concluded from evidence in their clinical data that n = 5 is optimum for women more than 40 years of age. In light of our theoretical analysis, the implication is that for the class of patients referred to in this study, p = 0.18 is a reasonable characteristic parameter. As an independent corroboration of the model-based analysis shown in this figure, consider the following result from the Schieve, et al., 1999, study which states:
FIGURE 11.13: IVF treatment outcome probabilities for medium prognosis patients
with p = 0.3, as a function of n, the number of embryos transferred
Among women 40 to 44 years of age, the multiple-birth rate was
less than 25% even if 5 embryos were transferred.
If the patients in this study can reasonably be expected to have characteristics similar to those in the Combelles study, then from our preceding theoretical analysis, p = 0.18 is a reasonable estimate for them also. From our probability model, the specific value for PMB when n = 5 for this class of patients is therefore predicted to be 0.22 (dotted line and diamond symbol for n = 5 in Fig 11.14), which agrees precisely with the above-noted result of the Schieve et al. study.
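The 0.22 figure can be verified directly from Eqs (11.9)-(11.11):

```python
p, n = 0.18, 5
PMB = 1 - (1 - p)**n - n * p * (1 - p)**(n - 1)
print(round(PMB, 2))  # 0.22, matching the Schieve, et al., result
```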
11.6 Sensitivity Analysis

11.6.1 General Discussion
Clearly, the heart of the theoretical model-based analysis presented thus far is the parameter p. This, of course, is in agreement with clinical practice, where embryo transfer policies are based on what the Canadian guidelines refer to as 'patient-specific, embryo-specific, and cycle-specific determinants of implantation and live birth.' From such a model-based perspective, this parameter is therefore the single most important parameter in IVF treatment: it determines the optimum number of embryos to transfer, and, in conjunction with the actual number of embryos transferred (optimum or not), determines the various possible pregnancy outcomes and the chances of each one occurring.
FIGURE 11.14: IVF treatment outcome probabilities for poor prognosis patients
with p = 0.18, as a function of n, the number of embryos transferred
11.6.2
The general question of interest here is: how sensitive are the model and its analysis results to errors in p? Or, in more practical terms: how good does the estimate of p have to be so that the prescription of n*, and the theoretical analysis following from it, can be considered reliable? Such questions are answered quantitatively with the relative sensitivity function, defined in this case as:

$$S_r = \frac{\partial \ln n^*}{\partial \ln p} \qquad (11.20)$$

which, from Eq (11.16), is obtained as:

$$S_r = \frac{p}{n^*}\,\frac{dn^*}{dp} \qquad (11.21)$$

$$= \frac{p}{(1-p)\ln(1-p)} \qquad (11.22)$$
a plot of which is shown in Fig 11.15. First, note in general that Sr is always negative and greater than 1 in absolute value. This implies that (i) over-estimating p always translates to under-estimating n*, and vice versa (as we already alluded to in the preceding general discussion); and (ii) the relative over- or under-estimation error in p is always magnified in the corresponding relative error in n*. Quantitatively, the specific information contained in this figure is best illustrated by considering what it indicates about the region 0 < p < 0.25, identified in our previous discussion for 'poor prognosis' patients as critical in terms of sensitivity to errors in p. Observe that in this region, -1.16 < Sr < -1.0, with the implication that a 10% error in the estimate of p translates to no more than an 11.6% error in the recommended n*. Keep in mind that in practice, fractional errors in n* are inconsequential until they become large enough to be rounded up (or down) to the nearest integer. Thus, an 11.6% error on a nominal value of n* = 10 (or for that matter any error less than 15%) translates to only 1 embryo. Thus, even though from our previous discussion we know that good estimates of p are required for 'poor prognosis' patients, the relative sensitivity function (Eq (11.22) and Fig 11.15) indicates that the model can tolerate as much as 10% error in the estimate of p with little or no consequence in the final results.
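Eq (11.22) and the error-propagation argument can be checked directly; a sketch:

```python
from math import log

def S_r(p: float) -> float:
    """Relative sensitivity, Eq (11.22); always negative, |S_r| >= 1."""
    return p / ((1 - p) * log(1 - p))

def n_star(p: float) -> float:
    return -1.0 / log(1.0 - p)

print(round(S_r(0.25), 3))          # -1.159: the worst case on 0 < p <= 0.25
print(n_star(0.25), n_star(0.275))  # ~3.48 vs ~3.11: a +10% error in p
                                    # moves n* by roughly S_r * 10% ~ -11%
```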
11.7
11.7.1
FIGURE 11.15: Relative sensitivity, Sr, of n* to errors in p, as a function of p
expression shown in Eq (11.16), and plotted in Fig 11.7, states this optimum number of embryos, n*, explicitly as a function of the parameter p. Also, we have shown that these results are robust to unavoidable errors in estimating this parameter in practice.
11.7.2
Here, then, is a series of key conclusions from the discussion in this chapter. First, the following is a list of the characteristics of the binomial model of IVF treatment:

1. It provides a valid mathematical representation of reality;

2. It depends on a single key parameter, p, whose physiological meaning is clear (the single embryo probability of success, or the embryo implantation potential); and

3. It is used to obtain an explicit expression for the optimum number of embryos to transfer in an IVF cycle, and this result is robust to uncertainty in the single model parameter.
Next, we note that the binomial model-based prescription of the optimum number of embryos to transfer agrees perfectly with earlier heuristics and guidelines developed on the basis of specific empirical studies. While the precise number can be obtained analytically from Eq (11.16), the practical implications of the results may be summarized as follows:

1. For 'good prognosis' patients, p ≥ 0.5, transfer only 1 embryo;

2. For 'medium prognosis' patients, 0.25 ≤ p < 0.5, transfer 2 embryos for those with p ≥ 0.35 and 3 for those with p < 0.35;

3. For 'poor prognosis' patients, p < 0.25, transfer n > 3 embryos, with the specific value in each case depending on the value of p, as determined by Eq (11.16) rounded to the nearest integer: for example, n* = 4 for p = 0.2; n* = 5 for p = 0.18; n* = 6 for p = 0.15, etc.
These results agree with, but also generalize, the results of previous clinical studies, some of which were reviewed in earlier sections. Thus, for example, the primary result in the Combelles et al., 2005, study, that n = 5 is optimum for patients older than 40 years, strictly speaking can only be considered valid for the specific patients in the study used for the analysis. The prescription of the binomial model, on the other hand, is general, and not restricted to any particular data set; it asserts that n* = 5 is optimal for all patients for which p = 0.18, whether they are 40 years old, younger, or older. This leads to the final set of conclusions, having to do with the perspective on previous studies and IVF treatment guidelines provided by the binomial model-based analyses.
In light of the analyses and discussions in this chapter, the most important implication of the demonstrated appropriateness of the binomial model for IVF treatment is that treatment guidelines should be based on the value of the parameter p for each patient, not so much on age. (See recommendation 1 of the Canadian guidelines summarized earlier.) From this perspective, age in the previous studies is seen as a convenient, but not always perfect, surrogate for this parameter. It is possible, for example, for a younger person to have a lower SEPS parameter p, for whatever reason, uterine or embryo-related. Conversely, an older patient treated with eggs from a young donor will more than likely have a higher-than-expected SEPS parameter value. In all cases, no conflicts arise if the transfer policy is based on the best estimate of p rather than age: p is the more direct determinant of embryo implantation rate; age is an indirect, and not necessarily foolproof, indicator.
On the basis of this section's model-based discussion, therefore, all the previous studies and guidelines may be consolidated as follows:

1. For each patient, obtain the best estimate of the SEPS parameter, p;

2. Use p to determine n* either from the analytical expression in Eq (11.16) rounded to the nearest integer, or else from Fig 11.7;

3. If desired, Eqns (11.9), (11.10) and (11.11) may then be used to analyze outcome probabilities given the choice of the number of embryos to transfer (see, for example, Fig 11.13).
Finally, it should not be lost on the reader just how much the probability modeling approach discussed in this chapter has facilitated the analysis and optimization of such a complex and important problem as that posed by IVF outcome optimization. Even with the unavoidable idealization implied in the SEPS parameter, p, this binomial model parameter provides valuable insight into the fundamental characteristics of the IVF outcome problem. It also allows a consolidation of previous qualitative results and guidelines into the coherent and quantitative three-point guideline enumerated above.
References
1. Austin, C. M., S. P. Stewart, J. M. Goldfarb, et al., 1996. Limiting multiple pregnancies in in vitro fertilization/embryo transfer (IVF-ET) cycles, J. Assisted Reprod. and Genetics, 13(7), 540-545.

2. Bolton, V. N., S. M. Hawes, C. T. Taylor and J. H. Parsons, 1989. J. In Vitro Fert. Embryo Transf., 6(1), 30-35.

3. Combelles, C. M. H., B. Orasanu, E. S. Ginsburg, and C. Racowsky, 2005. Optimum number of embryos to transfer in women more than 40 years of age
PROJECT ASSIGNMENT
Beta-Binomial Model for the Elsner Data.
As noted at the end of Section 11.4, to deal appropriately with the mixed
population involved in the Elsner clinical study, a theoretical probability
model should be used for p; this must then be combined with the binomial
model to yield a mixture model. When the Beta B(α, β) model is chosen for p, the resulting mixture model is known as a Beta-Binomial model.
As a project assignment, develop a Beta-Binomial model for the
Elsner data in Table 11.2 and compare the model prediction with
the data and with the binomial model prediction presented in this
chapter.
You may approach this assignment as follows:
The Beta-Binomial mixture distribution arises from a Binomial Bi(n, p) random variable, X, whose parameter p, rather than being constant, has a Beta distribution, i.e., it consists of a conditional distribution,

$$f(x|p) = \binom{n}{x} p^x (1-p)^{n-x} \qquad (11.23)$$

in conjunction with the marginal distribution for p,

$$f(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1}(1-p)^{\beta-1}; \quad 0 < p < 1;\ \alpha > 0;\ \beta > 0 \qquad (11.24)$$
The mean and variance of the resulting Beta-Binomial random variable are:

$$E(X) = n\,E(p) \qquad (11.25)$$

$$= n\frac{\alpha}{\alpha+\beta} \qquad (11.26)$$

$$Var(X) = \frac{n\alpha\beta(\alpha+\beta+n)}{(\alpha+\beta)^2(\alpha+\beta+1)} \qquad (11.27)$$
4. Plot the theoretical pdf f(p) using the values you determined for the parameters α and β. Discuss what this implies about how the SEPS parameter is distributed in the population involved in the Elsner clinical study.

5. Compute probabilities for x = 0, 1, ..., 6 using the Beta-Binomial model and compare with the data.

Write a report describing your model development and data analysis, including comments on the fit of this mixture model to the data versus the binomial model fit that was discussed in this chapter.
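For the numerical part of the assignment, the Beta-Binomial pmf (obtained by integrating the product of Eqs (11.23) and (11.24) over p) can be computed stably with log-gamma functions; a sketch, with our own function names:

```python
from math import comb, lgamma, exp

def log_beta(a: float, b: float) -> float:
    """Log of the Beta function, B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binom_pmf(x: int, n: int, alpha: float, beta: float) -> float:
    """P(X = x) = C(n, x) * B(x + alpha, n - x + beta) / B(alpha, beta)."""
    return comb(n, x) * exp(log_beta(x + alpha, n - x + beta)
                            - log_beta(alpha, beta))

# sanity check: the probabilities sum to 1 (n = 3; alpha, beta arbitrary)
print(sum(beta_binom_pmf(x, 3, 2.0, 8.0) for x in range(4)))  # ~1.0
```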
Part IV
Statistics
The days of our years are threescore years and ten; and if by reason of strength they be fourscore years, yet is their strength labor and sorrow; for it is soon cut off, and we fly away. ... So teach us to number our days that we may apply our hearts unto wisdom.

Moses (c. 1450 BC), Psalms 90:10,12, KJV
Chapter 12
Introduction to Statistics
The uncertainty intrinsic to individual observations of randomly varying phenomena, we now know, need not render mathematical analysis impossible. The discussions in Parts II and III have shown how to carry out such analysis, so long as one is willing to characterize the more stable complete ensemble, and not the capricious individual observations. The key has been identified as the ensemble characterization, via the probability distribution function (pdf), f(x), which allows one to carry out probabilistic analysis about the occurrence of various observations of the random variable X.
In practice, however, f(x), the theoretical ensemble model for the random phenomenon in question, is never fully available: we may know its form, but the parameters are unknown and must be determined for each specific problem. Only a finite collection of individual observations $\{x_i\}_{i=1}^{n}$ is available. But any finite set of observations will, like individual observations, be subject to the vagaries of intrinsic variability. Thus, in any and all analysis carried out on this basis, uncertainty is unavoidable. Given this practical reality, how then does one carry out systematic analysis of random phenomena, which requires the full f(x), on the basis of finite data? How does one fully characterize the entire ensemble from only a finite collection of observations? Clearly, $\{x_i\}_{i=1}^{n}$ is related to, and contains information about, the full f(x); but how are they related, and how can this information be extracted and exploited? What kinds of analyses are possible with finite observations, and how does one cope with the unavoidable uncertainty?
Statistics provides rigorous answers to such questions as these by using the very same probability machinery of Parts II and III to deal generally with the theoretical and practical issues associated with analyzing randomly varying phenomena on the basis of finite data. Our study begins in this chapter with an introduction of Statistics first as a conceptual framework, complementary to, and dependent on, Probability. We then provide a brief overview of the components of this Statistical Analysis framework, as a prelude to the detailed study of each of these components in the remaining chapters of Part IV.
12.1
12.1.1
2. $\{x_i\}_{i=1}^{n}$: n individual observations; one set out of many other possible realizations of the random variable.
This is commonly referred to as the data; it is the only entity available in practice (from experiments). In the illustrative examples of Chapter 1, n = 50 each for process A and process B; n = 60 for the glass sheets.

3. f(x): aggregate (or ensemble) description of the random variable; the probability distribution function.
This is the theoretical model of how the probability of obtaining various results is distributed over the entire range of all possible values observable for X. It was discussed and characterized generically in Chapter 4, before specific forms were derived for various random variables of interest in Chapters 8 and 9. There we saw that it consists of a functional form, f(x), and characteristic parameters; it is therefore more completely represented as f(x|θ), which literally reads 'f(x) given θ,' where θ is the vector of characteristic parameters. In the first illustrative example of Chapter 1, f(x) is the Gaussian distribution, with parameters θ = (μ, σ²); in the second, it is a Poisson distribution with one characteristic parameter, λ.
Probabilistic random phenomena analysis is based entirely on f(x). This is what allows us to abandon the impossible task of predicting an intrinsically unpredictable entity in favor of the more mathematically tractable task of computing the probabilities of observing the randomly varying outcomes. Until now, our discussion about probability and the pdf has been based on the availability of the complete f(x), i.e., f(x|θ) with θ known. This allowed us to focus on the first task: computing probabilities and carrying out analysis, given any functional form f(x) with values of the accompanying characteristic parameters, θ, assumed known. We have not been particularly concerned with such practical issues as where either the functional form, or the specific characteristic parameter values, come from. With the first task complete, we are now in a position to ask important practical questions: in actual practice, what is really available about any random variable of interest? And how does one go about obtaining the complete f(x) required for random phenomena analysis?

It turns out that for any specific randomly varying phenomenon of interest, the theoretical description, f(x), is never completely available, usually because the characteristic parameters associated with the particular random variable in question are unknown; only finite data in the form of a set of observations $\{x_i\}_{i=1}^{n}$ (n < ∞) is available in practice. The immediate implication is that to apply the theory of random phenomena analysis successfully to practical problems, we must now confront the practical matter of how the complete f(x) is to be determined from finite data, for any particular random variable, X, of interest. The problem at hand is thus one of analyzing randomly varying phenomena on the basis of finite data sets; this is the domain of Statistics.
With Probability, f (x) the functional form along with the parameters is
The ensemble description is the binomial pdf:

$$f(x|p) = \binom{3}{x} p^x (1-p)^{3-x}; \quad x = 0, 1, 2, 3 \qquad (12.1)$$
12.1.2
$$S_2 = \{2, 1, 1, 0, 3, 2, 2, 1, 2, 1\} \qquad (12.4)$$
This is another sample from the same population; as a result of inherent variability, we observe that S2 is different from S1. Nevertheless, this new sample also contains information about the unknown characteristic parameter, p.
Next consider another coin, this time one characterized by p = 0.8; and suppose that after performing the three-coin toss experiment, say, n = 12 times, we obtain:

$$S_3 = \{3, 2, 2, 3, 1, 3, 2, 3, 3, 3, 2, 2\} \qquad (12.5)$$
As before, this set of results is considered to be just one of an infinite number of other samples that could potentially be drawn from the population characterized by p = 0.8; and, as before, it also contains information about the value of the unknown population parameter.
We may now note the following facts:

1. With probability, the random variable space for this example is finite, specifiable a priori, and remains as given in Eq (12.2) whether p = 0.5, or 0.8, or any other admissible value.

2. Not so with the population: it is infinite, and its elements depend on the specific value of the characteristic population parameter. Sample S3 above, for instance, indicates that when p = 0.8, the population of all possible observations from the three-coin toss experiment will very rarely contain the number 1. (If the probability of observing a tail in a single toss is this high, the number of tails observed in three tosses will very likely consistently exceed 1.) This is very different from what is indicated by S2 and S1, being samples from a population of results observable from tossing a so-called 'fair' coin with no particular preference for tails over heads.

3. Information about the true, but unknown, value of the characteristic parameter, p, associated with each coin's population is contained in each finite data set (see the simulation sketch below).
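These facts are easy to experience by simulation; the following sketch is our own illustration, with our own names:

```python
import random

def draw_sample(p: float, size: int):
    """Each observation: number of tails in three tosses of a coin with
    tail-probability p -- a draw from the binomial pdf of Eq (12.1)."""
    return [sum(random.random() < p for _ in range(3)) for _ in range(size)]

random.seed(1)
print(draw_sample(0.5, 10))  # a sample like S1 or S2 (fair coin)
print(draw_sample(0.8, 12))  # a sample like S3: the value 1 is rare
```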
Let us now translate this illustration to a more practical problem. Consider an in-vitro fertilization (IVF) treatment cycle involving a 35-year-old patient and the transfer of 3 embryos. In this case, the random variable, X, is the resulting number of live births; and, assuming that each embryo leads either to a single live birth or not (i.e., no identical twins from the same egg), the possible outcomes are 0, 1, 2, or 3, just as in the three-coin toss illustration, with the random variable space, V_X, as indicated in Eq (12.2).
Recall from Chapter 11 that this X is also a binomial random variable, implying that the pdf in this case is also as given in Eq (12.1).

Now, suppose that this particular IVF treatment resulted in fraternal twins, i.e., x = 2 (two individual embryos developed and were successfully delivered). This value is considered to be a single sample from a population that can be understood in one of two ways: (i) from a so-called 'frequentist' perspective, the population in this case is the set of all actual IVF treatment outcomes one would obtain were it possible to repeat this three-embryo transfer treatment an infinite number of times on the same patient; (ii) the population could also be conceived equivalently as the collection of the outcomes of the same three-embryo IVF treatment carried out on an infinite number of identically characterized patients. In this regard, the 10-observation sample S2 would result from sampling nine more of such patients treated identically, whose pregnancy outcomes are 1, 1, 0, 3, 2, 2, 1, 2, 1, in addition to the already noted outcome x = 2 from the first patient.
With this more practical problem, as with the coin-toss experiments, the pdf is known, but the parameter p is unknown; data is available, but in the form of finite-sized samples, whereas the full ensemble characterization we seek is of the entire population. We are left with no other choice but to use the samples, even though finite in size, to characterize the population. In these illustrations, this amounts to determining, from the sample data, a reasonable estimate of the true but unknown value of the parameter, p, for the specific problem at hand.
Observe therefore that in solving practical problems, the population the
full observational manifestation of the random variable, X is that ideal,
conceptual entity one wishes to characterize. The objective of random phenomena analysis is to characterize it completely with the pdf f (x|), where
represents the parameters characteristic of the specic population in question.
However, because it is impossible to realize the population in its entirety via
experimentation, one must therefore settle for characterizing it by drawing
statistical inference from a nite sample subset. Clearly, the success of such
an endeavor depends on the sample being representative of the population.
Statistics therefore involves not only systematic analysis of (necessarily nite)
data, but also systematic data collection in such a way that the sample truly
reects the population, thereby ensuring that the sought-after information
will be contained in the sample.
12.1.3
Probability, Statistics and Design of Experiments

FIGURE 12.1: Relating the tools of Probability, Statistics and Design of Experiments to the concepts of Population and Sample
In a practical problem, however, only a finite set of observations, {xi}, i = 1, 2, . . . , n, is available; the specific characterizing parameter vector, θ, is unknown and must first be determined before probabilistic analysis can be performed. By considering the available finite set of observations as a sample from the (idealized) complete observational characterization of the random variable X, i.e. the population, the tools of Statistics make it possible to characterize the population fully (determine the functional form and estimate the unknown parameter θ in the pdf f(x|θ)) from information contained in the sample.
Finally, because the population is to be characterized on the basis of the finite sample, Design of Experiments provides the tools for ensuring that the sample is indeed representative of the population, so that the information required to characterize the population adequately is contained in the sample.
12.1.4
Statistical Analysis
Descriptive Statistics
In descriptive statistics, the primary concern is the presentation, organization and summarization of sample data, with emphasis on the extraction of information contained in a specific sample. Because the issue of generalization from sample to population does not arise, no explicit consideration is given to whether or not the sample is representative of the population. There are two primary aspects:

1. Graphical: The use of various graphical means of organizing, presenting and summarizing the data;

2. Numerical: The use of numerical measures and characteristic values to summarize data.

Such an approach to data analysis is often referred to as Exploratory Data Analysis (EDA).
Inductive Statistics
Also known variously as Statistical Inference or Inferential Statistics, it is primarily concerned with drawing inference about the population from sample data. As such, the focus is on generalization from the current data set (sample) to the larger population. The two primary aspects of inductive statistics are:

1. Estimation; and

2. Hypothesis testing.
12.2
Variable and Data Types

12.3
Graphical Methods of Descriptive Statistics

12.3.1
Bar Charts and Pie Charts
FIGURE 12.2 (caption lost): Bar chart of welding injuries by type of injury (Eye, Hand, Arm, Back, Other; y-axis: Total Number)

FIGURE 12.3: Bar chart of welding injuries arranged in decreasing order of number of injuries
FIGURE 12.4 (caption lost): Pareto chart of the welding injuries, combining the frequency bar chart (in decreasing order) with the cumulative percentage line, summarized in the following table:

Type of Injury   Total Number   Percent   Cum %
Eye              242            40.3      40.3
Hand             130            21.7      62.0
Back             117            19.5      81.5
Arm              64             10.7      92.2
Other            47             7.8       100.0
FIGURE 12.5 (caption lost): Pie chart of the welding injuries: Eye, 40.3%; Hand, 21.7%; Back, 19.5%; Arm, 10.7%; Other, 7.8%
TABLE 12.2: Frozen ready meals sold in France in 2002

Type of Food            Percent
French Regional Meals   29.5
Cooked Fish             25.0
Pasta Based Meals       16.6
European Meals          9.4
Side Dishes             6.8
Cooked Seafood          6.3
Exotic                  4.7
Cooked Snails           1.7
FIGURE 12.6: Bar chart of frozen ready meals sold in France in 2002
Solution:
The bar chart is shown in Fig 12.6, the pie chart in Fig 12.7. Keeping in mind that they are primarily used for visual communication of numerical facts, here are the most salient aspects of these charts: (i) It is not very easy to see from the pie chart wedges (without the added numbers) which category, French Regional Meals or Cooked Fish, accounts for the larger proportion of the overall sales, since both wedges look about the same in size; with the bar chart, there is no question, even if the bars in question were not situated side-by-side. (ii) However, if one is familiar with reading pie charts, it is far easier to see that the Cooked Fish category accounts for precisely a quarter of the total sales (the right angle subtending the Cooked Fish wedge is the key visual cue); this fact is not at all obvious from the bar chart, which is much less effective at conveying "relative-to-the-whole" information. (iii) French Regional Meals sold approximately 20 times more than Cooked Snails; even with the attached numbers, this fact is much easier to appreciate in the bar chart than in the pie chart.

Thus, observe that while the pie chart excels at conveying "relative-to-the-whole" information (especially if the relative proportions in question are 25%, 50%, or 75%, entities whose angular representations are easily recognizable by the unaided eye), the bar chart is weak in this regard. Conversely, the bar chart conveys "relative-to-one-another" information far better than the pie chart.
FIGURE 12.7: Pie chart of frozen ready meals sold in France in 2002
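Charts like Figs 12.6 and 12.7 are easy to generate with any modern statistical software. Purely as an illustration (the choice of Python and matplotlib here is an assumption of convenience, not the book's own tooling), a minimal sketch that produces both charts from the Table 12.2 percentages is:

import matplotlib.pyplot as plt

# Table 12.2: frozen ready meals sold in France in 2002
foods = ["French Regional Meals", "Cooked Fish", "Pasta Based Meals",
         "European Meals", "Side Dishes", "Cooked Seafood",
         "Exotic", "Cooked Snails"]
percents = [29.5, 25.0, 16.6, 9.4, 6.8, 6.3, 4.7, 1.7]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.bar(foods, percents)                           # "relative-to-one-another" view
ax1.set_ylabel("Percent")
ax1.tick_params(axis="x", rotation=75)
ax2.pie(percents, labels=foods, autopct="%.1f%%")  # "relative-to-the-whole" view
plt.tight_layout()
plt.show()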
12.3.2
Frequency Distributions
As previewed in Chapter 1, even quantitative data can be quite uninformative if presented in raw form, as numbers in a table. One of the first steps in making sense of the information contained in raw quantitative data is to rearrange the data, dividing them into smaller groups, or classes (also known as "bins"), and attaching to each group a number representing how many of the raw data belong to that group (i.e. how frequently members of that group occur in the raw data set). The result is a frequency distribution representation of the data. When the frequency associated with each group i is normalized by the total number of data points, n, in the sample set, we obtain the relative frequency, fi, for each group. For example, a specific reorganization of the yield data, YA, presented in Chapter 1 gives rise to the frequency distribution shown in Table 1.3 of Chapter 1 and reproduced here (in Table 12.3) for easy reference. Compared to the raw data in Table 1.1, this is a more compact and more informative representation of the original data. Of course, such compactness is achieved at the expense of some details, but this loss is more than compensated for by a certain enhanced clarity with which the true character of the random variation begins to emerge from this frequency distribution. For example, the frequency distribution shows clearly how much of the data clusters around the group [74.51-75.50], an important characteristic that is not readily evident from the raw data table.

A plot of this frequency distribution using vertical bars whose heights are proportional to the frequencies (or, equivalently, the relative frequencies) is known as a histogram, with the one corresponding to the YA frequency distribution shown in Fig 1.1 and repeated here (in Fig 12.8) for ease of reference.
TABLE 12.3: Group classification and frequencies for YA data

YA group      Frequency   Relative Frequency, fi
71.51-72.50   1           0.02
72.51-73.50   2           0.04
73.51-74.50   9           0.18
74.51-75.50   17          0.34
75.51-76.50   7           0.14
76.51-77.50   8           0.16
77.51-78.50   5           0.10
78.51-79.50   1           0.02
TOTAL         50          1.00
FIGURE 12.8: Histogram of YA
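A frequency distribution such as Table 12.3 is straightforward to compute programmatically. The following minimal Python sketch (numpy assumed; the array ya is a hypothetical stand-in for the 50 yield observations of Table 1.1, which are not reproduced here) illustrates the bookkeeping:

import numpy as np

# hypothetical stand-in for the n = 50 yield observations Y_A of Table 1.1
rng = np.random.default_rng(1)
ya = rng.normal(75.5, 1.4, size=50)

# bin edges matching the groups of Table 12.3: 71.51-72.50, ..., 78.51-79.50
edges = np.arange(71.505, 79.51, 1.0)
freq, _ = np.histogram(ya, bins=edges)
rel_freq = freq / ya.size          # relative frequencies, f_i

for lo, hi, f, fr in zip(edges[:-1], edges[1:], freq, rel_freq):
    print(f"{lo + 0.005:.2f}-{hi - 0.005:.2f}: frequency {f:2d}, relative frequency {fr:.2f}")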
Although by far the most popular, the histogram is not the only graphical means of representing frequency distributions. If, instead of using adjoining bars to represent group frequencies, we employ a cartesian plot in which each group is represented by its center value on the x-axis and the corresponding frequency plotted on the y-axis, with the points connected by straight lines, the result is known as a frequency polygon, as shown in Fig 12.9 for the YA data. This is only a slightly different rendering of the information contained in the histogram. Fig 12.10 shows the corresponding frequency polygon for the YB data of Chapter 1 (whose histogram is shown in Fig 1.4).

As alluded to in Chapter 1, such graphical representations of the data provide an empirical and partial (as opposed to complete) approximation to the true underlying distribution; but they show features not immediately apparent from the raw data. Because data sets are incomplete samples, the corresponding frequency distributions show irregularities; but, as the sample size n → ∞, these irregularities gradually diminish, so that these empirical distributions ultimately approach the complete population distribution of the random variable in question. These facts inform the "frequentist" approach to statistics and data analysis.
Some final points to note about frequency distributions: It should be clear that, to be meaningful, histograms must be based on an ample amount of data; only then will there be a sufficient number of groups, with enough members per group, to display the data distribution meaningfully. As such, whenever possible, one should avoid employing histograms to display data sets containing fewer than 15-20 data points.

It should also be clear that the choice of bin size will affect the general appearance of the resulting histogram. Bins that are too wide generate fewer groups, and the resulting histograms cannot adequately reveal the true distributional characteristics of the data. As an extreme example, a bin size covering the entire range of a data set containing a total of n data points will produce a histogram consisting of a single vertical bar of height n. Of course, this is totally uninformative (at least no more informative than the raw data table) because the entire data set will remain confined to this one group. On the other extreme, if the bin size is so small that, with the exception of exactly identical data entries, each data point fits into a group all by itself, the result is an equally uninformative histogram, uninformative for a complementary reason: this time, the entire data set is stretched out horizontally into a collection of n bars, all of the same unit height. Somewhere between these two obviously untenable extremes lies an acceptable bin size, but there are no hard-and-fast rules for choosing it. The rule-of-thumb is that any choice resulting in 8 to 10 groups is considered acceptable.
In practice, transforming raw data sets into frequency distributions and the accompanying graphical representations is almost always carried out by computer programs such as MINITAB, Matlab, SAS, etc. And these software packages are preprogrammed with algorithms that automatically choose reasonable bin sizes for each data set.
FIGURE 12.9: Frequency polygon of YA

The traditional recommendation for the number of class intervals, k, to employ for a data set of n observations is due to Sturges¹: choose

k = 1 + 3.3 log₁₀ n    (12.6)
Thus, for instance, for the yield data, with n = 50, the recommendation
will be 7 intervals. The histogram in Fig 12.8, generated automatically with
MINITAB, uses 8 intervals.
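For reference, Sturges' recommendation is trivial to compute; a one-line Python rendering of Eq (12.6), offered purely as a sketch:

import math

def sturges_intervals(n: int) -> int:
    """Number of class intervals recommended by Eq (12.6)."""
    return round(1 + 3.3 * math.log10(n))

print(sturges_intervals(50))   # -> 7, as quoted for the yield data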
12.3.3
Box Plots
¹Sturges, H.A., (1926). The choice of a class interval, J. Am. Stat. Assoc., 21, 65-66.
FIGURE 12.10: Frequency polygon of YB
The box in a box plot extends from the lower quartile, Q1, at the bottom, to the upper quartile, Q3, at the top, with a line drawn across the box at the median; the whisker limits are defined in terms of the interquartile range, IQR = Q3 − Q1, as:

Lower limit = Q1 − 1.5 × IQR    (12.7)
Upper limit = Q3 + 1.5 × IQR    (12.8)

With this definition, the lower whisker is drawn as a line extending from the bottom of the box (i.e. from Q1) to the smallest data value, so long as it falls within the lower limit. In this case, the end of the bottom whisker is therefore the data minimum. Any data value that falls outside the lower limit is flagged as a potential outlier (in this case, an unusually small observation) and represented with an asterisk. The upper whisker is drawn similarly: from the top of the box, Q3, to the largest data value within the upper limit. All data values falling outside the upper limit are similarly flagged as potential outliers, in this case unusually large observations.
FIGURE 12.11 (caption lost): Box plots of the yield data sets YA and YB
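The quantities behind a box plot are easily computed directly. A minimal Python sketch of Eqs (12.7) and (12.8) follows; note that np.percentile's default quartile interpolation may differ slightly from the convention used by other statistical packages:

import numpy as np

def boxplot_summary(x):
    """Quartiles, whisker limits (Eqs 12.7 and 12.8), and flagged outliers."""
    x = np.asarray(x, dtype=float)
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1                          # interquartile range
    lower = q1 - 1.5 * iqr                 # Eq (12.7)
    upper = q3 + 1.5 * iqr                 # Eq (12.8)
    outliers = x[(x < lower) | (x > upper)]
    return q1, q2, q3, lower, upper, outliers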
FIGURE 12.12: Boxplot of random N(0,1) data: original set, and with added outlier
FIGURE 12.13 (caption lost): Box plots of data from Machine 1 through Machine 5
12.3.4
Scatter Plots
When the values of one variable are plotted on the y-axis versus the values of another variable on the x-axis, the result is known as a scatter plot. The plot is so-called because, unless the one variable is perfectly correlated with the other, the data appear "scattered" in the plot. Such plots provide visual clues as to whether or not there truly is a relationship between the variables, and if there is one, what sort of relationship: strong or weak, linear or nonlinear, etc. Although not necessarily always the case, the variable plotted on the y-axis is usually the one that may potentially be responding to the other variable, which will be plotted on the x-axis. It is also possible that a causal relationship may not exist between the plotted variables, or if there is one, it may not always be clear which variable is responding and which is causing the response. Because these plots are truly just exploratory, care should be taken not to over-interpret them; it is especially important not to jump to conclusions that any observed apparent relationship implies causality.

A series of scatter plots are shown below, beginning with Fig 12.14, a plot of the cranial circumference and corresponding finger length for various individuals. Believe it or not, there once was a time when people speculated that these two variables correlate. This plot shows that, at least for the individuals involved in the particular study generating the data, there does not appear to be any clear relationship between these two variables. Even if the plot had indicated a strong relationship between the variables, observe that in this case, neither of these two variables can be rightly considered as dependent on the other; thus, the choice of which variable to plot on which axis is purely arbitrary.
FIGURE 12.14: Scatter plot of cranial circumference versus finger length: The plot shows no real relationship between these variables

Next, consider the data shown in Table 12.5, city and highway gasoline mileage ratings, in miles per gallon (mpg), for 20 types of two-seater automobiles, complete with engine characteristics: capacity (in liters) and number of
cylinders. First, a plot of city gas mileage versus highway gas mileage, shown in Fig 12.15, indicates a very strong, positive linear relationship between these two variables. However, even though related, it is clear that this relationship is not causal, in the sense that one cannot independently and directly manipulate, say, city gas mileage and as a direct consequence thereby cause highway gas mileage to change. Rather, both variables depend in common on other factors that can be independently and directly manipulated (e.g., engine capacity).

Fig 12.16 shows a plot of highway gas mileage against engine capacity, indicating an approximately linear and negative relationship. Observe that according to the thermodynamics of internal combustion engines, the physics of work done by applying force to move massive objects, and the fact that larger engines are normally required for bigger cars, it is entirely logical that smaller engines correlate with higher highway gas mileage. Fig 12.17 shows a corresponding plot of highway gas mileage versus the number of cylinders. This plot also indicates a similar negative, and somewhat linear, relationship between the variables. (Because city and highway gas mileage values are so strongly correlated, similar plots of the city mileage data should show characteristics similar to the corresponding highway mileage plots. See Exercise 12.15.)
Scatter plots are also very good at pointing out data that might appear
inconsistent with others in the group. For example, in Fig 12.16, two data
points for engine capacities 7 liters and 8.4 liters are associated with highway
TABLE 12.5 (entries lost): City and highway gas mileage ratings, in miles per gallon (mpg), engine capacity (liters), and number of cylinders for 20 types of two-seater automobiles
FIGURE 12.15: Scatter plot of city gas mileage versus highway gas mileage for various two-seater automobiles: The plot shows a strong positive linear relationship, but no causality is implied.
FIGURE 12.16: Scatter plot of highway gas mileage versus engine capacity for various two-seater automobiles: The plot shows a negative linear relationship. Note the two unusually high mileage values associated with engine capacities 7.0 and 8.4 liters, identified as belonging to the Chevrolet Corvette and the Dodge Viper, respectively.
FIGURE 12.17: Scatter plot of highway gas mileage versus number of cylinders for various two-seater automobiles: The plot shows a negative linear relationship.
FIGURE 12.18: Scatter plot of US population every ten years since the 1790 census versus census year: The plot shows a strong non-linear trend, with very little scatter, indicative of systematic, approximately exponential growth
gas mileage values of 24 and 22 miles per gallon, respectively, values that appear unusually high for such large engines, especially when compared to corresponding values for other automobiles with engines of similar volume. An inspection of the data table indicates that these values belong to the Chevrolet Corvette and the Dodge Viper models, respectively, automobiles whose bodies are constructed from special fiberglass composites, resulting in vehicles that are generally lighter in weight than others in their class. The scatter plot correctly shows these data to be inconsistent with the rest, and we are able to provide a logical reason for the unusually high gas mileage: lighter cars, even those with large engines, generally get better gasoline mileage.
When the variable on the x-axis is time, the plot provides an indication of any trends that may exist in the y variable. Examples of such plots include monthly sales volume of a particular item; daily closing values of stocks on the stock exchange; monthly number of drunk driving arrests on a municipal road, etc. These are all plots that indicate time trends in the variables of interest. Fig 12.18 shows a plot of the population of the United States of America as determined by the decade-by-decade census from 1790 to 2000. This plot shows the sort of exponential growth trend that is typical of populations of growing organisms. We revisit this data set in Chapter 20.
12.4
Numerical Descriptions
Characterizing data sets by empirical frequency distributions, while useful for graphically condensing the information contained in the data into a relatively small number of groups, is not particularly useful for carrying out quantitative comparisons. Such comparisons require quantitative numerical descriptors of the data characteristics, typically measures of (i) central tendency (or data location); (ii) variability (or spread); (iii) skewness; and (iv) peakedness. It is not coincidental that these common numerical descriptors align perfectly with the characteristic moments of theoretical distributions discussed in Chapter 4. In statistical analysis, these numerical descriptors are computed from sample data as single numbers used to represent various aspects of the entire data set; they are therefore numerical approximations of the corresponding true but unknown population distribution parameters. Given sample data, all such numerical descriptors are, of course, routinely computed by various statistical analysis software packages; the following discussion simply provides some perspective on the most common of these descriptors. In particular, we demonstrate the sense in which they are to be considered as appropriate measures, and hence clarify the context in which they are best utilized.
12.4.1
Theoretical Measures of Central Tendency

Consider the problem of finding a single value, c, to represent the central location of a random variable, X, where the quality of the representation is judged by the objective function

Δn = {E[|X − c|^n]}^(1/n)    (12.9)

The Mean
When n = 2 in Eq (12.9), minimizing Δn is equivalent to minimizing the mean squared deviation:

Δ2² = E[(X − c)²] = E[X² − 2Xc + c²]
    = E[X²] − 2cE[X] + c²    (12.11)
because c is a constant. Employing the standard tools of calculus (differentiating in Eq (12.11) with respect to c, setting the result to zero and solving for c) yields:

dΔ2²/dc = −2E[X] + 2c = 0    (12.12)

giving the immediate (and not so surprising) result that

c = E[X] = μ    (12.13)

Thus, the mean is the best single representative of the theoretical "centroid" of a random variable if we are concerned with minimizing the mean squared deviation from all possible values of X.
The Median
When n = 1 in Eq (12.9), we wish to find c to minimize the mean absolute deviation between it and the possible values of the random variable, X. For the continuous random variable, by definition,

Δ1 = ∫_{−∞}^{∞} |x − c| f(x) dx    (12.14)

so that

∂Δ1/∂c = ∫_{−∞}^{∞} (∂/∂c)|x − c| f(x) dx = 0    (12.15)

Now,

(∂/∂c)|x − c| = −1 for x > c;  +1 for x < c    (12.16)

As a result, Eq (12.15) becomes:

0 = ∫_{−∞}^{c} f(x) dx − ∫_{c}^{∞} f(x) dx    (12.17)

where the first term represents the integral over the region {x : x < c} and the second term is the integral over the remaining region {x : x > c}. This equation has the obvious solution:

∫_{−∞}^{c} f(x) dx = ∫_{c}^{∞} f(x) dx    (12.18)
i.e., c must be chosen so that half of the probability lies on either side of it:

P(X ≤ c) = P(X ≥ c) = 1/2, so that c = xm    (12.20)

where xm is the median. It is left as an exercise for the reader (see Exercise 12.18) to establish the corresponding result for discrete X. Thus, the median is the central representative of X that gives the smallest mean absolute deviation from all possible values of X.
The Mode
It is shown in the Appendix at the end of this chapter that when n = ∞,

c = x*    (12.21)

where x*, the mode of the pdf f(x), minimizes the objective function in Eq (12.9). The mode is therefore the central representative of X that provides the smallest of the "worst possible" deviations from all possible values of X.

This discussion puts into perspective the three most popular measures of central location of a random variable (the mean, median, and mode), their individual theoretical properties, and what makes each one a "good" measure. Theoretically, for all symmetric distributions, the mean, mode and median all coincide; they differ (sometimes significantly) for nonsymmetric distributions. The sample data equivalents of these population parameters are obtained as discussed next.
12.4.2
Sample Measures of Central Tendency

Sample Mean
From a sample x1, x2, x3, . . . , xn, the sample mean, or the sample average, x̄, is defined as:

x̄ = (1/n) Σ_{i=1}^{n} xi    (12.22)

In terms of the just-concluded theoretical considerations, this implies that, of all possible candidate values, c, the sample average is that value which minimizes the mean squared error between the observed realizations of X and c. This quantity is sometimes referred to as the arithmetic mean, to distinguish it from other means. For example, the geometric mean, x̄g, defined as

x̄g = ( Π_{i=1}^{n} xi )^(1/n)    (12.23)

is sometimes preferred for representing the centrum of data from skewed distributions, such as the lognormal distribution. Observe that taking logarithms in Eq (12.23) establishes that the log of the geometric mean is the arithmetic mean of the log-transformed data.
Similarly, the harmonic mean, x̄h, defined as

x̄h = n / ( Σ_{i=1}^{n} (1/xi) )    (12.24)

is more appropriate for data involving rates, ratios, or any phenomenon where the true variable of concern occurs naturally as a reciprocal entity. The classic example is data involving velocities: if a particle covers a fixed distance, s, at varying velocities x1 and x2, from elementary physics we are able to deduce that the average velocity with which it covers this distance is not the arithmetic mean, (x1 + x2)/2, but the harmonic mean. This, of course, is because, with the distance fixed, the consequence of the variable velocity is a commensurate variation in the time to cover the distance, a reciprocal of the velocity. Note that if the time of travel, as opposed to the distance, is the fixed quantity, then the average velocity will be the arithmetic mean.
In general, the following relationship holds between these various sample averages:

x̄h < x̄g < x̄    (12.25)

Note how, by definition, the arithmetic mean is susceptible to undue influence from extremely large observations; by the same token, the reverse is the case with the harmonic mean, which is susceptible to the undue influence of unusually small observations (whose reciprocals become unusually large). Such influences are muted with the geometric mean.
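The velocity example is easy to verify numerically. A minimal Python sketch comparing the three averages for a particle covering a fixed distance first at velocity 30 and then at 60 (in any consistent units):

import numpy as np

x = np.array([30.0, 60.0])                # velocities over a fixed distance

arithmetic = x.mean()                     # 45.0
geometric = np.exp(np.log(x).mean())      # ~42.43; mean of logs = log of geometric mean
harmonic = x.size / np.sum(1.0 / x)       # 40.0, the correct average velocity here

print(arithmetic, geometric, harmonic)    # note: harmonic < geometric < arithmetic, Eq (12.25)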
Sample Median
Let us begin by reordering the observations x1, x2, . . . , xn in ascending order to obtain x(1), x(2), . . . , x(m), . . . , x(n) (we could also do this in descending order instead); if n is odd, the middle number, x(m), is the median, where m = (n + 1)/2. If n is even, let m = n/2; then the median is the average of the two middle numbers x(m) and x(m+1), i.e.

xm = ( x(m) + x(m+1) ) / 2    (12.26)

Because the median, unlike the means, does not involve carrying out any arithmetic operation on the extreme values, x(1) and x(n), it is much less susceptible to unusually large or unusually small observations. The median is therefore quite robust against outliers. Nevertheless, because it does not utilize all the information contained in the sample data set, it is more susceptible to chance fluctuations.
Sample Mode
The sample mode can only be obtained directly from the frequency distribution.
12.4.3
Measures of Variability
With the deviation of each observation from the sample mean defined as

di = xi − x̄    (12.28)

(note that Σ_{i=1}^{n} di = 0), the mean absolute deviation,

d̄ = (1/n) Σ_{i=1}^{n} |di|    (12.29)

provides one measure of the spread in a data set. In practice, however, the measures discussed next are far more commonly used than d̄.
Sample Variance and Standard Deviation
The mean squared deviation from the sample mean, defined as:

s² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)²    (12.31)

is known as the sample variance; its positive square root,

s = +√s² = √( Σ_{i=1}^{n} (xi − x̄)² / (n − 1) )    (12.32)

is the sample standard deviation; it has the same unit as x, as opposed to s², which has the unit of x squared.

TABLE 12.6 (entries lost): Summary descriptive statistics for the yield data sets YA and YB
Example 12.4: SUMMARY DESCRIPTIVE STATISTICS FOR YIELD DATA SETS OF CHAPTER 1
First obtain, then compare and contrast, summary descriptive statistics for the yield data sets YA and YB presented in Chapter 1 (Table 1.1).

Solution:
The summary descriptive statistics for these data sets (obtainable using any typical software package) are shown in Table 12.6. The computed average is higher for YA than for YB, but the standard deviation (hence also the variance) is lower for YA; the very low skewness values for both data sets indicate a lack of asymmetry. The values shown for kurtosis are actually for the so-called "excess kurtosis," defined as the ordinary kurtosis minus 3, which will be zero for a perfectly Gaussian distribution; the computed values shown here indicate that both data distributions are essentially Gaussian.
The remaining quantitative descriptors make up the "five-number summary" used to produce the typical box plot; jointly, they indicate what we already know from Fig 12.11: in every single one of these categories, the value for YA is consistently higher than the corresponding value for YB.

The modes, which cannot be computed directly from the raw data, are obtained from the frequency distributions as 75 for YA and 70 for YB, for the specific bin sizes used to generate the respective histograms/frequency polygons (see Figs 12.8 and Fig 1.2 in Chapter 1; or Figs 12.9, 12.10).
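Summary statistics of the kind shown in Table 12.6 are obtainable from any statistical package. As a sketch (numpy and scipy assumed), a Python function that assembles the same descriptors for an arbitrary data array x might look like:

import numpy as np
from scipy import stats

def summarize(x):
    """Descriptive statistics in the spirit of Table 12.6."""
    x = np.asarray(x, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return {
        "n": x.size,
        "mean": x.mean(),
        "std dev (s)": x.std(ddof=1),           # Eq (12.32)
        "skewness": stats.skew(x),
        "excess kurtosis": stats.kurtosis(x),   # ordinary kurtosis minus 3, by default
        "min": x.min(), "Q1": q1, "median": median, "Q3": q3, "max": x.max(),
    }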
12.4.4
Graphical vs Numerical Descriptions

It is easy to misconstrue the two aspects of descriptive statistics (graphical techniques and numerical descriptors) as mutually exclusive; or, at the very least, to suppose that the former is more useful for qualitative data while the latter is more useful for quantitative data. While there is an element of truth to this latter statement, it is more precise to consider the two aspects as complementary. Graphical techniques are great for conveying a general sense of the information contained in the data, but they cannot be used for quantitative analysis or comparisons. On the other hand, even though these graphical techniques are not quantitative, they provide insight into the nature of the data set that mere numbers alone cannot possibly convey. One is incomplete without the other.
To illustrate this last point, we now present a classic example due to Anscombe³. The example involves four data sets, the first of which is shown in Table 12.7. The basic numerical characteristics of X1 and Y1 are as follows: total number, n = 11, for both variables; the averages: x̄1 = 9.0, ȳ1 = 7.5; the standard deviations: sx1 = 3.32, sy1 = 2.03; and the correlation coefficient, 0.82. The corresponding numerical characteristics of the other three data sets, shown in Table 12.8, are virtually identical to these;

³Anscombe, Francis (1973), "Graphs in Statistical Analysis," The American Statistician, 27, 17-21.
TABLE 12.7: The Anscombe data set 1

X1       Y1
10.00    8.04
8.00     6.95
13.00    7.58
9.00     8.81
11.00    8.33
14.00    9.96
6.00     7.24
4.00     4.26
12.00    10.84
7.00     4.82
5.00     5.68
TABLE 12.8: The Anscombe data sets 2, 3, and 4

X2       Y2      X3       Y3       X4       Y4
10.00    9.14    10.00    7.46     8.00     6.58
8.00     8.14    8.00     6.77     8.00     5.76
13.00    8.74    13.00    12.74    8.00     7.71
9.00     8.77    9.00     7.11     8.00     8.84
11.00    9.26    11.00    7.81     8.00     8.47
14.00    8.10    14.00    8.84     8.00     7.04
6.00     6.13    6.00     6.08     8.00     5.25
4.00     3.10    4.00     5.39     19.00    12.50
12.00    9.13    12.00    8.15     8.00     5.56
7.00     7.26    7.00     6.42     8.00     7.91
5.00     4.74    5.00     5.73     8.00     6.89
FIGURE 12.19 (caption lost): Scatter plot of Y1 versus X1 for the Anscombe data set 1
and on this basis alone, one might be tempted to conclude that the four data sets must somehow be equivalent. Of course this is not the case. Yet, how truly different the data sets are becomes quite obvious by mere inspection of the scatter plots shown in Figs 12.20, 12.21, and 12.22 when compared to Fig 12.19.

Thus, while good for summarizing data with a handful of quantitative characteristics, numerical descriptions are necessarily incomplete; they can (and often do) omit, or filter out, important distinguishing features in the data sets. For a complete view of the information contained in any data set, it is important to supplement quantitative descriptions with graphical representations.
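Anscombe's point is easy to confirm computationally. A minimal Python sketch for the first data set (the other three, from Table 12.8, give essentially the same numbers):

import numpy as np

x1 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
               7.24, 4.26, 10.84, 4.82, 5.68])

print(x1.mean(), y1.mean())                # 9.0 and ~7.5
print(x1.std(ddof=1), y1.std(ddof=1))      # ~3.32 and ~2.03
print(np.corrcoef(x1, y1)[0, 1])           # ~0.82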
FIGURE 12.20 (caption lost): Scatter plot of Y2 versus X2 for the Anscombe data set 2
FIGURE 12.21 (caption lost): Scatter plot of Y3 versus X3 for the Anscombe data set 3
FIGURE 12.22 (caption lost): Scatter plot of Y4 versus X4 for the Anscombe data set 4
12.5
Summary and Conclusions

This chapter was designed to serve as a transitional link between probability and statistics. By first articulating the central issue in statistical analysis (characterizing randomly varying phenomena on the basis of finite data), the case was made for why statistics must rely on probability even as it complements it. This led to the introduction of the basic concepts involved in statistics and an overview of what the upcoming detailed study of statistics entails. Compared to what is covered in the rest of Part IV, this chapter's introduction to descriptive statistics (organization, graphical representation and numerical summarization of sample data) may have been brief, but it is no less important. It is worth reiterating, therefore, that numerical analysis is most effective when complemented (wherever possible) with graphical plots.

Here are some of the main points of the chapter again.

Statistics is concerned with fully characterizing randomly varying phenomena on the basis of finite sample data; it relies on probability to quantify the inevitable uncertainty associated with such an endeavor.

The three central concepts of statistical analysis are:

Population: the complete collection of all the data obtainable from an experiment; unrealizable in practice, it is to statistics what the sample/random variable space is to probability;

Sample: specific observations of finite size; a subset of the population;

Statistical Inference: the process of drawing conclusions about the population from the information contained in a finite sample.
APPENDIX

We wish to find the value of c that minimizes the objective function:

Δ∞ = lim_{p→∞} Δp = lim_{p→∞} { ∫_{−∞}^{∞} (x − c)^p f(x) dx }^(1/p)    (12.33)

for the continuous random variable, X. Upon taking derivatives and equating to zero, we obtain:

∂Δ∞/∂c = lim_{p→∞} { −p ∫_{−∞}^{∞} (x − c)^(p−1) f(x) dx } = 0    (12.34)

Integrating by parts yields:

0 = lim_{p→∞} { −f(x)(x − c)^p |_{−∞}^{∞} } + lim_{p→∞} ∫_{−∞}^{∞} (x − c)^p f′(x) dx    (12.35)

The first term on the RHS of the equality sign after 0 vanishes because, for all valid pdfs, f(−∞) = f(∞) = 0, so that Eq (12.35) reduces to:

0 = lim_{p→∞} ∫_{−∞}^{∞} (x − c)^p f′(x) dx    (12.36)

This is satisfied by the value of c for which

f′(c) = 0    (12.37)

and this occurs at the mode of the pdf, f(x). It is left as an exercise to the reader to obtain the corresponding result for a discrete X (see Exercise 12.19).
REVIEW QUESTIONS
1. As presented in this chapter, what are the three distinct but related entities involved in a systematic study of randomly varying phenomena?

2. What is the difference between X, the random variable, and n individual observations, {xi}, i = 1, 2, . . . , n?

3. What is the difference between writing a pdf as f(x) and as f(x|θ), where θ is the vector of parameters?

4. Why is the theoretical description f(x) for any specific randomly varying phenomenon never completely available?

5. With probability analysis, which of these two entities, f(x) and {xi}, i = 1, 2, . . . , n, is available and what is to be determined with it? With statistical analysis, on the other hand, which of the two entities is available and what is to be determined with it?

6. Statistics is a methodology for doing what?

7. As stated in Section 12.1.2, what three concepts are central to statistical analysis?

8. What is the difference between a sample and a population?

9. What is statistical inference?

10. Which of the following two entities can be specified à priori, and which is an à posteriori entity: (a) the random variable space, VX, in probability, and (b) the population in statistics?

11. Why must one settle for characterizing the population by drawing statistical inference?

12. How does systematic data collection fit into statistical inference?

13. What is the connection between probability, statistics and design of experiments?

14. Into which two categories is statistics primarily categorized?

15. What is Descriptive Statistics?

16. What is Inductive Statistics?

17. What does Design of Experiments entail?

18. What is the difference between a qualitative and a quantitative variable?
EXERCISES
Section 12.1
12.1 Consider the experiment of tossing a single six-faced die and recording the number shown on the top face after the die comes to rest.
(i) What is the random variable in this case, and what is the random variable space?
(ii) What is the population, and what will constitute a sample from this population?
(iii) With adequate justification, postulate a probability model for this random variable.

12.2 Consider a chess player who is participating in a two-game, pre-tournament qualification series where the outcome of each game is either a win, a loss, or a draw.
(i) If we are interested only in the total number of wins and the total number of draws, what is the random variable in this case, its dimensionality, and the associated random variable space?
(ii) Describe the population in this case and what will constitute a sample from such a population.
(iii) With adequate justification, postulate a reasonable probability model for this random variable. What are the unknown parameters?

12.3 Lucas (1985)⁴ studied the number and frequency of occurrences of accidents over a 10-year period at a DuPont company facility. If the variable of interest is the time between occurrences of these accidents, describe the random variable space, the population, and what might be considered as a sample. Postulate a reasonable probability model for this variable and note how many parameters it has.

12.4 In studying the useful lifetime (in years) of a brand of washing machines with the aid of a Weibull probability model, indicate the random variable space, the population and the population parameters. How can one go about obtaining a sample {xi}, i = 1, 2, . . . , 50, from this population?
Section 12.2

12.5 Classify each of the following variables as either quantitative or qualitative; if quantitative, specify whether it is discrete or continuous; if qualitative, whether it is ordinal or nominal.
(i) Additive type (A, B, or C); (ii) Additive concentration (moles/liter); (iii) Heat condition (Low, Medium, High); (iv) Agitation rate (rpm); (v) Total reactor volume (liters); (vi) Number of reactors in ensemble (1, 2, or 3).

12.6 The Lucas (1985) study of Exercise 12.3 involved the following variables:
(i) Period (I, II); (ii) Length of study (years); (iii) Number of accidents; (iv) Type of accident; (v) Time between accidents.
Classify each variable as either quantitative or qualitative; if quantitative, specify whether it is discrete or continuous; if qualitative, whether it is ordinal or nominal.
⁴Lucas, J. M. (1985). "Counted data CUSUMs," Technometrics, 27, 129-144.
12.7 A study of the effect of environmental cues on cell adhesion involved the following variables. Classify each one as either quantitative (discrete or continuous) or qualitative (ordinal or nominal).
(i) Type of stimulus (Mechanical, Chemical, Topological); (ii) Ligand concentration; (iii) Surface type (Patterned, Plain); (iv) Mechanical force; (v) Number of integrin clusters; (vi) Integrin cluster size; (vii) Cell status (Adherent, Non-Adherent).
Section 12.3

12.8 The table below shows where chemical engineers found employment in 2000, categorized by degree. For each degree category (BS, MS and PhD), draw a bar chart and a pie chart. Comment on what stands out most prominently in each case, within each degree category and across the degree categories (for example, "Academia" across the categories).

Employer                    BS Placement %   MS Placement %   PhD Placement %
Industry                    55.9             44.1             57.8
Government                  1.7              2.2              0.8
Grad/Professional School    11.2             33.1             13.1
Returned to Home Country    1.3              4.7              0.1
Unemployed                  9.5              4.5              2.8
Unknown Employment          18.8             7.4              6.4
Academia                    0.0              1.8              16.5
Other                       1.8              2.2              1.7
12.9 Generate Pareto Charts for each degree category for the chemical engineering
employment data in Exercise 12.8. Interpret these charts.
12.10 The data in the table below (adapted from a 1983 report from the President's Council on Physical Fitness and Sports) shows the number of adult Americans (non-professionals) participating in the indicated sports.

Type of Sport      Number of Participants (in millions)
Basketball         29
Bicycling          44
Football (Touch)   18
Golf               13
Hiking             34
Ice-skating        11
Racquetball        10
Roller-skating     20
Running            20
Skiing             10
Softball           26
Swimming           60
Tennis             23
Volleyball         21

Generate a bar chart and a Pareto chart for this data; interpret the charts.
12.11 The table below shows the number of "well-to-do" (WTD) individuals in each of ten states.

WTD Population
301,500
204,800
151,800
110,100
108,000
86,800
52,500
51,300
50,800
48,100

Generate a bar chart for the data in terms of relative frequency. In how many of these 10 states will one find approximately 80% of the listed well-to-do individuals?
Section 12.4

12.12 The data in the following table shows samples of size n = 20 drawn from four different populations coded as N, L, G and I. Generate a histogram and a box plot for each of the data sets. Discuss what these plots indicate about the general characteristics of the population from which the data were obtained.

XN        XL        XG        XI
9.3745    7.9128    10.0896   0.084029
8.8632    5.9166    15.7336   0.174586
11.4943   4.5327    15.0422   0.130492
9.5733    33.2631   5.5482    0.115567
9.1542    24.1327   18.0393   0.187260
9.0992    5.4151    17.9543   0.100054
10.2631   16.9556   12.5549   0.101405
9.8737    3.9345    9.6640    0.100835
7.8192    35.0376   14.2975   0.097173
10.4691   25.1182   4.2599    0.141233
9.6981    1.1804    19.1084   0.060470
10.5911   2.3503    7.0735    0.127663
11.6526   15.6894   7.6392    0.074183
10.4502   5.8929    14.1899   0.086606
10.0772   8.0254    13.8996   0.084915
10.2932   16.1482   9.7680    0.242657
11.7755   0.6848    8.5779    0.052291
9.3790    6.6974    7.5486    0.116172
9.9202    3.6909    10.4043   0.084339
10.9067   34.2152   14.8254   0.205748
12.13 For each sample in Exercise 12.12, compute the (i) arithmetic mean; (ii) geometric mean; (iii) median; and (iv) harmonic mean. Which do you think is a more appropriate measure of the central tendency of the original population from which these samples were drawn, and why?
12.14 The table below shows a relative frequency summary of sample data on distances between DNA replication origins ("inter-origin distances"), measured by Li et al., (2003)⁵, with an in vitro Xenopus egg extract assay in Chinese Hamster Ovary (CHO) cells, as reported in Chapter 7 of Birtwistle (2008)⁶.

Inter-Origin Distance (kb), x    Relative Frequency, fr(x)
0       0.00
15      0.09
30      0.18
45      0.26
60      0.18
75      0.09
90      0.04
105     0.03
120     0.05
135     0.04
150     0.01
165     0.02

(The data set is similar to, but different from, the one in Application Problem 9.40 in Chapter 9.) Obtain a histogram of the data and determine the mean, variance and your best estimate of the median.
12.15 From the data given in Table 12.5 in the text, generate a scatter plot of (i) city gas mileage against engine capacity, and (ii) city gas mileage against number of cylinders, for the two-seater automobiles listed in that table. Compare these plots to the corresponding ones in the text for highway gas mileage. Are there any surprises in these city gas mileage plots?

12.16 Let X1 and X2 represent, respectively, the engine capacity, in liters, and number of cylinders for the population of two-seater automobiles; let Y1 and Y2 represent the city gas mileage and highway gas mileage, respectively, for these same automobiles. Consider that the data in Table 12.5 constitute appropriate samples from the respective populations. From the supplied sample data, compute the complete set of 6 pairwise correlation coefficients between these variables. Comment on what these correlation coefficients mean.

12.17 Confirm that the basic numerical characteristics of each (X, Y) pair in Table 12.8 are as given in the text.

12.18 Determine the value of c that minimizes the mean absolute deviation, Δ1,
⁵Li, F., Chen, J., Solessio, E. and Gilbert, D. M. (2003). Spatial distribution and specification of mammalian replication origins during G1 phase. J Cell Biol, 161, 257-66.
⁶Birtwistle, M. R. (2008). Modeling and Analysis of the ErbB Signaling Network: From Single Cells to Tumorigenesis, PhD Dissertation, University of Delaware.
between it and the possible values of the discrete random variable, X, whose pdf is given as f(xi), i.e.,

Δ1 = Σ_{i=0}^{∞} |xi − c| f(xi)    (12.38)

12.19 Determine the value of c that minimizes

Δ∞ = lim_{p→∞} { Σ_{i=0}^{∞} (xi − c)^p f(xi) }^(1/p)    (12.39)

and show that it is the mode of the discrete pdf f(xi), i.e., f(c) > f(xi) for all i.
APPLICATION PROBLEMS

12.20 A quality control engineer at a semiconductor manufacturing site is concerned about the number of contaminant particles (flaws) found on each standard size silicon wafer produced at the site. A sample of 20 silicon wafers selected and examined for flaws produced the result (the number of flaws found on each wafer) shown in the following table.

3  4  0  1  0  2  2  3  3  2
0  1  3  2  2  4  1  0  2  1

(i) For this particular problem, what is the random variable, X, and what is the set {xi}, i = 1, 2, . . . , n? Why is the Poisson model, with the single parameter λ, a reasonable probability model for the implied phenomenon?
(ii) From the expression for f(x|λ), compute the theoretical probabilities when the population parameter is specified as 0.5, 1.0 and 1.5. From these theoretical probabilities, which of the postulated population parameters appears more representative of the observations?
12.21 The time in months between occurrences of safety violations in a toll manufacturing facility is shown in the table below.

1.31  0.15  3.02  3.17  4.84  0.71  0.70  1.41  2.68  0.68
1.94  3.21  2.91  1.66  1.51  0.30  0.05  1.62  6.75  1.29
0.79  1.22  0.65  3.90  0.18  0.57  7.26  0.43  0.96  3.76

(i) Determine the mean, median and variance for this sample data. Construct a histogram and explain why the observed shape is not surprising, given the nature of the phenomenon in question.
(ii) What is a reasonable probability model for the population from which the data came? If the population parameter, the mean time between violations, is postulated to be 2 months, compute the theoretical probability of going more than 2 months without a safety violation. Is this theoretical probability compatible with this sample data? Explain.
(iii) In actual fact, the data were obtained for three different operators and have been arranged accordingly: the first row is for "Operator A"; the second row, for "Operator B"; and the third row, for "Operator C." It has been a long-held preconception in the manufacturing facility that Operator A is relatively more safety-conscious than the other two. Strictly on the basis of any graphical and numerical descriptions of each data set that you deem appropriate, is there any suggestion in this data set that could potentially support this preconception? Explain.
12.22 Nelson (1989) quantified the cold cranking power of five different battery types in terms of the number of seconds that a particular battery generated its rated amperage without falling below 7.2 volts, at 0°F. The experiment was repeated four times for each battery type, and the resulting data set is shown in the following table.

                         Battery Type
Experiment No      1     2     3     4     5
1                  41    42    27    48    28
2                  43    43    26    45    32
3                  42    46    28    51    37
4                  46    38    27    46    25
12.23 The table below shows the boiling point, in °C, of a series of hydrocarbon compounds, as a function of n, the number of carbon atoms per molecule.

n, Number of Carbon Atoms    Boiling Point (°C)
1                            -162
2                            -88
3                            -42
4                            1
5                            36
6                            69
7                            98
8                            126

What does this data set imply about the possibility of predicting the boiling point of compounds in this series on the basis of the number of carbon atoms? Compute the correlation coefficient between these two variables, even though the number n is not a random variable. Comment on what the computed value indicates.
12.24 The following table contains experimental data on the thermal conductivity, k (W/m-°C), of a metal, determined at various temperatures.

k (W/m-°C)    Temperature (°C)
93.228        100
92.563        150
99.409        200
101.590       250
111.535       300
115.874       350
119.390       400
126.615       450

What sort of systematic functional relationship between the two variables (if any) does the evidence in the data suggest? Compute the correlation coefficient between the two variables and comment on what this value indicates. What would you recommend as a reasonable value to postulate for the thermal conductivity of the metal at 325°C? Justify your answer succinctly.
12.25 The following data set, from the same study by Lucas (1985) referenced in Exercise 12.3, shows the actual number of accidents occurring per quarter (three months), separated into two periods: Period I is the first five-year period of the study; Period II, the second five-year period.

Period I           Period II
5   5   10   8     3   4   2   0
4   5   7    3     1   3   2   2
2   8   6    9     7   7   1   4
5   6   5    10    1   2   2   1
6   3   3    10    4   4   4   4
12.26 The following table shows the US population, in thousands, broken down by age group, for the census years 1960 and 1980.

1960      1980
20,321    16,348
18,692    16,700
16,773    18,242
13,219    21,168
10,801    21,319
10,869    19,521
11,949    17,561
12,481    13,965
11,600    11,669
10,879    11,090
9,606     11,710
8,430     11,615
7,142     10,088
16,560    25,550
Identify any features that might be due to the "baby-boom" generation, those born during the period from the end of World War II until about 1965.
Chapter 13
Sampling
If, as stated in Chapter 12, inductive (or inferential) statistics is primarily concerned with drawing inference about a population from sample information, then a logical treatment of inductive statistics must begin with "sampling," a formal study of samples from a population. Because it is a finite collection of individual observations, a sample is itself susceptible to random variation, since different samples drawn from the same population under identical conditions will be different. As such, before samples can be useful for statistical inference concerning the populations that produced them, the variability inherent in samples must be characterized mathematically (just as was done for individual observations xi from a random variable, X). How one characterizes the variability inherent in samples, as distinct from, but obviously related to, characterizing the variability inherent in individual observations of a random variable, X, is the focus in this chapter. Sampling is the foundational element of statistical inference, and this chapter's discussion is an indispensable precursor to the discussions of estimation and hypothesis testing to follow in the next two chapters.
13.1
Introductory Concepts
As we now know, the role played by the sample space (or, equivalently, the random variable space) in probability theory is analogous to that of the population in statistics. In this regard, what the randomly varying individual observation, x, is to the random variable space, VX, in probability theory, the finite-sized sample, {xi}, i = 1, 2, . . . , n, is to the population in statistics. In the former, the variability inherent to individual observations is characterized by the pdf, f(x), an ensemble representation that is then used to carry out theoretical probabilistic analysis for the elements of VX. There is an analogous problem in statistics: in order to characterize the population appropriately, we must first figure out how to characterize the variability intrinsic to the finite-sized sample. The entire subject matter of characterizing and analyzing samples from a population, and employing such results to make statistical inference statements about the population, is known as sampling, or sampling theory. The following three concepts:

1. The Random Sample;
2. The Statistic; and
3. The Distribution of a Statistic (or The Sampling Distribution),

are central to sampling. As discussed in detail shortly, sampling theory combines these concepts into the basis for characterizing the uncertainty in samples in terms of probability distributions that can then be used for statistical inference.
13.1.1
The Random Sample

Let X1, X2, . . . , Xn be n mutually stochastically independent random variables, each having the same pdf, f(x); then X1, X2, . . . , Xn is said to constitute a random sample of size n from a population with pdf f(x). The condition noted here for the random variables is also sometimes rendered as "independently and identically distributed," or "i.i.d."

The practical implication of this concept and the definition given above is that if we can ensure that the sample from a population is drawn randomly, then the joint distribution of the sample is a product of the contributing pdfs. This rather simple concept significantly simplifies the theory and practice of statistical inference, as we show shortly.
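In symbols (a restatement for emphasis, not a new result): if X1, X2, . . . , Xn constitute a random sample from a population with pdf f(x), the joint pdf of the sample factorizes as

f(x_1, x_2, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n) = \prod_{i=1}^{n} f(x_i)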
13.1.2
Definitions

A statistic is any function of one or more random variables that does not depend upon any unknown population parameters. For example, let X1, X2, . . . , Xn be mutually stochastically independent random variables, each with identical N(μ, σ²) distributions, with unknown parameters μ and σ; then the random variable Y defined as:

Y = Σ_{i=1}^{n} Xi    (13.1)

is a statistic; so is the first observation in the sample,

Y = X1    (13.2)

By contrast, a quantity such as (X1 − μ)/σ is not a statistic, since it depends on the unknown parameters μ and σ.
1. Even though a statistic (say Y as defined above) does not depend upon an unknown parameter, the distribution of a statistic (say fY(y)) quite often depends on unknown parameters.

2. The distributions of such statistics are called sampling distributions because they are distributions of functions of samples. Since a statistic, as defined above, is itself a random variable, its sampling distribution describes the variability (chance fluctuations) one will observe in it as a result of random sampling.

Utility
The primary utility of the statistic and its distribution is in determining unknown population parameters from samples, and in quantifying the inherent variability. This becomes evident from the following re-statement of the problem of statistical inference:

1. The pdf for characterizing the random variable, X, f(x|θ), contains unknown parameters, θ; were it possible to observe, via experimentation, the complete population in its entirety, we would be able to construct, from such observations, the complete f(x|θ), including the parameters; however, only a finite sample from the population is available via experimentation. When the form of f(x|θ) is known, we are left with the issue of determining the unknown parameters, θ, from the sample, i.e., making inference about the population parameter from sample data.

2. We make these inferences by investigating random samples, using appropriate "statistics" (quantities calculated from the random sample) that will provide information about the parameters.

3. These statistics, which enable us to determine the unknown parameters, are themselves random variables; the distributions of such statistics then enable us to make probability statements about these statistics and hence the unknown parameters.
It turns out that most of the unknown parameters in a pdf representing a population are contained in the mean, μ, and variance, σ², of the pdf in question. Thus, once the mean and variance of a pdf are known, the naturally occurring parameters can then be deduced. For example, if X ~ N(μ, σ²), the mean and the variance are in fact the naturally occurring parameters; for the gamma random variable, X ~ γ(α, β), recall that

μ = αβ    (13.4)
σ² = αβ²    (13.5)

so that, given the mean and the variance, the parameters are recovered as:

α = μ²/σ²    (13.6)
β = σ²/μ    (13.7)
For the Poisson random variable, X ~ P(λ), μ = σ² = λ, so that the parameter is directly determinable from either the mean or the variance (or both).
Thus, it is often sufficient to use statistics that represent the mean and the variance of a population to determine the unknown population parameters. It is therefore customary for sampling theory to concentrate on the sampling distributions of the mean and of the variance. And now, because statistics are functions of random variables, determining sampling distributions requires techniques for obtaining distributions of functions of random variables.
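For instance, Eqs (13.6) and (13.7) suggest an immediate recipe for the gamma population: compute the mean and variance from a sample, then back out α and β. A minimal Python sketch (using the sample mean and variance as stand-ins for μ and σ², which anticipates the estimation ideas of Chapter 14):

import numpy as np

def gamma_params_from_moments(x):
    """Back out gamma parameters via Eqs (13.6) and (13.7)."""
    mu = np.mean(x)
    var = np.var(x, ddof=1)
    return mu**2 / var, var / mu     # alpha, beta

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=3.0, size=10_000)
print(gamma_params_from_moments(x))  # approximately (2.0, 3.0)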
13.2
Distributions of Functions of Random Variables

The general problem of interest may be stated as follows: given the joint pdf for n random variables X1, X2, . . . , Xn, find the pdf fY(y) for the random variable Y defined as

Y = g(X1, X2, . . . , Xn)    (13.8)

13.2.1
General Overview

Typical examples of the functions of interest include:

Y1 = X1 + X2    (13.9)
Y2 = X1 / (X1 + X2)    (13.10)
13.2.2
Some Important Sampling Distribution Results

As we will soon see, many (but not all) classical statistical inference problems involve sampling from distributions that are either exactly Gaussian or approximately so. The following results are therefore particularly important.

1. Linear Combinations of Gaussian Random Variables: Let X1, X2, . . . , Xn be n mutually stochastically independent Gaussian random variables, with Xi ~ N(μi, σi²), and let Y be the linear combination

Y = k1X1 + k2X2 + · · · + knXn    (13.12)

Then Y is also Gaussian distributed, with mean

μy = k1μ1 + k2μ2 + · · · + knμn    (13.13)

and variance

σy² = Σ_{i=1}^{n} ki²σi²    (13.14)

In the special case where the random variables are identically distributed, with μi = μ and σi² = σ² for all i, these become

μy = μ Σ_{i=1}^{n} ki    (13.15)
σy² = σ² Σ_{i=1}^{n} ki²    (13.16)

and if, in addition, ki = 1/n, so that Y is the sample mean X̄, then:

μy = μ    (13.17)
σy² = σ²/n    (13.18)

Even if the distributions of Xi are not Gaussian, but the means and variances are still μi and σi², respectively, clearly, the resulting distribution of Y will be determined by the underlying distributions of Xi, but its mean and variance, μy and σy², will still be as given in Eqs (13.17) and (13.18). These results are also fairly straightforward to establish (see Exercise 13.5).
2. Sum of Squares of Standard Normal Variables: Consider a random sample of size n from a Gaussian N(μ, σ²) distribution, X1, X2, . . . , Xn; the random variable

Y = Σ_{i=1}^{n} ( (Xi − μ)/σ )²    (13.19)

has a χ²(n) distribution.

3. Sum of Chi-Square Random Variables: Consider n mutually stochastically independent random variables, X1, X2, . . . , Xn, with respective pdfs χ²(r1), χ²(r2), . . . , χ²(rn); the random variable

Y = X1 + X2 + · · · + Xn    (13.20)

has a χ²(r) distribution, with degrees of freedom

r = r1 + r2 + · · · + rn    (13.21)
13.3
Sampling Distribution of the Mean

Given a random sample X1, X2, . . . , Xn, the sample mean,

X̄ = (1/n) Σ_{i=1}^{n} Xi    (13.22)

is a random variable whose specific value, the actual sample average, x̄, will vary from one specific random sample to another. What is the theoretical distribution of this random variable, X̄?
13.3.1
Underlying Probability Distribution Known

If we know f(x), the distribution of the population from which the sample is drawn, we can use the techniques discussed in Chapter 6 (and mentioned above) to obtain the sampling distribution of the mean, X̄. The next two examples illustrate this point for the Gaussian pdf and the exponential pdf.
Example 13.1: SAMPLING DISTRIBUTION: MEAN OF RANDOM SAMPLE FROM GAUSSIAN DISTRIBUTION
If X1, X2, . . . , Xn is a random sample from a Gaussian distribution with mean μ and variance σ², find the distribution of X̄ defined in Eq (13.22).

Solution:
First, X1, X2, . . . , Xn is a random sample from the same Gaussian distribution, whose characteristic function is:

φ(t) = exp( jμt − σ²t²/2 )    (13.23)

By virtue of the independence of the random variables, and employing results about the characteristic function of linear sums of independent random variables, the characteristic function of X̄ is obtained as

φX̄(t) = Π_{i=1}^{n} exp( jμ(t/n) − σ²t²/(2n²) )    (13.24)

This product of n identical exponentials becomes an exponential of the sums of the terms in the winged brackets, and finally simplifies to give:

φX̄(t) = exp( jμt − (σ²/n)t²/2 )    (13.25)

which we recognize immediately as the characteristic function of a Gaussian random variable with mean μ, and variance σ²/n. We therefore conclude that, in this case, X̄ ~ N(μ, σ²/n), i.e. that the sampling distribution of the mean of a random sample of size n from a N(μ, σ²) distribution is also a Gaussian distribution with the same mean, but with variance σ²/n.
Example 13.2: SAMPLING DISTRIBUTION: MEAN OF RANDOM SAMPLE FROM EXPONENTIAL DISTRIBUTION
If X1, X2, ..., Xn is a random sample from an exponential distribution, E(β), find the distribution of X̄ defined in Eq (14.69).
Solution:
Again, as with Example 13.1, since (X1, X2, ..., Xn) is a random sample from the same exponential distribution, we begin by recalling the characteristic function for the exponential random variable:

    φ(t) = 1/(1 − jβt)        (13.26)

By virtue of the independence of the random variables, and employing results about the characteristic function of linear sums of independent random variables, the characteristic function of X̄ is obtained as

    φ_X̄(t) = ∏_{i=1}^{n} 1/(1 − jβt/n)        (13.27)

      = 1/(1 − jβt/n)^n        (13.28)

which we recognize as the characteristic function of a gamma random variable, i.e.,

    X̄ ~ γ(n, β/n)        (13.29)

with mean and variance

    μ_X̄ = β        (13.30)

    σ²_X̄ = β²/n        (13.31)
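(As a quick numerical check of this result, the following minimal Python sketch, with arbitrary illustrative values for β and n, draws repeated exponential samples and compares empirical quantiles of X̄ against those of the γ(n, β/n) distribution.)

    import numpy as np
    from scipy import stats

    # Illustrative (assumed) values: beta is the exponential mean, n the sample size
    rng = np.random.default_rng(1)
    beta, n, reps = 4.0, 10, 100_000

    xbars = rng.exponential(scale=beta, size=(reps, n)).mean(axis=1)

    # Theory (Example 13.2): Xbar ~ gamma with shape n and scale beta/n,
    # so E[Xbar] = beta and Var[Xbar] = beta**2/n (Eqs 13.30, 13.31)
    g = stats.gamma(a=n, scale=beta / n)
    for q in (0.025, 0.5, 0.975):
        print(q, round(np.quantile(xbars, q), 3), round(g.ppf(q), 3))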
2. What is true for these two example random variables is true in general, regardless of the underlying distribution (although we have not proved this formally).

3. The implications are as follows: the expectation of the sample mean is identical to the population mean; and the variance of the sample mean goes to zero as n → ∞. In anticipation of a more detailed discussion in Chapter 14, we note that the sample mean appears to have desirable properties that recommend its use in determining the true value of the unknown population mean.
13.3.2 Underlying Probability Distribution Unknown

If the form of the pdf, f(x), underlying a population is unknown, we cannot, in general, determine the full sampling distribution for any random sample drawn from such a population. Nevertheless, the following information is still available, regardless of the underlying pdf:

If the random sample (X1, X2, ..., Xn) comes from a population with mean μ and variance σ², but whose full pdf is unknown, then the sample mean X̄ is a random variable whose mean μ_X̄ and variance σ²_X̄ are given by the following expressions:

    μ_X̄ = μ        (13.32)

    σ²_X̄ = σ²/n  (infinite population);
    σ²_X̄ = (σ²/n)(N − n)/(N − 1)  (finite population of size N)        (13.33)

The positive square root of the variance of the sampling distribution,

    σ_X̄ = σ/√n        (13.34)

is known as the standard error of the mean.
13.3.3 Limiting Distribution of the Mean: The Central Limit Theorem

As shown in the last subsection, if the underlying population pdf is unknown, the full sampling distribution of X̄, the mean of a random sample drawn from this population, cannot be determined; but the mean and variance of the sampling distribution are known. Nevertheless, even though the complete sampling distribution is unknown in general, we know the limiting distribution (as n → ∞) of a closely related random variable, the standardized mean defined as:

    Z = (X̄ − μ)/(σ/√n)        (13.35)
The Central Limit Theorem (CLT): Let X̄ be the mean of the random sample (X1, X2, ..., Xn) taken from a population with mean, μ, and (finite) variance σ². Define the random variable Z according to Eq (13.35); then the pdf of Z, f(z), tends to N(0, 1), a standard normal distribution, in the limit as n → ∞.
Remarks:
1. Regardless of the distribution underlying the original population from which the random sample was drawn, the distribution of the sample mean approaches a normal distribution as n → ∞. In fact, for n as small as 25 or 30, the normal approximation can be quite good.
2. The random variable √n(X̄ − μ)/σ is approximately distributed N(0, 1); it will therefore be possible to employ the standard normal distribution in computing approximate probabilities involving sample means.
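(To make the theorem concrete, here is a minimal simulation sketch, with arbitrary illustrative values: sample means from a heavily skewed exponential population are standardized as in Eq (13.35), and their quantiles are compared with those of N(0, 1).)

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    mu = sigma = 1.0                  # exponential population: mean = sd = 1
    n, reps = 30, 50_000

    xbar = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)
    z = (xbar - mu) / (sigma / np.sqrt(n))    # standardized mean, Eq (13.35)

    # Empirical quantiles of Z vs. the standard normal quantiles
    for q in (0.05, 0.25, 0.5, 0.75, 0.95):
        print(q, round(np.quantile(z, q), 2), round(stats.norm.ppf(q), 2))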
FIGURE 13.1: Sampling distribution for mean lifetime of DLP lamps in Example 13.3, used to compute P(5100 < X̄ < 5200) = P(−0.66 < Z < 1.34)

Example 13.3: AVERAGE LIFETIME OF DLP LAMPS
A random sample of 40 DLP projector lamps is to be tested. (1) Determine the probability that the average lifetime of the sampled lamps will lie between 5,100 and 5,200 hours by chance alone, if the lamps truly came from a collection (population) with μ = 5,133 and σ = 315. (2) Determine the probability that the sample average lifetime will be less than 5,000 hours.
Solution:
(1) The problem requires that we determine P(5100 < X̄ < 5200), but with no specified pdf, we cannot calculate this probability directly. Nevertheless, knowing that the standardized mean, Z, has a N(0, 1) distribution allows us to compute the approximate probability as follows:

    P(5100 < X̄ < 5200) = P( (5100 − 5133)/(315/√40) < Z < (5200 − 5133)/(315/√40) )
                        = P(−0.66 < Z < 1.34) = 0.655        (13.36)

where the indicated probability has been computed directly from the computer program MINITAB using the cumulative probability calculation option for a Normal distribution (Calc > Prob Dist > Normal), with mean = 0, standard deviation = 1, to obtain FZ(1.34) = 0.910 and FZ(−0.66) = 0.255, yielding the indicated result (see Fig 13.1). Tables of standard normal probabilities could also be used to obtain this result.
Thus there is a 65.5% chance that the actual average lifetime will be between 5,100 and 5,200 hours if the lamps truly came from a population with μ = 5,133 and σ = 315.
(2) Employing the same approximate N(0, 1) distribution for the standardized mean, we are able to compute the required P(X̄ < 5000) as follows:

    P(X̄ < 5000) = P( √n(X̄ − μ)/σ < √40(5000 − 5133)/315 )
                 = P(Z < −2.67) = 0.004        (13.37)
FIGURE 13.2: Sampling distribution for average lifetime of DLP lamps in Example 13.3, used to compute P(X̄ < 5000) = P(Z < −2.67)
where the probability is obtained directly from MINITAB, again using the cumulative probability calculation option for a Normal distribution (Calc > Prob Dist > Normal), with mean = 0, standard deviation = 1, to obtain FZ(−2.67) = 0.004 (see Fig 13.2).
And now, because the probability of obtaining by chance alone a sample of 40 lamps with such a low average lifetime (< 5,000 hours) from a population purported to have a much higher average lifetime is so small, it is very doubtful that this sample came from the postulated population. It appears more likely that the result from this sample is more representative of the true lifetimes of the lamps. If true, then the practical implication is that there is reason to doubt the claim that these DLP projector lamps truly merit the purported "long lifetime" characterization.
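(For readers without MINITAB, a minimal scipy sketch reproducing the two cumulative probability calculations of Example 13.3; the numbers are those quoted in the example.)

    import numpy as np
    from scipy import stats

    mu, sigma, n = 5133.0, 315.0, 40
    se = sigma / np.sqrt(n)            # standard error of the mean

    # Part (1): P(5100 < Xbar < 5200) via the standardized mean Z ~ N(0, 1)
    z_lo, z_hi = (5100 - mu) / se, (5200 - mu) / se
    p1 = stats.norm.cdf(z_hi) - stats.norm.cdf(z_lo)   # ~0.655

    # Part (2): P(Xbar < 5000)
    p2 = stats.norm.cdf((5000 - mu) / se)              # ~0.004
    print(round(p1, 3), round(p2, 3))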
13.3.4 σ Unknown

When the population standard deviation σ is unknown, it may be replaced by S, the positive square root of the sample variance,

    S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1)        (13.38)

in which case, for a random sample drawn from a Gaussian N(μ, σ²) population, the statistic

    T = (X̄ − μ)/(S/√n)        (13.39)

has a Student's t-distribution with ν = n − 1 degrees of freedom.
Remarks:
1. We encountered this result in Chapter 9 (Section 9.3.5) during our discussion of probability models for continuous random variables.
2. This result is somewhat more general than the Central Limit Theorem because it does not require knowledge of σ²; conversely, it is less general in that it requires the normality assumption for the underlying distribution, which the CLT does not require.
3. The result holds exactly for any n: under the normality assumption, the pdf of T is exactly the t-distribution regardless of sample size n; the CLT, on the other hand, prescribes a limiting distribution as n → ∞.
4. As noted earlier, the t-distribution approaches the standard normal distribution as ν (hence, n) → ∞.
5. The normality assumption is not too severe, however; when samples are from non-normal populations, the distribution of the T statistic is still fairly close to the Student's t-distribution.
Let us illustrate the application of this result with the following example.
Example 13.4: MEAN DIAMETER OF BALL BEARINGS
A manufacturer of low-precision 10 mm diameter ball bearings periodically takes samples and measures them to confirm that the manufacturing process is still on target. A random sample of 20 ball bearings resulted in diameter measurements with an average of x̄ = 9.86 mm and a standard deviation s = 1.01 mm. Postulating that the random sample X1, X2, ..., X20 is from a Gaussian distribution with μ = 10, find the probability that any sample mean, X̄, will fall to either side of the postulated mean by the observed amount or more, by chance alone, i.e. P(X̄ ≤ 9.86) or P(X̄ ≥ 10.14).
Solution:
Had the population variance, σ², been given, we could have obtained the sampling distribution for X̄ precisely (as another normal distribution with μ = 10 and variance σ²/20); these results could have been used directly to compute the required probabilities. However, since the population variance is not given, the required probability P(X̄ ≤ 9.86) + P(X̄ ≥ 10.14) is determined using the T statistic:

    P(X̄ ≤ 9.86) + P(X̄ ≥ 10.14) = P( √n|X̄ − μ|/S ≥ √20|9.86 − 10|/1.01 )
                                 = P(T ≤ −0.62) + P(T ≥ 0.62)
                                 = 0.542        (13.40)

Again, the probabilities are obtained directly from MINITAB using the cumulative probability computation option for the t-distribution, with ν = 19 degrees of freedom, to give FT(−0.62) = 0.271 and, by symmetry, P(T ≥ 0.62) = 0.271, to obtain the result in Eq (13.40). (See Fig 13.3.)
The implication is that, under the postulate that the ball bearings are truly 10 mm in diameter, there is a fairly substantial 54% chance that the observed sample average misses the target of 10 mm to the left (comes in as 9.86 or less) or to the right by the same amount (comes in as 10.14 or more) purely at random. In other words, by pure chance alone, one would expect to see this sort of observed deviation of the sample mean from the true postulated (target) mean diameter value of 10 mm more than half the time. The inclination therefore is to conclude that there is no evidence in this sample data that the process is off-target.
Again, as with the previous example, this one also provides a preview of what
is ahead in Chapters 14 and 15.
FIGURE 13.3: Sampling distribution of the mean diameter of ball bearings in Example 13.4, used to compute P(|X̄ − 10| ≥ 0.14) = P(|T| ≥ 0.62)
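(A minimal scipy rendition of the t-distribution computation in Example 13.4; the exact two-sided tail probability comes from the t cdf with 19 degrees of freedom.)

    import numpy as np
    from scipy import stats

    n, xbar, s, mu0 = 20, 9.86, 1.01, 10.0
    t_obs = np.sqrt(n) * abs(xbar - mu0) / s    # ~0.62

    # Two-sided tail area: P(T <= -t_obs) + P(T >= t_obs), with df = n - 1
    p = 2 * stats.t.cdf(-t_obs, df=n - 1)       # ~0.54
    print(round(t_obs, 2), round(p, 3))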
13.4 Sampling Distribution of the Variance

For a random sample X1, X2, ..., Xn drawn from a Gaussian N(μ, σ²) population, with the sample variance defined as

    S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1)        (13.42)

the random variable

    C = (n − 1)S²/σ²        (13.43)

has a χ²(n − 1) distribution. Such a distribution can be used to compute probabilities concerning the random variable, S². Unfortunately, nothing so explicit can be said about sampling distributions of variances for random samples drawn from non-normal populations.
Furthermore, for two independent random samples of sizes n1 and n2 drawn from two normal populations having the same variance, with respective sample variances S1² and S2², the ratio

    F = S1²/S2²        (13.44)

has an F(n1 − 1, n2 − 1) distribution.
FIGURE 13.4: Sampling distribution for the variance of ball bearing diameters in Example 13.5, used to compute P(S ≥ 1.01) = P(C ≥ 23.93)

The implication is that there is a 19.9% chance of obtaining, purely at random, sample standard deviations that are 1.01 or higher, even when the process is operating as designed.
Example 13.6: VARIANCE OF BALL BEARINGS DIAMETER MEASUREMENTS: TWO RANDOM SAMPLES
A second random sample of 20 ball bearings, taken a month after the sample examined in Examples 13.4 and 13.5, yielded an average measured diameter of x̄2 = 10.03 mm and a standard deviation s2 = 0.85 mm. Find the probability that the process operation remains essentially unchanged, in terms of the observed sample standard deviation, even though the newly observed sample standard deviation is less than the value observed a month earlier. All assumptions in Example 13.4 hold.
Solution:
In this case, we are concerned with comparing two sample variances: S1² from the previous month (with the specific value (1.01)²) and S2², the most recent, with specific value (0.85)². Since the real question is not whether one value is greater than the other, but whether the two sample variances are equal or not, we use the F-distribution with degrees of freedom (19, 19) to obtain the probability that F ≥ (1.01)²/(0.85)² (or, vice versa, that F ≤ (0.85)²/(1.01)²). The required probability is obtained as:

    P(F ≥ 1.41) + P(F ≤ 0.709) = 0.460

The indicated probabilities are, once again, obtained directly from MINITAB for the F-distribution with ν1 = ν2 = 19 degrees of freedom. (See Fig 13.5.)

FIGURE 13.5: Sampling distribution for the two variances of ball bearing diameters in Example 13.6, used to compute P(F ≥ 1.41) + P(F ≤ 0.709)

The implication is that there is almost a 50% chance that the observed difference between the two sample variances will occur purely at random. We conclude, therefore, that there does not appear to be any evidence that the process operation has changed in the past month since the last sample was taken.
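(The corresponding scipy computation, a minimal sketch of the same two-tailed F probability:)

    from scipy import stats

    s1, s2, df = 1.01, 0.85, 19
    f_hi = (s1 / s2) ** 2                      # ~1.41
    f_lo = (s2 / s1) ** 2                      # ~0.709

    # P(F >= f_hi) + P(F <= f_lo) for F(19, 19)
    p = stats.f.sf(f_hi, df, df) + stats.f.cdf(f_lo, df, df)
    print(round(p, 3))                         # ~0.46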
With these concepts in place, we are now in a position to discuss the two
aspects of statistical inference, beginning with Estimation in the next chapter,
followed by Hypothesis Testing in Chapter 15.
13.5 Summary and Conclusions

This chapter has been devoted to the central concepts of sampling theory: the random sample, the statistics computed from it, and techniques for determining the distribution of such statistics. In particular, since in inductive statistics (as we shall soon discover in Chapters 14 and 15) the two most important statistics are the sample mean and sample variance, the distributions of these quantities were characterized under various conditions, giving rise to results, some general, others specific only to normal populations, that are used extensively in the next two chapters. Of all the general results, the one used most frequently arises from the Central Limit Theorem, through which we know that, regardless of the underlying distribution, as the sample size tends to infinity, the sampling distribution of the sample mean tends to the Gaussian distribution. This collection of results provides the foundation for the next two chapters.
Here are some of the main points of the chapter again.

• Sampling is concerned with the probabilistic characterization of finite-size samples drawn from a population; it is the statistical analog to the probability problem of characterizing individual observations of a random variable with a pdf.

• The central concepts in sampling are:
  – The random sample: in principle, n independent random variables drawn from a population in such a way that each one has an equal chance of being drawn; the mathematical consequence is that if f(xi) is the pdf for the ith random variable, then the joint pdf of the random sample is a product of the contributing pdfs;
  – The statistic: a function of one or more random variables that does not contain an unknown population parameter;
  – The sampling distribution: the probability distribution of a statistic of interest; its determination is significantly facilitated if the statistic is a function of a random sample.

• As a consequence of the central limit theorem, we have the general result that, as n → ∞, the distribution of the mean of a random sample drawn from any population with known mean and variance is Gaussian, greatly enabling the probabilistic analysis of means of large samples.
REVIEW QUESTIONS
1. What is sampling?
2. As presented in Section 13.1, what are the three central concepts of sampling theory?
3. What is a random sample? And what is the mathematical implication of the statement that X1, X2, ..., Xn constitutes a random sample from a distribution that has a pdf f(x)?
4. What is a statistic?
5. What is a sampling distribution?
6. What is the primary utility of a statistic and its distribution?
7. How is the discussion in Chapter 6 helpful in determining sampling distributions?
8. What is the sampling distribution of a linear combination of n independent Gaussian random variables with identical pdfs?
9. What is the sampling distribution of a sum of n independent χ²(r) random variables with identical pdfs?
10. If X̄ is the mean of a random sample of size n from a population with mean μX and variance σX², what is E(X̄) and what is Var(X̄)?
11. What is the standard error of the mean?
12. What is the central limit theorem as stated in Section 13.3? What are its implications in sampling theory?
13. In sampling theory, under what circumstances will the t-distribution be used instead of the standard normal distribution?
14. What is the sampling distribution of the variance of a random sample of size n drawn from a Gaussian distribution?
15. What is the sampling distribution of the ratio of the variances of two sets of independent random samples of sizes n1 and n2, each drawn from two normal populations having the same variance?
EXERCISES
Section 13.1
13.1 Given X1, X2, ..., Xn, a random sample from a population with mean μ and variance σ², both unknown, determine which of the following functions of this random sample is a statistic and which is not.
(i) Y1 = (∏_{i=1}^{n} Xi)^{1/n};
(ii) Y2 = Σ_{i=1}^{n} (Xi − μ)²;
(iii) Y3 = Σ_{i=1}^{n} ωiXi; with Σ_{i=1}^{n} ωi = 1;
(iv) Y4 = Σ_{i=1}^{n} Xi/σ.
If the population mean μ is specified, how will this change your answer?
13.2 Given X1, X2, ..., Xn, a random sample from a population with mean μ and variance σ², both unknown, define the following statistic:

    X̄ = (1/n) Σ_{i=1}^{n} Xi

as the sample mean. Determine which of the following functions of the random variables are statistics and which are not:
(i) Y1 = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1);
(ii) Y2 = Σ_{i=1}^{n} (Xi − μ)²/n;
(iii) Y3 = Σ_{i=1}^{n} |Xi − X̄|^k/(n − 1), k > 0;
(iv) Y4 = Σ_{i=1}^{n} (Xi − X̄)/σ.
13.3 For each of the following distributions, given the population mean μ and variance σ², derive the appropriate expressions for obtaining the actual pdf parameters (α, β, or n, p) in terms of μ and σ²: (i) Gamma(α, β); (ii) Beta(α, β); (iii) Binomial(n, p).
Section 13.2
13.4 Given n mutually stochastically independent random variables, X1, X2, ..., Xn, with respective pdfs N(μ1, σ1²), N(μ2, σ2²), ..., N(μn, σn²):
(i) Determine the distribution of the statistic

    Y = k1X1 + k2X2 + ... + knXn

where k1, k2, ..., kn are real constants; and show that it is a Gaussian distribution, N(μy, σy²), where

    μy = k1μ1 + k2μ2 + ... + knμn
    σy² = Σ_{i=1}^{n} ki²σi²

13.5 Given a random sample X1, X2, ..., Xn from a Gaussian N(μ, σ²) distribution, show that the mean and variance of the sample mean,

    X̄ = (1/n) Σ_{i=1}^{n} Xi

are given by

    μy = μ
    σy² = σ²/n

and hence establish the results given in Eqs (13.17) and (13.18) in the text.
13.6 Given a random sample X1, X2, ..., Xn from a Gaussian N(μ, σ²) distribution, show that the random variable

    Y = Σ_{i=1}^{n} [ (Xi − μ)/σ ]²

has a χ²(n) distribution.

13.7 Given n mutually stochastically independent random variables, X1, X2, ..., Xn, with respective pdfs χ²(r1), χ²(r2), ..., χ²(rn), show that the random variable

    Y = X1 + X2 + ... + Xn

has a χ²(r) distribution with degrees of freedom

    r = r1 + r2 + ... + rn

13.8 Given a random sample X1, X2, ..., Xn from a Poisson P(λ) distribution, determine the sampling distribution of the random variable defined as

    Y = (1/n) Σ_{i=1}^{n} Xi
Section 13.3
13.10 Given that X1, X2, ..., Xn constitute a random sample from a population with mean μ and variance σ², define two statistics representing the sample mean as follows:

    X̄ = (1/n) Σ_{i=1}^{n} Xi        (13.46)

    X̃ = Σ_{i=1}^{n} ωiXi; with Σ_{i=1}^{n} ωi = 1        (13.47)

where the first is a regular mean and the second is a weighted mean. Show that E(X̄) = μ and also that E(X̃) = μ; but that Var(X̄) ≤ Var(X̃).

13.11 If X1, X2, ..., Xn is a random sample from a Poisson P(λ) distribution, find the distribution of X̄, the sample mean defined in Eq 14.69, as well as E(X̄) and Var(X̄). (Hint: See Example 6.1.)
13.13 If X1, X2, ..., Xn is a random sample from a Lognormal L(α, β) distribution,
(i) Find the distribution of the geometric mean

    X̄g = ( ∏_{i=1}^{n} Xi )^{1/n}        (13.48)

(ii) Determine E(ln X̄g) and Var(ln X̄g).
13.14 Given a random sample of 10 observations drawn from a Gaussian population with mean 100 and variance 25, compute the following probabilities regarding the sample mean, X̄:
(i) P(X̄ ≥ 100); (ii) P(X̄ ≤ 100); (iii) P(X̄ ≥ 104.5); (iv) P(96.9 ≤ X̄ ≤ 103.1); (v) P(98.4 ≤ X̄ ≤ 101.6). Will the sample size make a difference in the distribution used to compute the probabilities?
13.15 Refer to Exercise 13.14. This time, the population variance is not given; instead, the sample variance was obtained as 24.7 from the 10 observations.
(i) Compute the five probabilities.
(ii) Recompute the probabilities if the sample size increased to 30 but the sample variance remained the same.
13.16 A sample of size n is drawn from a large population with mean μ and variance σ² but unknown distribution;
(i) Determine the mean and variance of the sample mean when n = 10; μ = 50; σ² = 20;
(ii) Determine the mean and variance of the sample mean when n = 20; μ = 50; σ² = 20;
(iii) Determine the probability that a sample mean obtained from a sample of size n = 50 will not deviate from the population mean by more than 3. State any assumption you may need to make in answering this question.

13.17 A random sample of size n = 50 is taken from a large population with mean 15 and variance 4, but unknown distribution.
(i) What is the standard deviation σX̄ of the sample mean?
(ii) If the sample size were reduced by 50%, what will be the new standard deviation σX̄ of the sample mean?
(iii) To reduce the standard deviation to 50% of the value in (i), what sample size will be needed?
Section 13.4
13.18 The variance of a sample of size n = 20 drawn from a normal population with mean 100 and variance 10 was obtained as s² = 9.5.
(i) Determine the probability that S² ≥ 10.
APPLICATION PROBLEMS
13.22 The following data set, from a study by Lucas (1985)¹, shows the number of accidents occurring per quarter (three months) at a DuPont company facility, over a 10-year period. The data set has been partitioned into two periods: Period I is the first five-year period of the study; Period II, the second five-year period.

    Period I            Period II
    5   5  10   8       3   4   2   0
    4   5   7   3       1   3   2   2
    2   8   6   9       7   7   1   4
    5   6   5  10       1   2   2   1
    6   3   3  10       4   4   4   4

(i) Why is a Poisson pdf a reasonable model for representing this data?
(ii) Treat the entire 10-year data as a single block, and the data shown as specific observations {xi}, i = 1, 2, ..., 40, of a random sample Xi; i = 1, 2, ..., 40, from a single Poisson population with unknown parameter λ. Obtain a numerical value for the sample mean, X̄; use it as an estimate of the unknown population parameter, λ, to produce

¹Lucas (1985).
(i) ... obtain the sampling distribution for the sample mean in terms of the unknown population parameter, β, and the sample size, n.
(ii) The company claims that its true mean waiting time between safety incidents is 40 days. From the sampling distribution and the specific value x̄ obtained in (i), determine P(X̄ ≤ x̄ | β = 40).
(iii) An independent safety auditor estimates that, from the company's records, operating procedures, and other performance characteristics, the true mean waiting time between safety incidents is more likely to be 30 days. Determine P(X̄ ≤ x̄ | β = 30). Compare this probability to the one obtained in (ii) and comment on which postulated value for the true mean waiting time is more believable: β = 40, as claimed by the company, or β = 30, as claimed by the independent safety auditor.
13.27 Refer to Application Problem 13.26. From the pdf of the exponential E(40) random variable, determine P(X ≤ 10). Recompute P(X ≤ 10) using the pdf for the exponential E(30) random variable. Which of these results is more consistent with the data set S1? Comment on what this implies about the more likely value for the population parameter.
13.28 Refer to the data in Table 1.1 in Chapter 1, which shows 50 samples of the random variables YA and YB, yields obtained from each of two competing chemical processes. Specific values were obtained for the sample means in Chapter 1 as ȳA = 75.52 and ȳB = 72.47. Consider the proposition that the YA data is from a normal population with the distribution N(75.5, 1.5²).
(i) Determine the sampling distribution for ȲA and from it compute P(75.0 ≤ ȲA ≤ 76.0).
(ii) Consider the proposition that there is no difference between the yields obtainable from each process. If this is so, then the YB data should also be from the same normal population as that specified for YA. Using the sampling distribution obtained in (i), compute P(ȲB ≤ 72.47). Comment on what this implies about the plausibility of this proposition.
(iii) Using the sampling distribution in (i), determine the value of y0 for which

    P(ȲA ≤ y0) = 0.05

Compare this value with the computed value, ȳB = 72.47. What does this result imply about the plausibility of the proposition that both data sets come from the same population with the same distribution?
13.29 Refer to Application Problem 13.28; this time consider an alternative proposition that in fact the YB data is from a normal population with the distribution N(72.5, 2.5²). Determine the sampling distribution for ȲB and from it compute P(ȲB ≤ 72.47) as well as P(72.0 ≤ ȲB ≤ 73.0). Comment on what these results imply about the plausibility of this alternative proposition.
13.30 A manufacturer of 10 mm diameter ball bearings uses a process which, when operating properly, is calibrated to produce ball bearings with mean diameter μ = 10.00 mm and standard deviation σ = 0.10 mm. In order to evaluate the performance of the process at a particular point in time, a random sample of n ball bearings is to be taken and the diameters determined in a quality control laboratory. Determine a center line representing μ = 10.00; an upper limit line set at (10 + 3σ/√n); and a lower limit line set at (10 − 3σ/√n).
    71  72  75  77  65
    84  69  78  73  69
    66  70  68  77  71
    73  74  65  68  75
Consider the postulate that these data came from the same normal population with mean μ = 70 but whose variance is unspecified.
(i) If this is true, what is the probability that the mean of any random sample of trainee scores will exceed 74? Interpret this result in light of the individual sample means of the Method A and Method B scores. How plausible is this postulate?
(ii) Now consider an alternative postulate that scores obtained by trainees instructed by Method B are actually drawn from a normal population with mean μB = 75. Determine the limits of the interval [(75 − 2sB/√10) ≤ X̄ ≤ (75 + 2sB/√10)] and the probability that the mean score of any other random sample of 10 from this population of trainees instructed by Method B will fall into this interval. Where does the value obtained for the sample mean X̄A lie in relation to this interval? Discuss the implications of these results on the plausibility of this new postulate.
Chapter 14
Estimation
With the sampling theory foundation now firmly in place, we are finally in a position to begin building the two-tiered statistical inference edifice, starting with the first tier, Estimation, in this chapter, and finishing with Hypothesis Testing in the next chapter. The focus in the first half of this chapter is on how to determine, from incomplete information in sample data, unknown population parameter values needed to complete the characterization of the random variable with the pdf f(x|θ). The focus in the second, complementary half is on how to quantify the unavoidable uncertainty arising from the variability in finite samples. Just as estimation theory relies on sampling theory, so does the theory of hypothesis testing rely on both estimation theory and sampling theory; the material in this chapter therefore also serves as an important link in the statistical inference chain.
14.1 Introductory Concepts

14.1.1 An Illustration
Consider an opinion pollster who states that 75% of undergraduate chemical engineering students in the United States prefer closed-book exams to opened-book ones, and adds a margin of error of ±8.5% to this statement. It is instructive to begin our discussion by looking into how such information is obtained and how such statements are really meant to be interpreted.
First, in the strictest possible sense of the formal language of statistics, the population of concern is the opinion of all undergraduate chemical engineering students in the United States. However, in this case, many often, but somewhat imprecisely, consider the population as the students themselves (perhaps because of how this aligns with the more prevalent sociological concept of population). Either way, observe that we are dealing with a technically finite population (there is, after all, an actual, finite and countable number of individuals and their opinions). Practically, however, the size of this population is quite large, and it is difficult (and expensive) to obtain the opinion of every single individual in this group. The pollster simply contacts a pre-determined number of subjects selected at random from this group, and asks for their individual opinions regarding the issue at stake: preference for closed-book versus opened-book exams. The premise is that there is a true, but unknown, proportion, θc, that prefers closed-book exams, and that results obtained from a sample of size n can be used to deduce what θc is likely to be.
Next, suppose that out of 100 respondents, 75 indicated a preference for closed-book exams, with the remaining 25 opting for the only other alternative. The main conclusion stated above therefore seems intuitively reasonable, since indeed this sample shows 75 out of 100 expressing a preference for closed-book exams. But we know that sampling a different set of 100 students will quite possibly produce a different number of closed-book preferences.
14.1.2 Main Concepts

Estimation is the process by which information about the value of a population parameter (such as θc in the opinion survey above) is extracted from sample data. Because estimation theory relies heavily on sampling theory, the samples used to provide population parameter estimates are required to be random samples drawn from the population of interest. As we show shortly, this assumption significantly simplifies estimation.
There are two aspects of estimation:
1. Point Estimation: the process for obtaining a single best value for a population parameter;
2. Interval Estimation: the process by which one obtains a range of values that will include the true parameter, along with an appended degree of confidence.
Thus, in terms of the opinion poll illustration above, the point estimate of θc is given as θ̂c = 0.75. (We have introduced the hat notation, θ̂, to differentiate an estimate from the true but unknown parameter.) On the other hand, the interval estimate will be rendered as θc = 0.75 ± 0.085, or 0.665 < θc < 0.835, to which should be appended "with 95% confidence" (even though this latter appendage is usually missing in statements made for the public press).
The problem at hand may now be formulated as follows: A random variable X has a pdf f(x; θ), whose form is known but the parameters it contains, θ, are unknown; to be able to analyze X properly, f(x; θ) needs to be completely specified, in the sense that the parameter set, θ, must be determined. This is done by inferring the value of the parameter vector, θ, from sample data, specifically from {x1, x2, ..., xn}, specific values of a random sample, X1, X2, ..., Xn, drawn from the population with the pdf f(x; θ).
The following are four key concepts that are central to estimation theory:
1. Estimator: Any statistic U(X1, X2, ..., Xn) used for estimating the unknown quantity θ, or g(θ), a function thereof;
2. Point Estimate: The actual observed value u(x1, x2, ..., xn) of the estimator, computed from specific observations x1, x2, ..., xn;
3. Interval Estimator: A pair of statistics, UL and UR, used to determine an interval expected to contain the unknown parameter;
4. Interval Estimate: The actual interval computed from specific observations, together with its associated degree of confidence.
14.2 Criteria for Selecting Estimators

By definition, estimators are statistics used to estimate unknown population parameters, θ, from actual observations. Before answering the question of how estimators (and estimates) are determined, we wish to consider first how to evaluate estimators. In particular, we will be concerned with what makes a good estimator, and what properties are desirable for estimators.

14.2.1 Unbiasedness

An estimator U(X1, X2, ..., Xn) is said to be an unbiased estimator of θ if

    E(U) = θ        (14.1)

Example 14.1: UNBIASEDNESS OF TWO ESTIMATORS OF A POPULATION MEAN
Given a random sample X1, X2, ..., Xn from a population with mean μ, consider (1) the sample mean,

    X̄ = (1/n) Σ_{i=1}^{n} Xi        (14.2)

and (2) the weighted average,

    X̃ = Σ_{i=1}^{n} ωiXi        (14.3)

where

    Σ_{i=1}^{n} ωi = 1        (14.4)

Show that both are unbiased estimators of μ.
FIGURE 14.1: Sampling distribution for the two estimators U1 and U2: U1 is the more efficient estimator because of its smaller variance
Solution:
(1) By definition of the expectation, we obtain from Eq (14.2):

    E[X̄] = (1/n) Σ_{i=1}^{n} E[Xi] = μ        (14.5)

so that X̄ is unbiased for μ. (2) Similarly, from Eq (14.3),

    E[X̃] = Σ_{i=1}^{n} ωiE[Xi] = μ Σ_{i=1}^{n} ωi = μ        (14.6)

provided that Σ_{i=1}^{n} ωi = 1, as required.

14.2.2 Efficiency
If U1(X1, X2, ..., Xn) and U2(X1, X2, ..., Xn) are both unbiased estimators for g(θ), then U1 is said to be the more efficient estimator if

    Var(U1) < Var(U2)        (14.7)

See Fig 14.1. The concept of efficiency roughly translates as follows: because of uncertainty, estimates produced by U1 and U2 will vary around the true value; however, estimates produced by U1, the more efficient estimator, will cluster more tightly around the true value than estimates produced by U2. To understand why this implies greater efficiency, consider a symmetric interval of arbitrary width, ±δ, around the true value g(θ): out of 100 estimates produced by each estimator, more of those produced by U1 will fall within this interval than of those produced by U2.
For any unbiased estimator U(n) of g(θ), based on a random sample of size n from a population with pdf f(x; θ), it can be shown that

    Var[U(n)] ≥ [g′(θ)]² / ( nE{ [∂ ln f(x; θ)/∂θ]² } )        (14.8)

generally known as the Cramér-Rao inequality, with the quantity on the RHS known as the Cramér-Rao (C-R) lower bound. The practical implication of this result is that no unbiased estimator U(n) can have variance lower than the C-R lower bound. An estimator with the minimum variance of all unbiased estimators (whether it achieves the C-R lower bound or not) is called a Minimum Variance Unbiased Estimator (MVUE).
Of the estimators in Example 14.1, X̄ is more efficient than X̃; in fact, it can be shown that X̄ is the most efficient of all unbiased estimators of μ.
14.2.3 Consistency

By now, we are well aware that samples are finite subsets of the populations from which they are drawn; and, as a result of unavoidable sampling variability, specific estimates obtained from various samples from the same population will not exactly equal the true values of the unknown population parameters they are supposed to estimate. Nevertheless, it would be desirable that, as the sample size increases, the resulting estimates become progressively closer to the true parameter value, until the two ultimately coincide as the sample size becomes infinite.
Mathematically, a sequence of estimators, Un(X), n = 1, 2, ..., where n is the sample size, is said to be a consistent estimator of g(θ) if

    lim_{n→∞} P( |Un(X) − g(θ)| < ε ) = 1        (14.9)

for every ε > 0. According to this definition, a consistent sequence of estimators will produce an estimate sufficiently close to the true parameter value if the sample size is large enough.
Recall the use of Chebyshev's inequality in Chapter 8 to establish the (weak) law of large numbers, specifically: that the relative frequency of success (the number of successes observed per n trials) approaches the actual probability of success, p, as n → ∞.
14.3 Methods of Point Estimation

14.3.1 Method of Moments

The method of moments obtains estimators by equating theoretical population moments, mk = E[X^k], expressed as functions of the unknown parameters, to the corresponding sample moments,

    Mk = (1/n) Σ_{i=1}^{n} Xi^k        (14.11)

and solving the resulting equations for the unknown parameters.

Example 14.2: METHOD OF MOMENTS ESTIMATOR FOR AN EXPONENTIAL DISTRIBUTION
Let X1, X2, ..., Xn be a random sample from an exponential E(β) population. Estimate β, the unknown parameter, from this random sample using the method of moments.
Solution:
Since there is only one parameter to be estimated, only one moment equation is required. Let us therefore choose the first moment, which, by definition, is

    m1 = μ = E[X] = β        (14.15)

The sample analog of this theoretical moment is

    M1 = X̄ = (1/n) Σ_{i=1}^{n} Xi        (14.16)

Equating the two and solving for the unknown parameter yields

    β̂ = X̄        (14.17)

where the hat has been introduced to indicate an estimate and distinguish it from its true but unknown value.
Thus, the method of moments estimator for the exponential parameter β is:

    U_MM(X1, X2, ..., Xn) = (1/n) Σ_{i=1}^{n} Xi        (14.18)

When specific data sets are obtained, specific estimates of the unknown parameters are obtained from the estimators by substituting the observations x1, x2, ..., xn for the random variables X1, X2, ..., Xn, as illustrated in the following examples.
Example 14.3: SPECIFIC METHOD OF MOMENTS ESTIMATES: EXPONENTIAL DISTRIBUTION
The waiting time (in days) until the occurrence of a recordable safety incident in a certain company's manufacturing site is known to be an exponential random variable with an unknown parameter β. In an attempt to improve its safety record, the company embarked on a safety performance characterization program which involves, among other things, tracking the time in between recordable safety incidents.
(1) During the first year of the program, the following data set was obtained:

    S1 = {16, 1, 9, 34, 63, 44, 1, 63, 41, 29}        (14.19)

which translates as follows: 16 days elapsed before the first recordable event occurred; 1 day thereafter the second event occurred; the third occurred 9 days after, and the fourth, 34 days after, etc. From this data record, obtain a method of moments estimate of the parameter β, the mean time between safety incidents.
(2) The data record for the second year of the program is:

    S2 = {35, 26, 16, 23, 54, 13, 100, 1, 30, 31}        (14.20)

Obtain the new estimate of β and compare it with the first-year value.
Solution:
The method of moments estimates are the respective sample averages. From the first-year data,

    β̂1 = x̄1 = 30.1        (14.21)

and from the second-year data,

    β̂2 = x̄2 = 32.9        (14.22)

At this point, it is not certain whether the difference between the two estimates is due primarily to random variability in the sample or not (the sample size, n = 10, is quite small). Is it possible that a truly significant change has occurred in the company's safety performance, and that this is being reflected in the slight improvement in the average waiting time to the occurrence of recordable incidents observed in the second-year data? Hypothesis testing, the second aspect of statistical inference, provides tools for answering such questions objectively.
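(Computationally, the two estimates are simply the sample averages of the two data sets; a minimal numpy sketch for Example 14.3:)

    import numpy as np

    S1 = np.array([16, 1, 9, 34, 63, 44, 1, 63, 41, 29])
    S2 = np.array([35, 26, 16, 23, 54, 13, 100, 1, 30, 31])

    # Method of moments estimate of the exponential parameter beta:
    # equate m1 = E[X] = beta to the first sample moment, the sample mean.
    beta1_hat, beta2_hat = S1.mean(), S2.mean()
    print(beta1_hat, beta2_hat)   # 30.1 and 32.9, as in Eqs (14.21), (14.22)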
Consider next the problem of estimating the mean, μ, and standard deviation, σ, of a population from a random sample X1, X2, ..., Xn. In this case, two moment equations are required:

    m1 = E[X] = μ        (14.23)

    m2 = E[X²] = σ² + μ²        (14.24)

where this last equation merely recalls the fact that σ² = E[X²] − (E[X])². The sample analogs to these theoretical moment equations are:

    M1 = (1/n) Σ_{i=1}^{n} Xi = X̄        (14.25)

    M2 = (1/n) Σ_{i=1}^{n} Xi²        (14.26)
And now, by equating corresponding theoretical and sample moment equations and solving for the unknown parameters, we obtain, first for μ, the estimator

    U1 = (1/n) Σ_{i=1}^{n} Xi = X̄        (14.27)

and for σ, the estimator

    U2 = √( (1/n) Σ_{i=1}^{n} Xi² − (X̄)² )        (14.28)

Once again, given any specific observations, we obtain the actual estimates corresponding to the observations by substituting the data {x1, x2, ..., xn} into these estimator equations.
Remarks:
1. Method of moments estimators are not unique. In Example 14.2, we could have used the second moment instead of the first, so that the theoretical moment equation would have been

    E[X²] = Var[X] + (E[X])² = 2β²        (14.29)

As such, upon equating this to the sample moment and solving, we would have obtained:

    β̃ = √( (1/(2n)) Σ_{i=1}^{n} Xi² )        (14.30)

which, in general, will not equal the X̄ prescribed in Eq (14.17).
2. Thus, we cannot really talk about the method of moments estimator, only of a method of moments estimator. Note that we could just as easily have based this method on moments about the mean instead of the ordinary moments (about the origin).
3. Nevertheless, these estimators are consistent (under most reasonable conditions) in the sense that empirical moments converge (in probability) to the corresponding theoretical moments.
14.3.2 Maximum Likelihood

To introduce the method of maximum likelihood, consider a random sample X1, X2, ..., Xn drawn from a Poisson P(λ) population, with observations x1, x2, ..., xn; for each observation,

    f(xi; λ) = e^{−λ} λ^{xi}/xi!        (14.31)

so that the joint pdf of the random sample is

    f(x1, x2, ..., xn|λ) = (e^{−λ}λ^{x1}/x1!)(e^{−λ}λ^{x2}/x2!) ... (e^{−λ}λ^{xn}/xn!)        (14.32)

which simplifies to

    f(x1, x2, ..., xn|λ) = e^{−nλ} λ^{Σ_{i=1}^{n} xi} / (x1!x2! ... xn!)        (14.33)

When considered as a function of the unknown parameter λ, this joint pdf is known as the likelihood function:

    L(λ) = e^{−nλ} λ^{Σ_{i=1}^{n} xi} / (x1!x2! ... xn!)        (14.34)
The likelihood function provides an answer to the question of how probable the observed data set is, for all conceivable values of λ. It now seems entirely reasonable to seek a value of λ (say λ̂) that maximizes the likelihood of observing the values (x1, x2, ..., xn) and use it as an estimate of the unknown population parameter. Such an estimate is known as the maximum likelihood estimate. The interpretation is that, of all possible values one could entertain for the unknown parameter λ, this particular value, λ̂, yields the highest possible probability of obtaining the observations (x1, x2, ..., xn).
The algebra involved is considerably simplified by the fact that L(λ) and ℓ(λ) = ln{L(λ)}, the so-called log-likelihood function, have the same maximum. Thus, λ̂ is often determined by maximizing the log-likelihood function instead. In this specific case, we obtain, from Eq (14.34),

    ℓ(λ) = ln{L(λ)} = −nλ + ln λ Σ_{i=1}^{n} xi − Σ_{i=1}^{n} ln xi!        (14.36)

Differentiating with respect to λ and setting the result equal to zero yields:

    ∂ℓ(λ)/∂λ = −n + (1/λ̂) Σ_{i=1}^{n} xi = 0        (14.37)

which is solved to obtain the maximum likelihood estimate,

    λ̂ = (1/n) Σ_{i=1}^{n} xi        (14.38)

The corresponding maximum likelihood estimator is therefore

    U(X1, X2, ..., Xn) = (1/n) Σ_{i=1}^{n} Xi        (14.39)
Maximum Likelihood Estimate: Given X1, X2, ..., Xn, a random sample from a population whose pdf (continuous or discrete), f(x; θ), contains a vector of unknown characteristic parameters, θ, the likelihood function for this sample is given by:

    L(θ) = f(x1; θ)f(x2; θ) ... f(xn; θ)        (14.40)

the joint pdf of X1, X2, ..., Xn written as a function of the unknown θ. The value θ̂ that maximizes L(θ) is known as the maximum likelihood estimate (MLE) of θ. (The same value θ̂ maximizes ℓ(θ) = ln{L(θ)}.)

This general result is now illustrated below with several specific examples.
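(The maximization can also be carried out numerically. The sketch below, with made-up Poisson counts, confirms that the numerical maximizer of the Poisson log-likelihood coincides with the sample mean, as Eq (14.38) predicts.)

    import numpy as np
    from scipy import optimize, special

    x = np.array([3, 1, 4, 2, 2, 5, 0, 3])   # hypothetical Poisson counts

    def neg_loglik(lam):
        # -l(lambda) = n*lambda - ln(lambda)*sum(x) + sum(ln(x_i!))
        return len(x) * lam - np.log(lam) * x.sum() + special.gammaln(x + 1).sum()

    res = optimize.minimize_scalar(neg_loglik, bounds=(1e-6, 20), method="bounded")
    print(res.x, x.mean())   # the two agree: the MLE is the sample mean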
Example 14.5: MAXIMUM LIKELIHOOD ESTIMATE OF AN EXPONENTIAL DISTRIBUTION PARAMETER
Let X1, X2, ..., Xn be a random sample from an exponential population with pdf

    f(x; β) = (1/β) e^{−x/β}        (14.41)

Obtain the maximum likelihood estimate of the unknown parameter β.
Solution:
The likelihood function in this case is

    L(β) = ∏_{i=1}^{n} (1/β) e^{−xi/β} = (1/β^n) e^{−(Σ_{i=1}^{n} xi)/β}        (14.42)

From here, we easily obtain:

    ℓ(β) = ln L(β) = −n ln β − (1/β) Σ_{i=1}^{n} xi        (14.43)

so that

    ∂ℓ(β)/∂β = −n/β + (1/β²) Σ_{i=1}^{n} xi = 0        (14.44)

which is solved to give the maximum likelihood estimate as the sample average, β̂ = (1/n) Σ_{i=1}^{n} xi = x̄.
Similarly, for a random sample X1, X2, ..., Xn from a Gaussian N(μ, σ²) population, the log-likelihood function is

    ℓ(μ, σ) = −(n/2) ln(2π) − n ln σ − Σ_{i=1}^{n} (xi − μ)²/(2σ²)        (14.47)

Setting the partial derivative with respect to μ to zero gives

    ∂ℓ/∂μ = (1/σ²)( Σ_{i=1}^{n} xi − nμ ) = 0        (14.49)

and setting the partial derivative with respect to σ to zero gives

    ∂ℓ/∂σ = −n/σ + Σ_{i=1}^{n} (xi − μ)²/σ³ = 0        (14.50)

Eq (14.49) is solved to obtain the first result,

    μ̂ = (1/n) Σ_{i=1}^{n} xi        (14.51)

which, when introduced into Eq (14.50) and simplified, gives the second and final result:

    σ̂ = √( Σ_{i=1}^{n} (xi − μ̂)² / n )        (14.52)

Observe that the MLE of μ in Eq (14.51) is the same as the sample mean, but the MLE of σ in Eq (14.52) is not the same as the sample standard deviation,

    s = √( Σ_{i=1}^{n} (xi − x̄)² / (n − 1) )        (14.53)

(where x̄ is the sample mean). For large n, of course, the difference becomes negligible, but this illustrates an important point: because s², with its (n − 1) divisor, is unbiased for σ², the MLE in Eq (14.52), which uses the divisor n, is biased.
Important Characteristics of MLEs
Of the many characteristics of maximum likelihood estimates, the following are the two most important we wish to highlight:
1. Invariance: if θ̂ is the MLE of θ, then the MLE of any function g(θ) is g(θ̂);
2. Asymptotic behavior: as the sample size n → ∞, the MLE is asymptotically unbiased and efficient, with a sampling distribution that is asymptotically Gaussian.
Thus, according to the first property, from the MLE of the sample standard deviation in Eq (14.52), we immediately know that the MLE of the variance will be given by

    σ̂² = Σ_{i=1}^{n} (xi − μ̂)² / n        (14.54)

The second property makes large sample MLEs very attractive.
The following are a few more examples of MLEs.
Example 14.6: MLE OF A BINOMIAL/BERNOULLI PROBABILITY OF SUCCESS
Let X1, X2, ..., Xn be the result obtained from n Bernoulli trials where the probability of success is p, i.e.

    Xi = 1, with probability p;  Xi = 0, with probability q = 1 − p        (14.55)

If the random variable, X, is the total number of successes in n trials, X = Σ_i Xi, obtain a MLE for p.
Solution:
There are several ways of approaching this problem. The first approach is direct, from the perspective of the Bernoulli random variable; the second makes use of the fact that the pdf of a sum of n Bernoulli random variables is a Binomial random variable.
To approach this directly, we recall from Chapter 8 the compact pdf for the Bernoulli random variable,

    f(x) = p^{IS} (1 − p)^{IF}        (14.56)

where the success indicator, IS, is defined as:

    IS = 1 for X = 1;  IS = 0 for X = 0        (14.57)

and its complement, the failure indicator, IF, as:

    IF = 1 for X = 0;  IF = 0 for X = 1        (14.58)

The likelihood function for the random sample is therefore

    L(p) = p^{Σ_{i=1}^{n} Xi} (1 − p)^{n − Σ_{i=1}^{n} Xi}        (14.61)

Differentiating the log-likelihood with respect to p and setting the result to zero gives

    ∂ℓ(p)/∂p = (Σ_{i=1}^{n} Xi)/p − (n − Σ_{i=1}^{n} Xi)/(1 − p) = 0        (14.62)

i.e.,

    (Σ_{i=1}^{n} Xi)/p = (n − Σ_{i=1}^{n} Xi)/(1 − p)        (14.63)

which, when solved for p, yields the result:

    p̂ = (Σ_{i=1}^{n} Xi)/n        (14.64)

The second approach, based on the Binomial pdf of X, the total number of successes in n trials, leads to

    p̂ = X/n        (14.67)

exactly the same result as in Eq (14.64), since X = Σ_{i=1}^{n} Xi. Thus, the MLE for p is the total number of successes in n trials divided by the total number of trials, as one would expect.
14.4 Precision of Point Estimates

We have seen that the sample mean,

    X̄ = (1/n) Σ_{i=1}^{n} Xi        (14.69)

and the sample variance,

    S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1)        (14.70)

are both unbiased for the population mean, μ, and variance, σ², respectively. They are thus both accurate. For any specific set of observations, (x1, x2, ..., xn), the computed estimates,

    x̄ = (1/n) Σ_{i=1}^{n} xi        (14.71)

    s² = Σ_{i=1}^{n} (xi − x̄)²/(n − 1)        (14.72)

however, will vary from one sample to another; to quantify the precision of such estimates, we appeal to the corresponding sampling distributions.
FIGURE 14.2: Two-sided tail area probabilities of α/2 for the standard normal sampling distribution
From the sampling distribution of X̄ we may write

    P( −z_{α/2} < (X̄ − μ)/(σ/√n) < z_{α/2} ) = (1 − α)        (14.73)

where z_{α/2} is that value of z with a tail area probability of α/2, as illustrated in Fig 14.2. We may therefore state that, with probability (1 − α), the proximity of X̄ to μ is characterized by:

    |X̄ − μ|/(σ/√n) < z_{α/2}        (14.74)

For the specific case where α is chosen as 0.05, then z_{α/2}, the value of the standard normal random variable, z, for which the tail area probability is 0.025, is 1.96. As a result,

    |X̄ − μ|/(σ/√n) < 1.96        (14.75)

implying that

    |X̄ − μ| < 1.96 σ/√n        (14.76)

or, with probability 0.95,

    μ = X̄ ± 1.96 σ/√n        (14.77)
For the opinion poll illustration discussed at the beginning of this chapter, X, the number of respondents (out of n) preferring closed-book exams, is a binomial random variable, with

    E(X/n) = p        (14.79)

    σp̂² = Var(X/n) = np(1 − p)/n² = p(1 − p)/n        (14.80)

so that the point estimator

    p̂ = X/n        (14.81)

is unbiased for p, with p estimated as 0.75. Assuming that n = 100 is sufficiently large so that the standard normal approximation for {(p̂ − p)/σp̂} holds in this case, we obtain immediately from Eq (14.77) that, with probability 0.95,

    θc = p̂ = 0.75 ± 1.96 √(0.75 × 0.25)/10 = 0.75 ± 0.085        (14.82)

This is how the survey's margin of error quoted at the beginning of the chapter was determined.
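(A quick numeric check of Eq (14.82), as a minimal sketch:)

    import numpy as np

    p_hat, n, z = 0.75, 100, 1.96
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
    print(f"{p_hat} +/- {margin:.3f}")   # 0.75 +/- 0.085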
We may now observe that, in adding a measure of precision to point estimates, the net result has appeared in the form of an interval within which the true parameter is expected to lie, with a certain pre-specified probability. This motivates the concept of interval estimation.

14.5 Interval Estimates

14.5.1 General Principles

Rearranging the probability statement in Eq (14.77), the expression

    X̄ − z_{α/2} σ/√n < μ < X̄ + z_{α/2} σ/√n        (14.83)

gives the interval within which we expect the true value μ to lie, with probability (1 − α). This provides a different way of estimating μ: an approach that combines, in one self-contained statement, the estimate and a probabilistic measure of its precision; it is called an interval estimate.
There are, therefore, two main aspects of an interval estimate:
1. The boundaries of the interval; and
2. The associated probability (usually termed the degree of confidence) that the specified random interval will contain the unknown parameter.
The interval estimators are the two statistics UL and UR used to determine the left and right boundaries respectively; the sampling distribution of the point estimator is what is used to determine them.
14.5.2 Mean of a Normal Population; σ Known

For a random sample of size n drawn from a normal population with known variance σ², the statistic

    Z = (X̄ − μ)/(σ/√n)        (14.84)

has a standard normal, N(0, 1), distribution. From this sampling distribution for the statistic X̄, we now obtain the following probability statement:

    P( −z_{α/2} < (X̄ − μ)/(σ/√n) < z_{α/2} ) = (1 − α)        (14.85)

as we did earlier in Eq (14.73), which converts to the interval shown in Eq (14.83), implying finally that the interval [X̄ ± z_{α/2}(σ/√n)] contains μ with probability (1 − α). Specifically for α = 0.05, the commonly used default value, z_{α/2} = 1.96, so that the resulting interval,

    CI = X̄ ± 1.96(σ/√n)        (14.86)

is therefore the 95% confidence interval for μ, the mean of a normal population estimated from a random sample of size n.
The general procedure for obtaining interval estimates for the mean of a normal population is therefore as follows:
1. Determine the point estimator (the sample average) and its distribution (a standard normal for the normalized average, which will include σ, the population variance, assumed known);
2. Determine the end points of an interval that will contain the unknown mean μ, with specified probability (typically (1 − α), with α = 0.05).
Example 14.9: INTERVAL ESTIMATES FOR MEANS OF PROCESS YIELDS
Given that the results of a series of 50 experiments performed on the chemical processes discussed in Chapter 1 constitute random samples from the respective populations for the yields, YA and YB, assume that these are two normal populations and obtain 95% confidence interval estimates for the population means μA and μB, given the respective population standard deviations as σA = 1.5 and σB = 2.5.
Solution:
From the supplied data, we obtain the sample averages as:

    ȳA = 75.52; ȳB = 72.47        (14.87)

and given the standard deviations and n = 50, we obtain the following interval estimates:

    μA = 75.52 ± 1.96(1.5/√50) = 75.52 ± 0.42        (14.88)

    μB = 72.47 ± 1.96(2.5/√50) = 72.47 ± 0.69        (14.89)

The implication is that, with 95% confidence, the mean yield for each process is characterized as follows: for process A, 75.10 < μA < 75.94; and for process B, 71.78 < μB < 73.16. As a preview of upcoming discussions, note that these two 95% confidence intervals do not overlap.
14.5.3 Mean of a Normal Population; σ Unknown

When the population standard deviation σ is unknown, it is replaced by the sample standard deviation S, in which case the statistic

    T = (X̄ − μ)/(S/√n)        (14.91)

has a t-distribution with ν = n − 1 degrees of freedom, from which we obtain the corresponding (1 − α) × 100% confidence interval

    X̄ − t_{α/2}(n − 1) S/√n < μ < X̄ + t_{α/2}(n − 1) S/√n        (14.93)

Example 14.10: INTERVAL ESTIMATES FOR MEANS OF PROCESS YIELDS: UNKNOWN POPULATION VARIANCES
Repeat the problem in Example 14.9 and obtain 95% confidence interval estimates for the population means μA and μB, still assuming that the data came from normal populations, but with unknown standard deviations.
Solution:
As obtained in the earlier example, the sample averages remain:

    ȳA = 75.52; ȳB = 72.47        (14.94)

The sample standard deviations computed from the data are:

    sA = 1.43        (14.95)

    sB = 2.76        (14.96)

and with t0.025(49) = 2.01, the interval estimates are:

    μA = 75.52 ± 2.01(1.43/√50) = 75.52 ± 0.41        (14.97)

    μB = 72.47 ± 2.01(2.76/√50) = 72.47 ± 0.78        (14.98)

The 95% confidence intervals for the mean yield for each process are now as follows: for Process A, (75.11 < μA < 75.93); and for Process B, (71.69 < μB < 73.25).
Note that these interval estimates are really not that different from those obtained in Example 14.9; in fact, the estimates for μA are virtually identical. There are two reasons for this: first, and foremost, the sample estimates of the population standard deviation, sA = 1.43 and sB = 2.76, are fairly close to the corresponding population values σA = 1.5 and σB = 2.5; second, the sample size n = 50 is sufficiently large that the difference between the t-distribution and the standard normal is almost negligible (observe that z0.025 = 1.96 is only 2.5% less than t0.025(49) = 2.01).
Again, note that these two 95% confidence intervals also do not overlap.
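(A minimal scipy sketch of the interval computations in Examples 14.9 and 14.10; the averages and standard deviations are those quoted in the text.)

    import numpy as np
    from scipy import stats

    n = 50
    ybar  = {"A": 75.52, "B": 72.47}
    sigma = {"A": 1.5,  "B": 2.5}     # known sigmas (Example 14.9)
    s     = {"A": 1.43, "B": 2.76}    # sample estimates (Example 14.10)

    z = stats.norm.ppf(0.975)         # 1.96
    t = stats.t.ppf(0.975, df=n - 1)  # 2.01

    for k in ("A", "B"):
        hz = z * sigma[k] / np.sqrt(n)    # half-width, sigma known
        ht = t * s[k] / np.sqrt(n)        # half-width, sigma unknown
        print(k, (ybar[k] - hz, ybar[k] + hz), (ybar[k] - ht, ybar[k] + ht))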
14.5.4 Variance of a Normal Population

Obtaining interval estimates for the variance of a normal population follows the same principles outlined above: obtain the sampling distribution of an appropriate statistic (the estimator) and use it to make probabilistic statements about an interval that is expected to contain the unknown parameter. In the case of the population variance, the estimator is:

    S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1)        (14.99)

and the sampling distribution is obtained from the fact that

    (n − 1)S²/σ² ~ χ²(n − 1)        (14.100)

From the two-sided tail area probabilities of the χ²(n − 1) distribution, the resulting (1 − α) × 100% confidence interval is

    (n − 1)S²/χ²_{α/2}(n − 1) < σ² < (n − 1)S²/χ²_{1−α/2}(n − 1)        (14.102)
FIGURE 14.3: A chi-square distribution with (n − 1) = 9 degrees of freedom, showing two-sided tail area probabilities of α/2 = 0.025 at 2.70 and 19.0

For the process yield data of Example 14.10 (n = 50), the sample variances are

    sA² = 1.43² = 2.04; sB² = 2.76² = 7.62        (14.104)

From here, using Eq (14.102), with χ²_{0.025}(49) = 70.22 and χ²_{0.975}(49) = 31.55, the following interval estimates for the variances are obtained:

    1.43 < σA² < 3.18        (14.105)

    5.33 < σB² < 11.85        (14.106)

or, upon taking square roots, the interval estimates for the standard deviations are:

    1.20 < σA < 1.78        (14.107)

    2.31 < σB < 3.44        (14.108)
The results discussed above for interval estimates of single means and variances from normal populations have implications for hypothesis testing, as we show in the next chapter. They can be extended to interval estimates of the differences between the means of two normal populations, as we will now do; this also has implications for hypothesis testing.
14.5.5 Difference Between Two Normal Population Means

Consider two mutually independent random samples: X1, X2, ..., Xn from a normal N(μX, σX²) population, and Y1, Y2, ..., Ym from a normal N(μY, σY²) population. The objective is to characterize

    μ_{X−Y} = μX − μY        (14.109)

the difference between the two population means, by obtaining a point estimate along with a confidence interval.
If X̄ is the MLE for μX, and Ȳ is the MLE for μY, then it is straightforward to show (see Exercise 14.27) that D̄, defined as:

    D̄ = X̄ − Ȳ        (14.110)

is the MLE for μ_{X−Y}; it is also unbiased. And now, to obtain the interval estimate for D̄, we need its sampling distribution, which is obtained as follows: we know from previous results that X̄ ~ N(μX, σX²/n) and Ȳ ~ N(μY, σY²/m); and from results about distributions of sums of Gaussian random variables, we now obtain that

    D̄ ~ N(μ_{X−Y}, v²)        (14.111)

where:

    μ_{X−Y} = μX − μY        (14.112)

    v² = σX²/n + σY²/m        (14.113)

by independence. And now, if σX² and σY² are known (so that v² is also known), then observe that the statistic (D̄ − μ_{X−Y})/v has a standard normal distribution. Thus

    [ (X̄ − Ȳ) − (μX − μY) ] / √(σX²/n + σY²/m) ~ N(0, 1)        (14.114)

so that

    P( −z_{α/2} < [ (X̄ − Ȳ) − (μX − μY) ] / √(σX²/n + σY²/m) < z_{α/2} ) = 1 − α        (14.115)

from which we obtain the (1 − α) × 100% confidence interval:

    μ_{X−Y} = (X̄ − Ȳ) ± z_{α/2} √(σX²/n + σY²/m)        (14.116)

The next example illustrates this result.
Example 14.12: INTERVAL ESTIMATES FOR DIFFERENCE BETWEEN TWO PROCESS YIELD MEANS
Obtain a 95% confidence interval estimate for the difference between the population means μA and μB for the process yield data in Example 14.9, given the respective population standard deviations as σA = 1.5 and σB = 2.5.
Solution:
Since ȳA = 75.52 and ȳB = 72.47, so that d̄ = ȳA − ȳB = 3.05, the desired 95% confidence interval is obtained from Eq (14.116) as

    μ_{A−B} = 3.05 ± 1.96 √((2.25 + 6.25)/50) = 3.05 ± 0.81        (14.117)

Thus, with 95% confidence, we expect 2.24 < μ_{A−B} < 3.86.

The result of this example foreshadows part of the upcoming discussion in the next chapter on hypothesis testing. For now we simply note the most obvious implication: it is highly unlikely that the mean of the yield obtainable from process A is the same as that from process B; in fact, the evidence seems to support the postulate that the mean yield obtainable from process A is greater than that from process B, by as little as 2.24 and possibly by as much as 3.86.
This example also sheds light in general on how we can use the interval estimate of the difference between two means to assess the equality of two normal population means:
1. If the interval estimate for μX − μY contains the number 0, the implication is that μX and μY are very likely equal;
2. If the interval estimate for μX − μY lies entirely to the right of 0, the implication is that, very likely, μX > μY; and finally,
3. If the interval estimate for μX − μY lies entirely to the left of 0, the implication is that, very likely, μX < μY.
When the population variances, σX² and σY², are unknown, things become quite complicated in general, especially when n ≠ m. Under these circumstances, it is customary to use

    μ_{X−Y} = (X̄ − Ȳ) ± z_{α/2} √(SX²/n + SY²/m)        (14.118)

for large samples.
When σX² and σY² are unknown but equal to σ², we can use the t-distribution as we have done previously to obtain

    μ_{X−Y} = (X̄ − Ȳ) ± t_{α/2}(ν) Sp √(1/n + 1/m)        (14.119)

where ν = n + m − 2 is the number of degrees of freedom, and Sp,

    Sp = √( [(n − 1)SX² + (m − 1)SY²] / (n + m − 2) )        (14.122)

is the "pooled" sample standard deviation, the positive square root of a weighted average of the two sample variances.
14.5.6 Interval Estimates for Parameters from Other Populations

While the most readily available results for interval estimates are for samples from Gaussian populations, it is still possible to obtain interval estimates for parameters from non-Gaussian populations. One simply needs to remember that the key to interval estimation is the sampling distribution of the estimator. If we are able to obtain the appropriate sampling distribution, it can be used to make the sort of probabilistic statements on which interval estimates are based.

Means; Large Samples
Fortunately, when sample sizes are large, it is possible to invoke the central limit theorem to determine that, regardless of the underlying distribution (Gaussian or not), (X̄ − μ)/(σ/√n) possesses an approximate N(0, 1) distribution, with the approximation improving as n → ∞. Furthermore, even if σ is unknown (as is usually the case in most problems of practical relevance), the large sample size makes it acceptable to approximate σ with S, the sample standard deviation. Thus, no new results are required under these circumstances. The following example illustrates this point.
Example 14.13: INTERVAL ESTIMATE FOR MEAN OF INCLUSIONS DATA
The number of inclusions found on glass sheets produced in the manufacturing process discussed in Chapter 1 has been identified as a Poisson random variable with parameter λ. If the data in Table 1.2 is considered a random sample of size n = 60, with sample average x̄ = 1.02 and sample standard deviation s = 1.1, obtain a 95% confidence interval estimate for λ.
Solution:
Invoking the central limit theorem, with S approximating σ, we obtain

    λ = 1.02 ± 1.96(1.1/√60) = 1.02 ± 0.28        (14.123)

so that, with 95% confidence, we can expect the true mean number of inclusions found on the glass sheets made in this manufacturing site to be characterized as: 0.74 < λ < 1.30.
FIGURE 14.4: Sampling distribution (Gamma(10, 0.1)) with two-sided tail area probabilities of 0.025 for X̄/β, based on a sample of size n = 10 from an exponential population
Consider again the exponential random variable of Example 14.3. From the results of Example 13.2, the sample mean X̄, defined as

    X̄ = (1/n) Σ_{i=1}^{n} Xi        (14.125)

has the Gamma distribution γ(n, β/n). However, note that this pdf depends on the unknown parameter β and can therefore not be used, as is, to make probabilistic statements. On the other hand, by scaling with β, we see that

    X̄/β ~ γ(n, 1/n)        (14.126)

a pdf that now depends only on the sample size, n. (This is directly analogous to the t-distribution, which depends only on the degrees of freedom, (n − 1).)
And now, for the specific case with n = 10, we obtain from the Gamma(10, 0.1) distribution the following:

    P( 0.48 < X̄/β < 1.71 ) = 0.95        (14.127)

(see Fig 14.4), where the values for the interval boundaries are obtained from MINITAB using the inverse cumulative probability feature.
For the specific case of the first year data with x̄ = 30.1, the expression in Eq (14.127) may then be rearranged to yield the 95% confidence interval:

    17.6 < β1 < 62.71        (14.128)

and for the second year data, with x̄ = 32.9,

    19.24 < β2 < 68.54        (14.129)
FIGURE 14.5: Sampling distribution (Gamma(100, 0.01)) with two-sided tail area probabilities of 0.025 for X̄/β, based on a larger sample of size n = 100 from an exponential population
First, note the asymmetry in these intervals in relation to the respective values for the point estimates, x̄; this should not come as a surprise, since the Gamma distribution is skewed to the right. Next, observe that these intervals are quite wide; this is primarily due to the relatively small sample size of 10. Finally, observe that the two intervals overlap considerably, suggesting that the two estimates may not be different at all; the difference of 2.8 from year 1 to year 2 is more likely due to random variation than to any actual systemic improvement.
With the larger sample size of n = 100, the corresponding Gamma(100, 0.01) distribution (Fig 14.5) gives, for the first year data,

    X̄/1.21 < β1 < X̄/0.814        (14.130)

    = 24.88 < β1 < 36.98        (14.131)

Not surprisingly, these intervals are considerably tighter than the corresponding ones obtained for the smaller sample size n = 10.
Alternatively, for a sample size as large as n = 100, the central limit theorem may be invoked directly, approximating σ with the sample standard deviations (s1 = 23.17 and s2 = 27.51) to obtain:

    β1 = 30.1 ± 1.96(23.17/√100) = 30.1 ± 4.54        (14.132)

    β2 = 32.9 ± 1.96(27.51/√100) = 32.9 ± 5.39        (14.133)

which, when written in the same form as in Eqs (14.130) and (14.131), yields

    25.56 < β1 < 34.64        (14.134)

    27.51 < β2 < 38.29        (14.135)
14.6 Bayesian Estimation

14.6.1 Background

In particular, any available prior information about the unknown parameter (now considered a random vector with its own pdf) can and should be used in conjunction with sample data in providing parameter estimates.
This approach is known as Bayesian Estimation. Its basis is the fundamental relationship between joint, conditional and marginal pdfs, which we may recall from Chapter 4 as:

    f(x|y) = f(x, y)/f(y)        (14.136)

from which one obtains the following important result:

    f(x, y) = f(x|y)f(y) = f(y|x)f(x)        (14.137)
14.6.2
Basic Concept
$$f(x_1, x_2, \ldots, x_n|\theta) = \prod_{i=1}^{n} f(x_i|\theta) \qquad (14.139)$$
This is the conditional pdf of the data conditioned on θ; for any given value of θ, this expression provides the probability of jointly observing the data {x₁, x₂, …, xₙ} in the discrete case. For the continuous case, it is the density function to be used in computing the appropriate probabilities. (Recall that we earlier referred to this same expression as the likelihood function L(θ) in Eq (14.40).)
Now, if Θ is considered a random variable for which θ is just one possible realization, then in trying to determine θ, what we desire is the conditional probability of θ given the data, i.e., f(θ|x₁, x₂, …, xₙ), the reverse of Eq (14.139). This is obtained by invoking Bayes' Theorem:
$$f(\theta|x_1, x_2, \ldots, x_n) = \frac{f(x_1, x_2, \ldots, x_n|\theta)f(\theta)}{f(x_1, x_2, \ldots, x_n)} \qquad (14.140)$$
where the posterior pdf is assembled from the sampling distribution (likelihood), the prior, and the marginal pdf of the data:
$$\underbrace{f(\theta|x_1, x_2, \ldots, x_n)}_{\mathrm{POSTERIOR}} = \frac{f(x_1, x_2, \ldots, x_n|\theta)\,f(\theta)}{f(x_1, x_2, \ldots, x_n)} \qquad (14.141)$$
14.6.3
The primary result of Bayesian estimation is f(θ|x), the posterior pdf for θ conditioned upon the observed data vector x; no point or interval estimates are given directly. However, since f(θ|x) is a full pdf, both point and interval estimates are easily obtained from it. For example, the mean, median, mode, or, for that matter, any reasonable quantile of f(θ|x) can be used as a point estimate; and any interval, $q_{1-\alpha/2} < \theta < q_{\alpha/2}$, encompassing an area of probability (1 − α), can be used as an interval estimator. In particular,
1. The mean of the posterior pdf,
$$E[\theta|X_1, X_2, \ldots, X_n] = \int_{-\infty}^{\infty} \theta f(\theta|x_1, x_2, \ldots, x_n)\,d\theta \qquad (14.142)$$
1. Begin by specifying a prior distribution, f(θ), a summary of prior knowledge about the unknown parameters θ;
2. Obtain sample data in the form of a random sample X₁, X₂, …, Xₙ, and hence the sampling distribution (the joint pdf for these n independent random variables with identical pdfs f(xᵢ; θ));
3. From Eq (14.141), obtain the posterior pdf, f(θ|x₁, x₂, …, xₙ); if needed, determine C such that f(θ|x₁, x₂, …, xₙ) is a true pdf, i.e.,
$$\frac{1}{C} = \int f(\theta)f(x_1, x_2, \ldots, x_n|\theta)\,d\theta \qquad (14.143)$$
4. If point estimates are required, obtain these from the posterior distribution, as in the sketch below.
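The following is a minimal numerical sketch of the four steps for a binomial sample with a uniform prior; the grid resolution and variable names are illustrative, and numpy is assumed to be available.

```python
# The four-step Bayesian procedure on a parameter grid.
import numpy as np

n, x = 10, 3                                    # trials and observed successes
theta = np.linspace(0.001, 0.999, 999)          # grid over the parameter

prior = np.ones_like(theta)                     # Step 1: f(theta), uniform
likelihood = theta**x * (1 - theta)**(n - x)    # Step 2: f(x1..xn | theta)
unnorm = prior * likelihood                     # Step 3: numerator of Eq (14.141)
posterior = unnorm / np.trapz(unnorm, theta)    # ... normalized to a true pdf

# Step 4: point estimates from the posterior
bayes = np.trapz(theta * posterior, theta)      # posterior mean ~ (x+1)/(n+2)
map_est = theta[np.argmax(posterior)]           # posterior mode ~ x/n
print(bayes, map_est)                           # ~0.333, ~0.300
```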
14.6.4 A Simple Illustration
CASE II: In addition to the known range, there is also prior information, for example, from a similar process for which, on average, the probability of success is 0.4, with a variability captured by a variance of 1/25. Under these circumstances, again the maximum entropy results of Chapter 10 suggest a beta distribution, with parameters (α, β) determined from the prescribed mean and variance. From the expressions for the mean and variance of the Beta(α, β) random variable given in Chapter 9, we are able to solve the two equations simultaneously to obtain
$$\alpha = 2; \quad \beta = 3 \qquad (14.146)$$
so that the prior pdf in this case is
$$f(\theta) = 12\theta(1-\theta)^2; \quad 0 < \theta < 1 \qquad (14.147)$$
For CASE I, combining the uniform prior with the sampling distribution yields a posterior proportional to $\theta^x(1-\theta)^{n-x}$ (Eq (14.148)), a Beta pdf with parameters
$$\alpha^* = x + 1; \quad \beta^* = n - x + 1 \qquad (14.149)$$
This being the case, the multiplying constant in Eq (14.148) must therefore be $\Gamma(\alpha^*+\beta^*)/\Gamma(\alpha^*)\Gamma(\beta^*)$. And because n and x are both integers in this problem, we are able to use the factorial representation of the Gamma function (see Eq (9.25)) to obtain the complete posterior pdf as:
$$f(\theta|x) = \frac{(n+1)!}{x!(n-x)!}\,\theta^x(1-\theta)^{n-x} \qquad (14.150)$$
$$\binom{n}{x} = \frac{n!}{x!(n-x)!} \qquad (14.152)$$
so that
$$(n+1)\binom{n}{x} = \frac{(n+1)!}{x!(n-x)!} \qquad (14.153)$$
as implied in Eq (14.150).
We may now choose to leave the result as the pdf in Eq (14.150) and use it to make probabilistic statements about the parameter; alternatively, we can determine point estimates from it. For example, the Bayes estimate, $\hat{\theta}_B$, defined as $E[\theta|X]$, is obtained from Eq (14.150) as:
$$\hat{\theta}_B = \frac{x+1}{n+2} \qquad (14.154)$$
where the result is obtained immediately by virtue of the posterior pdf being a Beta distribution with parameters given in Eq (14.149); or else (the hard way!) by computing the required expectation via direct integration. On the other hand, the MAP estimate, $\hat{\theta}_{MAP}$, is obtained by finding the maximum (mode) of the posterior pdf; from Eq (14.150) one obtains this as:
$$\hat{\theta}_{MAP} = \frac{x}{n} \qquad (14.155)$$
For the specific experimental outcome of x = 3 successes in n = 10 trials, these become:
$$\hat{\theta}_B = \frac{4}{12} = \frac{1}{3}; \quad \hat{\theta}_{MAP} = \frac{3}{10} \qquad (14.156)$$
Thus, using a uniform prior distribution for f(θ) produces a Bayes estimate of 1/3, compared to a MAP estimate of 0.3 (which coincides with the standard MLE).
Note that in CASE I, the prior distribution is somewhat uninformative and non-subjective, in the sense that it showed no preference for any value of θ à-priori. Note also that since x/n is known to be an unbiased estimate for θ, $\hat{\theta}_B$ in Eq (14.154) is biased. However, it can be shown (see Exercises 14.8, 14.34 and 14.33) that the variance of $\hat{\theta}_B$ is always less than that of the unbiased MLE, $\hat{\theta} = x/n$. Thus, the Bayes estimate may be biased, but it is more efficient.
CASE II is different. The possible values of θ are not assigned equal à-priori probability; the à-priori probability specified in Eq (14.147) definitely favors some values over others. We shall return shortly to the obvious issue
of subjectivity that this approach raises. For now, with the pdf given in Eq (14.147), the resulting posterior pdf is:
$$f(\theta|x) = C\,12\binom{n}{x}\,\theta^{x+1}(1-\theta)^{n-x+2} \qquad (14.157)$$
It is straightforward to establish (see Exercise 14.33) that the final posterior pdf is given by
$$f(\theta|x) = \frac{(n+4)!}{(x+1)!(n-x+2)!}\,\theta^{x+1}(1-\theta)^{n-x+2} \qquad (14.158)$$
so that the Bayes estimate and the MAP estimate in this case are given respectively as:
$$\hat{\theta}_B = \frac{x+2}{n+5} \qquad (14.159)$$
and
$$\hat{\theta}_{MAP} = \frac{x+1}{n+3} \qquad (14.160)$$
Again, with the specific experimental outcome of 3 successes in n = 10 trials, we obtain the following as the CASE II estimates:
$$\hat{\theta}_B = 5/15 = 0.33; \quad \hat{\theta}_{MAP} = 4/13 = 0.31 \qquad (14.161)$$
14.6.5 Discussion
networks have their own characteristics and properties, ignoring earlier studies
of previous generations as if they did not exist is not considered good engineering practice. Whatever relevant prior information is available from previous
studies should be incorporated into the current analysis. Many areas of theoretical and applied sciences (including engineering) advance predominantly
by building on prior knowledge, not by ignoring available prior information.
Still, the possibility remains that the subjectivity introduced into data analysis by the choice of the prior distribution, f(θ), could dominate the objective information contained in the data. The counter-argument is that the Bayesian approach is actually completely transparent in how it distinctly separates out each component of the entire data analysis process (what is subjective and what is objective), making it possible to assess, in an objective manner, the influence of the prior information on the final result. It also allows room for adaptation, in light of additional objective information.
Recursive Bayesian Estimation
It is in the latter sense noted above that the Bayesian approach provides
its most compelling advantage: recursive estimation. Consider a case that is
all too common in the chemical industry in which the value of a process variable, say viscosity of a polymer material, is to be determined experimentally.
The measured value of the process variable is subject to random variation
by virtue of the measurement device characteristics, but also intrinsically. In
particular, the true value of such a variable changes dynamically (i.e. from
one time instant to the next, the value will change because of dynamic operating conditions). If the objective is to estimate the current value of such a
variable, the frequentist approach as discussed in the earlier parts of this
chapter, is to obtain a random sample of size n from which the true value
will be estimated. Unfortunately, because of dynamics, only a single value is
obtainable at any point in time, tk , say x(tk ); at the next sampling point, the
observed value, x(tk+1 ) is, technically speaking, not the same as the previous
value, and in any event, the two can hardly be considered as independent of
each other. There is no realistic frequentist solution to this problem. However,
by postulating a prior distribution, one can obtain a posterior distribution
on the basis of a single data point; the resulting posterior distribution, which
now incorporates the information contained in the just-acquired data, can
now be used as the prior distribution for the next round. In this recursive
strategy, the admittedly subjective prior employed as an initial condition
is ultimately washed out of the system with progressive addition of objective,
but time-dependent data. A discussion of this type of problem is included in
the application case studies of Chapter 20.
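A toy sketch of the recursive idea, with each posterior becoming the next prior; the Bernoulli data stream and the Beta(1, 1) starting point below are purely illustrative:

```python
# Recursive Bayesian updating: posterior at step k = prior at step k+1.
a, b = 1, 1                                   # uniform Beta prior as "initial condition"
for obs in [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]:    # hypothetical 0/1 observations, one at a time
    a, b = a + obs, b + (1 - obs)             # conjugate update; new posterior
    print(f"posterior Beta({a},{b}); mean = {a/(a+b):.3f}")
```

As more data arrive, the influence of the subjective starting point shrinks relative to the accumulated counts, which is precisely the "washing out" described above.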
Choosing Prior Distributions
The reader may have noticed that the choice of a Beta distribution as the prior distribution, f(θ), for the Binomial/Bernoulli probability of success, p, is no accident; placing the two pdfs side by side,
$$f(x|\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x} \qquad (14.162)$$
$$f(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} \qquad (14.163)$$
where, even though the first is a function of the random variable, X, and the second is a function of the parameter, θ, the two pdfs are seen to have what is called a conjugate structure: multiplying one by the other results in a posterior pdf where the conjugate structure is preserved. The Beta pdf is therefore said to provide a conjugate prior for the Binomial sampling distribution. The advantage of employing conjugate priors is therefore clear: it simplifies the computational work involved in determining the posterior distribution.
For the Poisson P(λ) random variable with unknown parameter θ = λ, the conjugate prior is the Gamma distribution. Arranged side-by-side, the nature of the conjugate structure becomes obvious:
$$f(x|\lambda) = \frac{e^{-\lambda}\lambda^x}{x!}; \quad x = 0, 1, 2, \ldots \qquad (14.164)$$
$$f(\lambda) = \frac{1}{\beta^\alpha\Gamma(\alpha)}\,e^{-\lambda/\beta}\lambda^{\alpha-1}; \quad 0 < \lambda < \infty \qquad (14.165)$$
distributions. In many cases, the only option is to obtain the required posterior distributions (as well as point estimates) numerically. Until recently, the computational burden of numerically determining posterior distributions for practical problems constituted a considerable obstacle to the application of Bayesian techniques in estimation. With the introduction of Markov Chain Monte Carlo (MCMC)¹ techniques, however, this computational issue has essentially been resolved. There are now commercial software packages for carrying out such numerical computations quite efficiently.
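To indicate what the MCMC route looks like, here is a bare-bones random-walk Metropolis sampler; for concreteness the (un-normalized) log-posterior is the CASE I binomial example, but in practice it would be whatever non-conjugate model is at hand. The proposal width and iteration counts are illustrative tuning choices.

```python
# Minimal Metropolis sampler for drawing from a posterior numerically.
import math, random

n, x = 10, 3
def log_post(theta):                      # log(prior x likelihood), un-normalized
    if not 0 < theta < 1:
        return -math.inf
    return x * math.log(theta) + (n - x) * math.log(1 - theta)

theta, draws = 0.5, []
for _ in range(20000):
    prop = theta + random.gauss(0, 0.1)   # random-walk proposal
    if math.log(random.random()) < log_post(prop) - log_post(theta):
        theta = prop                      # accept the proposal
    draws.append(theta)

burned = draws[2000:]                     # discard burn-in
print(sum(burned) / len(burned))          # ~0.33, the Bayes estimate
```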
14.7 Summary and Conclusions
REVIEW QUESTIONS
1. The objective of this chapter is to provide answers to what sorts of questions?
2. What is estimation?
3. What are the two aspects of estimation discussed in this chapter?
4. What is an estimator?
5. What is a point estimate and how is it different from an estimator?
6. What is an interval estimator?
7. What is an interval estimate and how is it different from an interval estimator?
8. What is the mathematical definition of an unbiased estimator?
9. What makes unbiasedness an intuitively appealing criterion for selecting estimators?
10. Mathematically, what does it mean that one estimator is more efficient than another?
11. What is the mathematical definition of a consistent sequence of estimators?
12. What is the basic principle behind the method of moments technique for obtaining point estimates?
13. Are method of moments estimators unique?
14. What is the likelihood function and how is it differentiated from the joint pdf of a random sample?
15. What is the log-likelihood function? Why is it often used in place of the likelihood function in obtaining point estimates?
16. Are maximum likelihood estimates always unbiased?
17. What is the invariance property of maximum likelihood estimators?
18. What are the asymptotic properties of maximum likelihood estimators?
19. What is needed in order to quantify the precision of point estimates?
20. What are the two main components of an interval estimate?
21. What is the general procedure for determining interval estimates for the mean of a normal population with σ known?
22. What is the difference between interval estimates for the mean of a normal population when σ is known and when σ is unknown?
23. What probability distribution is used to obtain interval estimates for the variance of a normal population?
24. Why is the confidence interval around the point estimate of a normal population variance not symmetric?
25. How can interval estimates of the difference between two normal population means be used to assess the equality of these means?
26. How does one obtain interval estimates for parameters from other non-Gaussian populations when sample sizes are large and when they are small?
27. What are the distinguishing characteristics of Bayesian estimation?
28. What is a prior distribution, f(θ), and what role does it play in Bayesian estimation?
29. Apart from the prior distribution, what other types of probability distributions are involved in Bayesian estimation, and how are they related?
30. What is the primary result of Bayesian estimation?
31. What is the Bayes estimator? What is the maximum à-posteriori estimator?
32. What are some of the controversial aspects of Bayesian estimation?
33. What is recursive Bayesian estimation?
34. What is a conjugate prior distribution?
35. What are some of the computational issues involved in the practical application of Bayesian estimation, and how have they been resolved?
EXERCISES
Section 14.1
14.1 Given a random sample X₁, X₂, …, Xₙ from a Gaussian N(μ, σ²) population with unknown parameters, it is desired to use the sample mean, X̄, and the sample variance, S², to determine point estimates of the unknown parameters; $\bar{X} \pm 2\sqrt{S^2/n}$ is to be used to determine an interval estimate. The following data set was obtained for this purpose:
{9.37, 8.86, 11.49, 9.57, 9.15, 9.10, 10.26, 9.87}
(i) In terms of the random sample, what is the estimator for μ; and in terms of the supplied data, what is the point estimate for μ?
(ii) What are the boundary estimators, U_L and U_R, such that (U_L, U_R) is an interval estimator for μ? What is the interval estimate?
14.2 Refer to Exercise 14.1.
(i) In terms of the random sample, what is the estimator for σ², and what is the point estimate?
(ii) Given the boundary estimators for σ² as:
$$U_L = \frac{(n-1)S^2}{C_L}; \quad U_R = \frac{(n-1)S^2}{C_R}$$
$$M_1 = \frac{1}{n}\sum_{i=1}^{n} X_i; \quad M_2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2$$
(i) In terms of the supplied information, what is the estimator for $A = \ln \bar{X}_g$ and for $\Theta = (\bar{X}_g)^n$?
(ii) Determine point estimates for $\bar{X}_g$, A, $e^A$ and for $\bar{X}$, the sample mean, from the following sample data:
{7.91, 5.92, 4.53, 33.26, 24.13, 5.42, 16.96, 3.93}
Section 14.2
14.5 Consider a random sample X₁, X₂, …, Xₙ.
(i) If the sample is from a Lognormal population, i.e., X ~ L(α, β), so that E(ln X) = α, and if the sample geometric mean is defined as
$$\bar{X}_g = \left(\prod_{i=1}^{n} X_i\right)^{1/n}$$
show that this estimator is unbiased for 1/β, but that the estimator 1/X̄ is not unbiased; also show that
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 \qquad (14.167)$$
is unbiased for σ².

14.8 Given the number of successes, X, in n trials from a binomial population, the estimator θ̂ = X/n is unbiased for θ. Consider a second estimator defined as:
$$\tilde{\theta} = \frac{X+1}{n+2}$$
(i) Show that θ̃ is biased for θ, and determine the bias B(θ̃), as defined in Eq (14.166) in Exercise 14.7.
(ii) Let $V_{\hat{\theta}}$ and $V_{\tilde{\theta}}$ represent the variances of the estimators θ̂ and θ̃, respectively. Show that
$$V_{\tilde{\theta}} = \left(\frac{n}{n+2}\right)^2 V_{\hat{\theta}}$$
and hence establish that $V_{\tilde{\theta}} < V_{\hat{\theta}}$, so that the biased estimator θ̃ is more efficient than the unbiased estimator θ̂, especially for small sample sizes.
14.9 Given a random sample X₁, X₂, …, Xₙ from a population with mean μ and variance σ², define two statistics as follows:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i; \quad \bar{X}_w = \sum_{i=1}^{n} \omega_i X_i, \ \mathrm{with}\ \sum_{i=1}^{n} \omega_i = 1$$
It was shown in the text that both statistics are unbiased for μ; now show that X̄, a special case of X̄_w, is the more efficient estimator of μ.

14.10 Given a random sample X₁, X₂, …, Xₙ from a general population with finite mean μ and finite variance σ², show that the sample mean, $\bar{X} = (\sum_{i=1}^{n} X_i)/n$, is consistent for μ regardless of the underlying population. (Hint: Invoke the Central Limit Theorem.)
Section 14.3
14.11 Given a random sample X₁, X₂, …, Xₙ from a Poisson P(λ) population, obtain an estimate for the population parameter, λ, on the basis of the second moment estimator,
$$M_2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2$$
and show that this estimate, $\hat{\lambda}_2$, is explicitly given by:
$$\hat{\lambda}_2 = -\frac{1}{2} + \frac{1}{2}\sqrt{4M_2 + 1}$$
14.12 Refer to Exercise 14.11 and consider the following sample of size n = 20 from a Poisson P(2.10) population:
3  4  0  1  1  2  2  3  3  2
2  1  4  3  2  5  1  0  2  2
1   1   6   1   2   5   2   14   9   2
14  2   2   6   1   3   2   1    1   1
(ii) Obtain the harmonic mean of the given data. As an estimate of the population parameter p = 0.25, how does this estimate compare with the two method of moments estimates for p?
14.15 In terms of the first two moments, M₁ and M₂, obtain two separate method of moments estimators for the unknown parameter β in the exponential E(β) distribution. From the following data sampled from an exponential E(4) population, determine numerical values of the two different point estimates and compare them to the true population value.
6.99  0.52  10.36  5.75  2.84  0.67  1.66  0.12   0.41  2.72
3.26  6.51   3.75  5.22  1.78  4.05  2.16  16.65  1.31  1.52
14.16 On the basis of the first two moments, M₁ and M₂, determine method of moments estimates for the two parameters in the Beta B(α, β) distribution.
14.17 Use the first two moments, M₁ and M₂, to determine two separate estimators for the single Rayleigh R(b) distribution parameter.
14.18 On the basis of the first two moments, M₁ and M₂, determine method of moments estimates for the two parameters in the Gamma γ(α, β) distribution.
14.19 Show that the likelihood function for the binomial random variable given in Eq (14.66), i.e.,
$$L(p) = \binom{n}{X}p^X(1-p)^{n-X}$$
is maximized when p̂ = X/n, and hence establish the result stated in Eq (14.67).
14.20 Let X₁, X₂, …, Xₙ be a random sample from a geometric population with unknown parameter p. Obtain the maximum likelihood estimate (MLE), p̂. Show that
$$E(\hat{p}) \neq p$$
so that p̂ is in fact biased for p, but
$$E\left(\frac{1}{\hat{p}}\right) = \frac{1}{p}$$
14.33 Refer to the CASE II posterior pdf given in Eq (14.157), i.e.,
$$f(\theta|x) = C\,12\binom{n}{x}\,\theta^{x+1}(1-\theta)^{n-x+2}$$
(i) Show that, in final form, with the constant C evaluated, this posterior distribution is:
$$f(\theta|x) = \frac{(n+4)!}{(x+1)!(n-x+2)!}\,\theta^{x+1}(1-\theta)^{n-x+2}$$
hence confirming Eq (14.158). (Hint: Exploit the structural similarity between this pdf and the Beta pdf.)
(ii) If x is the actual number of successes obtained in n trials, it is known that the estimate θ̂ = x/n is unbiased for θ. It was stated in the text that the Bayes and MAP estimates are, respectively,
$$\hat{\theta}_B = \frac{x+2}{n+5}; \quad \hat{\theta}_{MAP} = \frac{x+1}{n+3}$$
Show that these two estimates are both biased, but are both more efficient than θ̂. Which of the three is the most efficient?
14.34 Let X₁, X₂, …, Xₙ be a random sample from a Poisson distribution with unknown parameter λ.
(i) To estimate the unknown parameter with this sample, along with a Gamma γ(a, b) prior distribution for λ, first show that the posterior distribution f(λ|x) is a Gamma γ(a*, b*) distribution with
$$a^* = \sum_{i=1}^{n} X_i + a; \quad b^* = \frac{1}{n + \frac{1}{b}}$$
Hence show that the Bayes estimator, $\hat{\lambda}_B = E(\lambda|X)$, is a weighted sum of the sample mean, X̄, and the prior distribution mean, μ_p, i.e.,
$$\hat{\lambda}_B = w\bar{X} + (1-w)\mu_p$$
with the weight w = nb/(nb + 1).
(ii) Now show that:
APPLICATION PROBLEMS
14.36 A cohort of 100 patients under the age of 35 years (the "Younger" group), and another cohort of the same size, but 35 years and older (the "Older" group), participated in a clinical study where each patient received five embryos in an in-vitro fertilization (IVF) treatment cycle. The result from the Assisted Reproductive Technologies clinic where the study took place is shown in the table below. The data shows x, the number of live births per delivered pregnancy, along with how many in each group had the pregnancy outcome of x.
x (no. of live births      y_O (total no. of older      y_Y (total no. of younger
in a delivered             patients, out of 100,        patients, out of 100,
pregnancy)                 with pregnancy outcome x)    with pregnancy outcome x)
0                          32                           8
1                          41                           25
2                          21                           35
3                          5                            23
4                          1                            8
5                          0                            1
On the postulate that these data represent random samples from the binomial Bi(n, θ_O) population for the "Older" group, and Bi(n, θ_Y) for the "Younger" group, obtain 95% confidence interval estimates of both parameters, θ_O and θ_Y. Physiologically, these parameters represent the single embryo probability of "success" (i.e., of resulting in a live birth at the end of the treatment cycle) for the patients in each group. Comment on whether or not the results of this clinical study indicate that these cohort groups have different IVF treatment success rates, on average.
14.37 The number of contaminant particles (flaws) found on each standard size silicon wafer produced at a certain manufacturing site is a random variable, X. In order to characterize this random variable, a random sample of 30 silicon wafers selected by a quality control engineer and examined for flaws resulted in the data shown in the table below, a record of the number of flaws found on each wafer.
4  3  3  1  0  4  2  0  1  3
2  1  2  3  2  1  0  2  2  3
5  4  2  3  0  1  1  1  2  1
14.38 The data below show the number of safety incidents occurring per quarter (three months), over a 10-year period, at a DuPont company facility, separated into two periods: Period I is the first five-year period of the study; Period II, the second five-year period.
Period I:  5  4  2  5  6  5  10  8  5  7  3  8  6  9  6  5  10  3  3  10
Period II: 3  1  7  1  4  4  2   0  3  2  2  7  1  4  2  2  1   4  4  4
(i) Consider that the entire data set constitutes a random sample of size 40 from a single Poisson population with unknown parameter. Estimate the parameter with a 95% confidence interval.
(ii) Now consider the data for each period as representing two different random samples of size 20 each, from two different Poisson populations, with unknown parameters λ₁ for Period I and λ₂ for Period II. Estimate these parameters with separate, approximate 95% confidence intervals. Compare these two interval estimates and comment on whether or not these two populations are indeed different. If the populations appear different, what do you think may have happened between Period I and Period II at the DuPont company facility that was the site of this study?
14.39 An exotic flu virus with a long incubation period reappears every year during the long flu season. Unfortunately, there is only a probability p that an infected patient will show symptoms within the first month; as such, the early symptomatic patients constitute only the "leading edge" of the infected members of the population. Assuming, for the sake of simplicity, that once infected, the total number of infected individuals does not increase (the virus is minimally contagious), and assuming that all symptomatic patients eventually come to the same hospital, the following data was obtained over a period of five years by an epidemiologist working with the local hospital doctors. N_E is the number of early symptomatic patients; N_T is the total number of infected patients treated that year. The unknown probability p is to be determined from this data in order to enable doctors to prepare for the virus's reappearance the following year.
Year   N_E (Early Symptomatics)   N_T (Total Infected)
1      5                          7
2      3                          8
3      3                          10
4      7                          7
5      2                          8
(i) Why is this a negative binomial phenomenon? Determine the values of the negative binomial parameter k and the random variable X from this data set.
(ii) Obtain an expression for the maximum likelihood estimate of p in terms of a general random sample kᵢ, Xᵢ; i = 1, 2, …, n. Why is it not possible to use the method of moments to estimate p in this case?
(iii) Determine from the data an actual estimate, p̂. Use this estimate to generate a 7 × 7 table of probabilities f(x|k) for values of k = 1, 2, 3, 4, 5, 6, 7 and
x:                  0    1    2   3   4   5
Observed Frequency: 447  132  42  21  3   2
1.31  1.94  0.79  0.15  3.21  1.22  3.02  2.91  0.65  3.17
1.66  3.90  4.84  1.51  0.18  0.71  0.30  0.57  0.70  0.05
7.26  1.41  1.62  0.43  2.68  6.75  0.96  0.68  1.29  3.76
³ Greenwood, M. and G. U. Yule (1920). "An enquiry into the nature of frequency distributions representative of multiple happenings with particular reference to multiple attacks of disease or of repeated accidents." Journal of the Royal Statistical Society, 83:255-279.
19.2   9.0  17.2   8.2   4.5  13.5  20.7   7.9  19.5   8.8
18.7   7.4   9.7  13.7   8.1   8.4  10.8  15.1   5.3  12.0
 3.0  18.5   5.8   6.8  14.5   3.3  11.1  16.4   7.3   7.4
 7.3   5.2  10.2   3.1   9.6  12.9  17.3   6.0  24.3  21.3
19.3   2.5   9.1   8.1   9.8  15.4  15.7   8.2   8.8   7.2
12.8   4.2   4.2   7.8   9.5   3.9   8.7   5.9   5.3   1.8
10.1  10.0  18.7   5.6   3.3   7.3  11.3   2.9   5.4  15.2
 8.0  11.7  17.2   4.0   3.8   7.4   5.3  10.6  15.2  11.5
 5.9  20.1  12.2  12.0   8.8
$$f(x) = \frac{\alpha p^x}{x}; \quad 0 < p < 1; \ x = 1, 2, \ldots$$
where
$$\alpha = \frac{-1}{\ln(1-p)}$$
⁴ Fisher, R. A., S. Corbet, and C. B. Williams (1943). "The relation between the number of species and the number of individuals in a random sample of an animal population." Journal of Animal Ecology, 12:42-58.
x (no. of individuals):        1    2    3    4    5    6    7    8
Observed frequency (species):  118  74   44   24   29   22   20   19

x:                             9    10   11   12   13   14   15   16
Observed frequency:            20   15   12   14   6    12   6    9

x:                             17   18   19   20   21   22   23   24
Observed frequency:            9    6    10   10   11   5    3    3
It can be shown that for this random variable and its pdf,
$$E(X) = \alpha p/(1-p)$$
$$Var(X) = \alpha p(1-\alpha p)/(1-p)^2$$
Show that the MLE of the unknown population parameter p and the method of moments estimator based on the first moment coincide. Obtain an estimate of this population parameter for this data set. Compare the model prediction to the data.
14.44 The data in the table below, on the wall thickness (in inches) of cast aluminum cylinder heads used in aircraft engine cooling jackets, is from Mee (1990)⁵.
0.223  0.201  0.228  0.223  0.214  0.224
0.193  0.231  0.223  0.237  0.213  0.217
0.218  0.204  0.215  0.226  0.233  0.219
(i) Consider this as a random sample from a normal population with unknown parameters μ and σ². Determine 95% confidence interval estimates of both the mean and the variance of the wall thickness.
(ii) If a customer requires that these wall thicknesses be made to within the specifications 0.220 ± 0.030 ins, what is the probability that the manufacturer will meet these specifications? State any assumptions required in answering this question.
14.45 The intrinsic variation in the measured amount of pollution contained in water samples from rivers and streams in a mining region of the West Central United States is known to be a normally distributed random variable with a fairly stable standard deviation of 5 milligrams of solids per liter. As an EPA inspector who wishes to test a random selection of n water samples in order to determine the mean daily rate of pollution to within ±1 milligram per liter with 95% confidence, how many water samples will you need to take? A selection of 6 randomly selected water samples returned a mean value of 40 milligrams per liter and what seems like an excessive variance of 30 (mg/liter)². Determine a 95% confidence interval around this estimate of σ² and comment on whether or not this value is excessive compared to the assumed population value.
14.46 The time (in weeks) between occurrences of minor safety incidents in a certain facility is a random variable, X, with the exponential pdf
$$f(x) = \frac{1}{\beta}e^{-x/\beta}; \quad 0 < x < \infty$$
⁵ Mee, R. W. (1990). "An improved procedure for screening based on a correlated, normally distributed variable." Technometrics, 32, 331-337.
0.15  0.70  3.02  1.41  3.17  2.68  4.84  0.68
(ii) A sample of 30 silicon wafers was examined for flaws and the result (the number of flaws found on each wafer) is displayed in the table below.
3  4  3  1  0  4  2  0  1  3
2  1  2  0  2  1  0  2  2  0
5  2  2  3  0  1  1  1  2  1
Use a gamma γ(2, 2) distribution as the prior distribution for λ and obtain, using this data set:
(a) the maximum likelihood estimate;
(b) the Bayes estimate; and
(c) Repeat (a) and (b) using only the first 10 data points in the first row of the data table. Comment on how an increase in the number of available data points affects parameter estimation in this particular case.
14.48 Consider the case in which θ_k, the true value of a process variable at time instant k, is measured as y_k, i.e.,
$$y_k = \theta_k + \epsilon_k \qquad (14.168)$$
where ε_k, the measurement noise, is usually considered random. The standard procedure for obtaining a good estimate of θ_k involves taking repeated measurements and averaging.
However, many circumstances arise in practice when such a strategy is infeasible, primarily because of significant process dynamics. Under these circumstances, the process variable changes significantly with time during the period of repeated sampling; the repeated measurements thus provide information about the process variable at different time instants, and not true replicates of the desired measurement at a specific time instant, k. Decisions must therefore be made on the true value, θ_k, from the single available measurement, y_k: a non-standard problem which may be solved using the Bayesian approach as follows.
(i) Theory: Consider y_k as a realization of the random variable Y_k, possessing a normal N(θ_k, σ²) distribution with unknown mean θ_k and variance σ²; then consider that the (unknown) process dynamics can be approximated by the simple random walk model:
$$\theta_k = \theta_{k-1} + w_k \qquad (14.169)$$
where w_k, the process noise, is a sequence of independent realizations of the zero-mean Gaussian random variable, W, with variance v²; i.e., W ~ N(0, v²). This process dynamic model is equivalent to declaring that θ_k, the unknown true mean value of Y_k, has a prior pdf N(θ_{k−1}, v²); the measurement equation above, Eq (14.168), implies that the sampling distribution for Y_k is given by:
$$f(y_k|\theta_k) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(y_k - \theta_k)^2}{2\sigma^2}\right\}$$
Combine this with the prior distribution and obtain an expression for the posterior distribution f(θ_k|y_k), and show that the result is a Gaussian pdf with mean θ̂_k and variance σ̂² given by:
$$\hat{\theta}_k = \gamma\hat{\theta}_{k-1} + (1-\gamma)y_k \qquad (14.170)$$
with
$$\gamma = \frac{\sigma^2}{\sigma^2 + v^2} \qquad (14.171)$$
and
$$\hat{\sigma}^2 = \left(\frac{1}{\sigma^2} + \frac{1}{v^2}\right)^{-1} \qquad (14.172)$$
so that Eq (14.170) provides the recursive formula
$$\hat{\theta}_k = \gamma\hat{\theta}_{k-1} + (1-\gamma)y_k \qquad (14.173)$$
for estimating the true value θ_k from the single data point y_k. This expression is recognizable to engineers as the popular discrete, (first order) exponential filter, for which γ is usually taken as a tuning parameter.
(ii) Application: Apply the result in part (i) above to filter the following raw data, representing 25 hourly measurements of a polymer product's solution viscosity (in scaled, coded units), using γ = 0.20 and the initial condition θ̂₁ = 20.00.
 k   y_k         k   y_k
 1   20.82      14   18.65
 2   20.92      15   21.48
 3   21.46      16   21.85
 4   22.15      17   22.34
 5   19.76      18   20.35
 6   21.91      19   20.32
 7   22.13      20   22.10
 8   24.26      21   20.69
 9   20.26      22   19.74
10   20.35      23   20.27
11   18.32      24   23.33
12   19.24      25   19.69
13   19.99
Compare a time-sequence plot of the resulting filtered values, θ̂_k, with that of the raw data. Repeat with γ = 0.80 (and the same initial condition) and comment on which filter parameter value (0.20 or 0.80) provides estimates that are more representative of the dynamic behavior exhibited by the raw data. A sketch of one possible implementation follows.
14.49 Padgett and Spurrier (1990)⁶ obtained the following data set for the breaking strengths (in GPa) of carbon fibers used in making composite materials.
1.4  3.2  2.2  1.8  1.6  3.7  1.6  1.2  0.4  1.1
3.0  0.8  5.1  3.7  2.0  1.4  5.6  2.5  2.5  1.6
1.0  1.7  1.2  0.9  2.1  2.8  1.6  3.5  1.6  1.9
4.9  2.0  2.2  2.8  2.9  3.7  1.2  1.7  4.7  2.8
1.8  1.1  1.3  2.0  2.1  1.6  1.7  4.4  1.8  3.7
It is known that this phenomenon is well-modeled by the Weibull W(ζ, β) distribution; the objective is to determine values for these unknown population parameters from this sample data, considered as a random sample with n = 50. However,
⁶ Padgett, W. J. and J. D. Spurrier (1990). "Shewhart-type charts for percentiles of strength distributions." Journal of Quality Technology, 22, 283-288.
$$F(x) = 1 - \exp\left[-(x/\beta)^\zeta\right] \qquad (14.174)$$
$$\ln\{-\ln[1 - F(x)]\} = \zeta\ln x - \zeta\ln\beta \qquad (14.175)$$
Observe therefore that, given F(x) and x, a plot of ln{−ln[1 − F(x)]} versus ln x should result in a straight line with slope ζ and intercept −ζ ln β, from which the appropriate values may be determined for the two unknown parameters.
Employ the outlined technique to determine from the supplied data your best estimate of the unknown parameters. Compare your results to the "true" values β = 2.5 and ζ = 2.0 used by Padgett and Spurrier in their analysis.
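A sketch of the outlined graphical technique in code: estimate F(x) empirically (the median-rank formula below is one common choice, an assumption, not the book's prescription), then regress ln{−ln[1−F]} on ln x; numpy is assumed available. The short data list is a placeholder for the full 50-point sample.

```python
# Weibull parameter estimation via the linearized CDF of Eq (14.175).
import numpy as np

data = np.sort(np.array([1.4, 3.2, 2.2, 1.8, 1.6]))  # ...should hold all 50 values
n = len(data)
F = (np.arange(1, n + 1) - 0.3) / (n + 0.4)           # median-rank estimate of F(x)

yy = np.log(-np.log(1 - F))
xx = np.log(data)
zeta, intercept = np.polyfit(xx, yy, 1)               # slope = zeta
beta = np.exp(-intercept / zeta)                      # intercept = -zeta * ln(beta)
print(zeta, beta)
```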
14.50 Polygraphs, the so-called "lie-detector" machines based on physiological measurements such as blood pressure, respiration and perspiration, are used frequently in government agencies and other businesses where employees handle highly classified information. While polygraph test results are sometimes permitted in some state courts, they are not admissible in federal courts, in part because of potential errors and the implications of such errors for the fairness of the justice system. Since the basic premise of these machines is the measurement of human physiological variables, it is possible to evaluate the performance of polygraphs in somewhat the same manner as one would any other medical diagnostic machine. (See Phillips, Brett, and Beary (1986)⁷ for one such study carried out by a group of physicians.)
The data shown below is a compilation of the results of an extensive study (similar to the Phillips et al. study) in which a group of volunteers was divided into two equal-numbered subgroups of "truth tellers" and "liars." The tests were repeated 56 times over a period of two weeks and the results tabulated as shown: X_A is the fraction of the "truth tellers" falsely identified as "liars" by a Type A polygraph machine (i.e., false positives); X_B is the set of corresponding results for the same subjects, under conditions as close to identical as possible, using a Type B polygraph machine. Conversely, Y_A is the fraction of "liars" misidentified as "truth-tellers" (i.e., false negatives) by the Type A machine, with Y_B as the corresponding results using the Type B machine.
Postulate a reasonable probability model for the random phenomenon in question, providing a brief but adequate justification for your choice. Estimate the model parameters for the four populations and discuss how well your model fits the data.
⁷ M. Phillips, A. Brett, and J. Beary (1986). "Lie Detectors Can Make a Liar Out of You," Discover, June 1986, p. 7.
Polygraph Data
 X_A     Y_A     X_B     Y_B
0.128   0.161   0.161   0.064
0.264   0.117   0.286   0.036
0.422   0.067   0.269   0.214
0.374   0.158   0.380   0.361
0.240   0.105   0.498   0.243
0.223   0.036   0.328   0.235
0.281   0.210   0.159   0.024
0.316   0.378   0.391   0.114
0.341   0.283   0.154   0.067
0.397   0.166   0.216   0.265
0.037   0.212   0.479   0.378
0.097   0.318   0.049   0.004
0.112   0.144   0.377   0.043
0.216   0.281   0.327   0.271
0.265   0.238   0.563   0.173
0.225   0.043   0.169   0.040
0.253   0.200   0.541   0.410
0.211   0.299   0.338   0.031
0.301   0.106   0.438   0.131
0.469   0.161   0.242   0.023
0.410   0.151   0.461   0.159
0.454   0.200   0.694   0.265
0.278   0.129   0.439   0.013
0.236   0.222   0.194   0.190
0.118   0.245   0.379   0.030
0.109   0.308   0.368   0.069
0.035   0.019   0.426   0.127
0.269   0.146   0.597   0.144
Polygraph Data (continued)
 X_A     Y_A     X_B     Y_B
0.175   0.368   0.441   0.024
0.425   0.327   0.412   0.218
0.119   0.698   0.295   0.057
0.380   0.054   0.136   0.081
0.234   0.070   0.438   0.085
0.323   0.057   0.445   0.197
0.356   0.506   0.239   0.111
0.401   0.142   0.207   0.011
0.444   0.356   0.251   0.029
0.326   0.128   0.430   0.229
0.484   0.108   0.195   0.546
0.280   0.281   0.429   0.039
0.435   0.211   0.581   0.061
0.172   0.333   0.278   0.136
0.235   0.100   0.151   0.014
0.418   0.114   0.374   0.055
0.366   0.083   0.638   0.031
0.077   0.251   0.187   0.239
0.352   0.085   0.680   0.106
0.231   0.225   0.198   0.066
0.175   0.325   0.533   0.132
0.290   0.352   0.187   0.240
0.099   0.185   0.340   0.070
0.254   0.287   0.391   0.197
0.556   0.185   0.318   0.071
0.407   0.109   0.102   0.351
0.191   0.049   0.512   0.072
0.232   0.076   0.356   0.048
TABLE 14.1: Summary of estimation results

Population      Point Estimator              E(θ̂)   Var(θ̂)   (1−α) × 100% Conf. Interval Estimator
Mean, μ         X̄ = (Σᵢ₌₁ⁿ Xᵢ)/n             μ      σ²/n      X̄ ± z_{α/2} σ/√n   (σ known)
                                                              X̄ ± t_{α/2}(n−1) s/√n   (σ unknown; n < 30)
Variance, σ²    S² = Σᵢ₌₁ⁿ(Xᵢ−X̄)²/(n−1)      σ²     N/A       (n−1)S²/χ²_{α/2}(n−1) < σ² < (n−1)S²/χ²_{1−α/2}(n−1)
Proportion, p   p̂ = X/n                      p      pq/n      p̂ ± z_{α/2}√(p̂q̂/n)
(Binomial)
TABLE 14.2: Sampling distributions and their conjugate prior distributions

Sampling Distribution, f(x|θ)                           Conjugate Prior, f(θ)
Binomial:                                               Beta, B(α, β):
  C₁ θˣ(1−θ)ⁿ⁻ˣ                                           C₂ θ^{α−1}(1−θ)^{β−1}
Poisson:                                                Gamma, γ(α, β):
  (∏ᵢ₌₁ⁿ λ^{xᵢ}) e^{−nλ} / ∏ᵢ₌₁ⁿ xᵢ!                       C₂ λ^{α−1} e^{−λ/β}
Exponential:                                            Gamma, γ(α, β):
  (1/βⁿ) e^{−(Σᵢ₌₁ⁿ xᵢ)/β}                                C₂ θ^{α−1} e^{−θ/β}
Gaussian (σ² known), for μ:                             Gaussian, N(μ₀, v²):
  (2πσ²)^{−n/2} exp{−Σᵢ₌₁ⁿ(xᵢ−μ)²/(2σ²)}                  (1/(v√2π)) exp{−(μ−μ₀)²/(2v²)}
Gaussian (μ known), for θ = σ²:                         Inverse Gamma, IG(α, β):
  (2πθ)^{−n/2} exp{−Σᵢ₌₁ⁿ(xᵢ−μ)²/(2θ)}                    C θ^{−(α+1)} e^{−β/θ}
Chapter 15
Hypothesis Testing
15.1 Introduction ................................................. 552
15.2 Basic Concepts ............................................... 553
     15.2.1 Terminology and Definitions .......................... 554
            Statistical Hypothesis ............................... 554
            Test Statistic, Critical Region and Significance Level 556
            Potential Errors, Risks, and Power ................... 557
            Sensitivity and Specificity .......................... 558
            The p-value .......................................... 559
     15.2.2 General Procedure .................................... 560
15.3 Concerning Single Mean of a Normal Population ................ 561
     15.3.1 σ Known; the z-test .................................. 563
            Using MINITAB ........................................ 567
     15.3.2 σ Unknown; the t-test ................................ 570
            Using MINITAB ........................................ 573
     15.3.3 Confidence Intervals and Hypothesis Tests ............ 573
15.4 Concerning Two Normal Population Means ....................... 576
     15.4.1 Population Standard Deviations Known ................. 576
     15.4.2 Population Standard Deviations Unknown ............... 578
            Equal standard deviations ............................ 578
            Using MINITAB ........................................ 580
            Unequal standard deviations .......................... 581
            Confidence Intervals and Two-Sample tests ............ 581
            An Illustrative Example: The Yield Improvement Problem 582
     15.4.3 Paired Differences ................................... 585
15.5 Determining β, Power, and Sample Size ........................ 589
     15.5.1 β and Power .......................................... 591
     15.5.2 Sample Size .......................................... 593
            Practical Considerations ............................. 596
     15.5.3 β and Power for Lower-Tailed and Two-Sided Tests ..... 598
     15.5.4 General Power and Sample Size Considerations ......... 599
15.6 Concerning Variances of Normal Populations ................... 600
     15.6.1 Single Variance ...................................... 601
     15.6.2 Two Variances ........................................ 603
15.7 Concerning Proportions ....................................... 606
     15.7.1 Single Population Proportion ......................... 607
            Large Sample Approximations .......................... 608
            Exact Tests .......................................... 609
     15.7.2 Two Population Proportions ........................... 610
15.8 Concerning Non-Gaussian Populations .......................... 612
     15.8.1 Large Sample Test for Means .......................... 613
     15.8.2 Small Sample Tests ................................... 614
15.9 Likelihood Ratio Tests ....................................... 616
     15.9.1 General Principles ................................... 616
     15.9.2 Special Cases ........................................ 618
            Normal Population; Known Variance .................... 619
Since turning our attention fully to Statistics in Part IV, our focus has been on characterizing the population completely, using finite-sized samples. The discussion that began with sampling in Chapter 13, providing the mathematical foundation for characterizing the variability in random samples, and which continued with estimation in Chapter 14, providing techniques for determining values for population parameters, concludes in this chapter with hypothesis testing. This final tier of the statistical inference edifice is concerned with making and testing assertive statements about the population. Such statements are often necessary to solve practical problems, or to answer questions of practical importance; and this chapter is devoted to presenting the principles, practice and mechanics of testing the validity of hypothesized statements regarding the distribution of populations. The chapter covers extensive ground, from traditional techniques applied to traditional Gaussian problems, to non-Gaussian problems and some non-traditional techniques; it ends with a brief but frank discussion of persistent criticisms of hypothesis tests and some practical recommendations for handling such criticisms.
15.1 Introduction
To deal with the problem systematically, inherent random variability compels us to start by characterizing the populations fully with pdfs, which are then used to answer these questions. This requires that we postulate an appropriate probability model and determine values for the unknown parameters from sample data.
Here is what we know thus far (from Chapters 1 and 12, and from the various examples in Chapter 14): we have plotted histograms of the data and postulated that these are samples from Gaussian-distributed populations; we have computed sample averages, ȳ_A, ȳ_B, and sample standard deviations, s_A, s_B; and in the various Chapter 14 examples, we have obtained point and interval estimates for the population means μ_A, μ_B, and the population standard deviations σ_A, σ_B.
But by themselves, these results are not quite sufficient to answer the questions posed above. To answer the questions, consider the following statements and the implications of being able to confirm or refute them:
1. Y_A is a random variable characterized by a normal population with mean value 75.5 and standard deviation 1.5, i.e., Y_A ~ N(75.5, 1.5²); similarly, Y_B ~ N(72.5, 2.5²); as a consequence,
2. The random variables Y_A and Y_B are not from the same distribution, because μ_A ≠ μ_B and σ²_A ≠ σ²_B; in particular, μ_A > μ_B;
3. Furthermore, μ_A − μ_B > 2.
This is a collection of assertions about these two populations, statements which, if confirmed, will enable us to answer the questions raised. For example, Statement #1 will allow us to answer Question 1 by making it possible to compute the probabilities P(Y_A ≥ 74.5) and P(Y_B ≥ 74.5); Statement #2 will allow us to answer Question 2, and Statement #3, Question 3. How practical problems are formulated as statements of this type, and how such statements are confirmed or refuted, all fall under the formal subject matter of hypothesis testing. In general, the validity of such statements (or other assumptions about the population from which the sample data were obtained) is checked by:
1. Proposing an appropriate statistical hypothesis about the problem at hand; and
15.2 Basic Concepts

15.2.1 Terminology and Definitions
Before launching into a discussion of the principles and mechanics of hypothesis testing, it is important first to introduce some terminology and definitions.
Statistical Hypothesis
A statistical hypothesis is a statement (an assertion or postulate) about the distribution of one or more populations. (Theoretically, the statistical hypothesis is a statement regarding one or more postulated distributions for the random variable X, distributions for which the statement is presumed to be true. A simple hypothesis specifies a single distribution for X; a composite hypothesis specifies more than one distribution for X.)
Modern hypothesis testing involves two hypotheses:
1. The null hypothesis, H₀, which represents the primary, "status quo" hypothesis that we are predisposed to believe as true (a plausible explanation of the observation), unless there is evidence in the data to indicate otherwise, in which case it will be rejected in favor of a postulated alternative.
2. The alternative hypothesis, H_a, the carefully defined complement to H₀ that we are willing to consider in replacement if H₀ is rejected.
For example, the portion of Statement #1 above concerning Y_A may be formulated more formally as:
$$H_0: \mu_A = 75.5$$
$$H_a: \mu_A \neq 75.5 \qquad (15.1)$$
The implication here is that we are willing to entertain the fact that the true value of μ_A, the mean value of the yield obtainable from process A, is 75.5; that any deviation of the sample data average from this value is due to purely random variability and is not significant (i.e., that this postulate explains the observed data). The alternative is that any observed difference between the sample average and 75.5 is real, and not just due to random variability; that the alternative provides a better explanation of the data. Observe that this alternative makes no distinction between values that are less than 75.5 or greater; so long as there is evidence that the observed sample average is different from 75.5 (whether greater than or less than), H₀ is to be rejected in favor of this H_a. Under these circumstances, since the alternative admits of values of μ_A that can be less than 75.5 or greater than 75.5, it is called a two-sided hypothesis.
It is also possible to formulate the problem such that the alternative actually "chooses sides," for example:
$$H_0: \mu_A = 75.5$$
$$H_a: \mu_A < 75.5 \qquad (15.2)$$
In this case, when the evidence in the data does not support H₀, the only other option is that μ_A < 75.5. Similarly, if the hypotheses are formulated instead as:
$$H_0: \mu_A = 75.5$$
$$H_a: \mu_A > 75.5 \qquad (15.3)$$
the alternative, if the equality conjectured by the null hypothesis fails, is that the mean must then be greater. These are one-sided hypotheses, for obvious reasons.
A test of a statistical hypothesis is a procedure for deciding when to reject H₀. The conclusion of a hypothesis test is either a decision to reject H₀ in favor of H_a, or else to fail to reject H₀. Strictly speaking, one never actually accepts a hypothesis; one just fails to reject it.
As one might expect, the conclusion drawn from a hypothesis test is shaped by how H_a is framed in contrast to H₀. How to formulate H_a appropriately is best illustrated with an example.
Example 15.1: HYPOTHESES FORMULATION FOR COMPARING ENGINEERING TRAINING PROGRAMS
As part of an industrial training program for chemical engineers in their junior year, some trainees are instructed by Method A, and some by Method B. If random samples of size 10 each are taken from large groups of trainees instructed by each of these two techniques, and each trainee's score on an appropriate achievement test is shown below, formulate a null hypothesis H₀, and an appropriate alternative H_a, to use in testing the claim that "Method B is more effective."
Method A: 71  75  65  69  73  66  68  71  74  68
Method B: 72  77  84  78  69  70  77  73  65  75
Solution:
We do return to this example later to provide a solution to the problem posed; for now, we address only the issue of formulating the hypotheses to be tested.
Let μ_A represent the true mean score for engineers trained by Method A, and μ_B, the true mean score for those trained by the other method. The status quo postulate is to presume that there is no difference between the two methods; that any observed difference is due to pure chance alone. The key now is to inquire: if there is evidence in the data that contradicts this status quo postulate, what end result are we interested in testing this evidence against? Since the claim we are interested in confirming or refuting is that "Method B is more effective," then the proper formulation of the hypotheses to be tested is as follows:
$$H_0: \mu_A = \mu_B$$
$$H_a: \mu_A < \mu_B \qquad (15.4)$$
By formulating the problem in this fashion, any evidence that contradicts the null hypothesis will cause us to reject it in favor of something that is actually relevant to the problem at hand.
Note that in this case, specifying H_a as μ_A ≠ μ_B does not help us answer the question posed; by the same token, neither does specifying H_a as μ_A > μ_B, because if it is true that μ_A < μ_B, then the evidence in the data will not support the alternative, a circumstance which, by default, will manifest as a misleading "lack of evidence" to reject H₀.
FIGURE 15.1: A distribution for the null hypothesis, H₀, in terms of the test statistic, Q_T, where the shaded rejection region, Q_T > q, indicates a significance level, α
for the random variable, X; but since the random sample from X is usually converted to a test statistic, there is a corresponding mapping of this region by Q_T(·); it is therefore acceptable to refer to the critical region in terms of the test statistic.
Now, because the estimator U is a random variable, the test statistic will itself also be a random variable, with the following serious implication: there is a non-zero probability that Q_T ∈ R_C even when H₀ is true. This unavoidable consequence of random variability forces us to design the hypothesis test such that H₀ is rejected only if it is "highly unlikely" for Q_T ∈ R_C when H₀ is true. How unlikely is "highly unlikely"? This is quantified by specifying a value α such that
$$P(Q_T \in R_C | H_0\ \mathrm{true}) \leq \alpha \qquad (15.5)$$
with the implication that the probability of rejecting H₀ when it is in fact true is never greater than α. This quantity, often set in advance as a small value (typically 0.1, 0.05, or 0.01), is called the significance level of the test. Thus, the significance level of a test is the upper bound on the probability of rejecting H₀ when it is true; it determines the boundaries of the critical region R_C.
These concepts are illustrated in Fig 15.1, and lead directly to the consideration of the potential errors to which hypothesis tests are susceptible, the associated risks, and the sensitivity of a test in leading to the correct decision.
TABLE 15.1: Hypothesis test decisions and risks

                   DECISION
TRUTH        Fail to reject H₀           Reject H₀
H₀ True      Correct decision            Type I error (risk: α)
H_a True     Type II error (risk: β)     Correct decision

$$\alpha = P(\mathrm{Type\ I\ error}) = P(Q_T \in R_C | H_0\ \mathrm{true}) \qquad (15.6)$$
$$\beta = P(\mathrm{Type\ II\ error}) = P(Q_T \notin R_C | H_a\ \mathrm{true}) \qquad (15.7)$$
so that the probability of committing a Type II error is called the β-risk. The probability of correctly rejecting a null hypothesis that is false is therefore (1 − β).
It is important now to note that the two correct decisions and the probabilities associated with each one are fundamentally different. Primarily because H₀ is the "status quo" hypothesis, correctly rejecting a null hypothesis, H₀, that is false is of greater interest, because such an outcome indicates that the test has detected the occurrence of something significant. Thus (1 − β), the probability of correctly rejecting the false null hypothesis when the alternative hypothesis is true, is known as the power of the test. It provides a measure of the sensitivity of the test. These concepts are summarized in Table 15.1 and also in Fig 15.2.
Sensitivity and Specificity
Because their results are binary decisions (reject H₀ or fail to reject it), hypothesis tests belong in the category of binary classification tests; and the
FIGURE 15.2: Overlapping distributions for the null hypothesis, H₀ (with mean μ₀), and alternative hypothesis, H_a (with mean μ_a), showing the Type I and Type II error risks α, β, along with q_C, the boundary of the critical region of the test statistic, Q_T
effectiveness of such tests is characterized in terms of sensitivity and specificity. The sensitivity of a test is the percentage of true positives (in this case, H₀ deserving of rejection) that it correctly classifies as such. The specificity is the percentage of true negatives (H₀ that should not be rejected) that is correctly classified as such. Sensitivity therefore measures the ability to identify true positives correctly; specificity, the ability to identify true negatives correctly.
These performance measures are related to the risks and errors discussed previously. If the percentages are expressed as probabilities, then sensitivity is (1 − β), and specificity, (1 − α). The fraction of false positives (H₀ that should not be rejected but is) is α; the fraction of false negatives (H₀ that should be rejected but is not) is β. As we show later, for a fixed sample size, improving one measure can only be achieved at the expense of the other, i.e., improvements in specificity must be traded off for a commensurate loss of sensitivity, and vice versa.
The p-value
Rather than fix the significance level, α, ahead of time, suppose it is free to vary. For any given value of α, let the corresponding critical/rejection region be represented as R_C(α). As discussed above, H₀ is rejected whenever the test statistic, Q_T, is such that Q_T ∈ R_C(α). For example, from Fig 15.1, the region R_C(α) is the set of all values of Q_T that exceed the specific value q. Observe that as α decreases, the size of the set R_C(α) also decreases, and vice versa. The smallest value of α for which the specific value of the test statistic Q_T(x₁, x₂, …, xₙ) (determined from the data set x₁, x₂, …, xₙ) falls in the critical region (i.e., Q_T(x₁, x₂, …, xₙ) ∈ R_C(α)) is known as the p-value associated with this data set (and the resulting test statistic). Technically,
$$p = P(Q_T \geq q | H_0\ \mathrm{true}) \qquad (15.8)$$
In words, this is the probability of obtaining the specific test statistic value, q, or something more extreme, if the null hypothesis is true. Note that p, being a function of a statistic, is itself a statistic, a subtle point that is often easy to miss; the implication is that p is itself subject to purely random variability.
Knowing the p-value therefore allows us to carry out hypothesis tests at any significance level, without restriction to pre-specified values. In general, a low value of p indicates that, given the evidence in the data, the null hypothesis, H₀, is highly unlikely to be true. This follows from Eq (15.8). H₀ is then rejected at the significance level p, which is why the p-value is sometimes referred to as the "observed significance level", observed from the sample data, as opposed to being fixed, à-priori, at some pre-specified value, α.
Nevertheless, in many applications (especially in scientific publications), there is an enduring traditional preference for employing fixed significance levels (usually α = 0.05). In this case, the p-value is used to make decisions as follows: if p < α, H₀ will be rejected at the significance level α; if p > α, we fail to reject H₀ at the same significance level α.
15.2.2 General Procedure
The general procedure for carrying out modern hypothesis tests is as follows:
1. Define H₀, the hypothesis to be tested, and pair it with the alternative, H_a, formulated appropriately to answer the question at hand;
2. Obtain sample data, and from it, the test statistic relevant to the problem at hand;
3. Make a decision about H₀ as follows: Either
(a) Specify the significance level, α, at which the test is to be performed, and hence determine the critical region (equivalently, the critical value of the test statistic) that will trigger rejection; then
(b) Evaluate the specific test statistic value in relation to the critical region, and reject, or fail to reject, H₀ accordingly;
or else,
15.3 Concerning Single Mean of a Normal Population

$$H_0: \mu_A = 75.5$$
$$H_a: \mu_A \neq 75.5 \qquad (15.10)$$
Next, we need to gather evidence in the form of sample data from process A. Such data, with n = 50, was presented in Chapter 1 (and employed in the examples of Chapter 14), from which we have obtained a sample average,
$\bar{y}_A$. From sampling theory, we know that
$$Z = \frac{\bar{Y}_A - \mu_A}{\sigma/\sqrt{n}} \qquad (15.11)$$
has the standard normal distribution, provided that σ is known. This immediately suggests, within the context of hypothesis testing, that the following test statistic:
$$Z = \frac{\bar{y}_A - 75.5}{1.5/\sqrt{n}} \qquad (15.12)$$
may be used to test the validity of the hypothesis, for any sample average computed from any sample data set of size n. This is because we can use Z and its pdf to determine the critical/rejection region. In particular, by specifying a significance level, α = 0.05, the rejection region is determined as the values z such that:

RC = {z | z < −z0.025; z > z0.025}    (15.13)

(because this is a two-sided test). From the cumulative probability characteristics of the standard normal distribution, we obtain (using computer programs such as MINITAB) z0.025 = 1.96 as the value of the standard normal variate for which P(Z > z0.025) = 0.025, i.e.,

RC = {z | z < −1.96; z > 1.96}; or |z| > 1.96    (15.14)

The implication: if the specific value computed for Z from any sample data set exceeds 1.96 in absolute value, H0 will be rejected.
In the specific case of ȳA = 75.52 and n = 50, we obtain a specific value for this test statistic as z = 0.094. And now, because this value z = 0.094 does not lie in the critical/rejection region defined in Eq (15.14), we conclude that there is no evidence to reject H0 in favor of the alternative. The data does not contradict the hypothesis.
Alternatively, we could compute the p-value associated with this test statistic (for example, using the cumulative probability feature of MINITAB):

P(z > 0.094 or z < −0.094) = P(|z| > 0.094) = 0.925    (15.15)
This p-value is very high at 0.925. Thus, there is no evidence in this data set to justify rejecting H0. From a different perspective, note that this p-value is nowhere close to being lower than the prescribed significance level, α = 0.05; we therefore fail to reject the null hypothesis at this significance level.
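The entire process A test can be reproduced in a few lines of Python; this is a minimal sketch using scipy.stats, with the numbers taken from the example above:

import math
from scipy import stats

# Process A yield data: H0: mu = 75.5 vs Ha: mu != 75.5; sigma = 1.5 known
n, ybar, mu0, sigma = 50, 75.52, 75.5, 1.5
z = (ybar - mu0) / (sigma / math.sqrt(n))   # Eq (15.12); z = 0.094
p = 2 * stats.norm.sf(abs(z))               # two-sided p-value; ~0.925
print(f"z = {z:.3f}, p = {p:.3f}")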
The ideas illustrated by this example can now be generalized. As with previous discussions in Chapter 14, we organize the material according to the status of the population standard deviation, σ, because whether it is known or not determines which sampling distribution, and hence which test statistic, is appropriate.
15.3.1 σ Known: The One-Sample z-Test

The problem is that of testing, for a normal population with known standard deviation σ, the null hypothesis:

H0 : µ = µ0    (15.16)

against one of three alternatives:

Ha : µ < µ0    (15.17)

for the lower-tailed, one-sided (or one-tailed) alternative hypothesis;

Ha : µ > µ0    (15.18)

for the upper-tailed, one-sided (or one-tailed) alternative hypothesis; or, finally, as illustrated above,

Ha : µ ≠ µ0    (15.19)

for the two-sided (or two-tailed) alternative.
Assumptions: The underlying distribution in question is Gaussian, with known standard deviation, σ, implying that the sampling distribution of X̄ is also Gaussian, with mean, µ0, and variance, σ²/n, if H0 is true. Hence, the random variable Z = (X̄ − µ0)/(σ/√n) has a standard normal distribution, N(0, 1).
Test statistic: The appropriate test statistic is therefore

Z = (X̄ − µ0) / (σ/√n)    (15.20)
FIGURE 15.3: The standard normal variate z = −zα with tail area probability α. The shaded portion is the rejection region for a lower-tailed test, Ha : µ < µ0.
The specific value obtained for a particular sample data average, x̄, is sometimes called the z-score of the sample data.
Critical/Rejection Regions:
(i) For lower-tailed tests (with Ha : µ < µ0), reject H0 in favor of Ha if:

z < −zα    (15.21)

where zα is the value of the standard normal variate, z, with a tail area probability of α; i.e., P(z > zα) = α. By symmetry, P(z < −zα) = P(z > zα) = α, as shown in Fig 15.3. The rationale is that if µ = µ0 is true, then it is highly unlikely that z will be less than −zα by pure chance alone; it is more likely that µ is systematically less than µ0 if z is less than −zα.
(ii) For upper-tailed tests (with Ha : µ > µ0), reject H0 in favor of Ha if (see Fig 15.4):

z > zα    (15.22)

(iii) For two-sided tests (with Ha : µ ≠ µ0), reject H0 in favor of Ha if:

z < −zα/2 or z > zα/2    (15.23)

where, in this case, the two tail area probabilities sum to the significance level:

α/2 + α/2 = α    (15.24)
FIGURE 15.4: The standard normal variate z = zα with tail area probability α. The shaded portion is the rejection region for an upper-tailed test, Ha : µ > µ0.
FIGURE 15.5: Symmetric standard normal variates z = −zα/2 and z = zα/2 with identical tail area probabilities α/2. The shaded portions show the rejection regions for a two-sided test, Ha : µ ≠ µ0.
TABLE 15.2: Summary of H0 rejection conditions for the one-sample z-test

Testing Against    Reject H0 if: (general α)     Reject H0 if: (α = 0.05)
Ha : µ < µ0        z < −zα                       z < −1.65
Ha : µ > µ0        z > zα                        z > 1.65
Ha : µ ≠ µ0        z < −zα/2 or z > zα/2         z < −1.96 or z > 1.96
For the process B yield data, the corresponding hypotheses are:

H0 : µB = 72.5
Ha : µB ≠ 72.5    (15.25)

With the sample average obtained from the n = 50 data points as ȳB = 72.47, and the population standard deviation given as σ = 2.5, the specific test statistic value is:

z = (72.47 − 72.5) / (2.5/√50) = −0.084    (15.26)

For this two-sided test, the critical value to the right, zα/2, for α = 0.05, is:

z0.025 = 1.96    (15.27)

so that the critical/rejection region, RC, is z > 1.96 to the right, in conjunction with z < −1.96 to the left, by symmetry (recall Eq (15.14)).
And now, because the specific value z = −0.084 does not lie in the critical/rejection region, we find no evidence to reject H0 in favor of the alternative. We conclude therefore that YB is very likely well-characterized by the postulated distribution.
We could also compute the p-value associated with this test statistic:

P(z < −0.084 or z > 0.084) = P(|z| > 0.084) = 0.933    (15.28)
Using MINITAB
It is instructive to walk through the typical procedure for carrying out such
z-tests using computer software, in this case, MINITAB. From the MINITAB
drop down menu, the sequence Stat > Basic Statistics > 1-Sample Z
opens a dialog box that allows the user to carry out the analysis either using data already stored in MINITAB worksheet columns or from summarized
data. Since we already have summarized data, upon selecting the Summarized data option, one enters 50 into the Sample size: dialog box, 72.47
into the Mean: box, and 2.5 into the Standard deviation: box; and upon
slecting the Perform hypothesis test option, one enters 72.5 for the Hypothesized mean. The OPTIONS button allows the user to select the condence
level (the default is 95.0) and the Alternative for Ha : with the 3 available options displayed as less than, not equal, and greater than. The
MINITAB results are displayed as follows:
One-Sample Z
Test of mu = 72.5 vs not = 72.5
The assumed standard deviation = 2.5

 N    Mean  SE Mean        95% CI           Z      P
50  72.470    0.354  (71.777, 73.163)   -0.08  0.932
This output links hypothesis testing directly with estimation (as we anticipated in Chapter 14, and as we discuss further below) as follows: SE Mean is the standard error of the mean (σ/√n), from which the 95% confidence interval (shown in the MINITAB output as 95% CI) is obtained as (71.777, 73.163). Observe that the hypothesized mean, 72.5, is contained within this interval, with the implication that, since, at the 95% confidence level, the estimated average encompasses the hypothesized mean, we have no reason to reject H0 at the significance level of 0.05. The z statistic computed according to Eq (15.20),
z = (72.47 − 72.5) / (2.5/√50) = −0.08    (15.29)

and the associated p-value,

P(|Z| > 0.08) = 0.932    (15.30)

likewise agree with the values obtained earlier by direct computation.
Hypothesis Testing
569
Consider the following problem: ACME's standard rat poison is designed to act within 1000 secs, on average, with a standard deviation of 125 secs. Experimental tests were conducted in an affiliated toxicology laboratory in which pellets were made with a newly developed formulation and administered to 64 rats (selected at random from an essentially identical population). The results showed an average acting time, x̄ = 1028 secs. The ACME scientists, anxious to declare a breakthrough, were preparing to approach management immediately to argue that the observed excess 28 secs, when compared to the stipulated standard deviation of 125 secs, is small and insignificant. The group statistician, in an attempt to present an objective, statistically-sound argument, recommended instead that a hypothesis test should first be carried out to rule out the possibility that the mean acting time is still greater than 1000 secs. Assuming that the acting time measurements are normally distributed, carry out an appropriate hypothesis test and, at the significance level of α = 0.05, make an informed recommendation regarding the tested rat poison's acting time.
Solution:
For this problem, the null and alternative hypotheses are:

H0 : µ ≤ 1000
Ha : µ > 1000    (15.31)

The alternative has been chosen this way because the concern is that the acting time may still be greater than 1000 secs. As a result of the normality assumption, and the fact that σ is specified as 125, the required test is the z-test, where the specific z-score, from Eq (15.20), in this case is:

z = (1028 − 1000) / (125/√64) = 1.792    (15.32)

The critical value, zα, for α = 0.05 for this upper-tailed, one-sided test is:

z0.05 = 1.65    (15.33)

obtained from MINITAB using the inverse cumulative probability feature for the standard normal probability distribution with tail area probability 0.05, i.e.,

P(Z > 1.65) = 0.05    (15.34)

Thus, the rejection region, RC, is z > 1.65. And now, because z = 1.79 falls into the rejection region, the decision is to reject the null hypothesis at the 5% level. Alternatively, the p-value associated with this test statistic can be obtained (also from MINITAB, using the cumulative probability feature) as:

P(z > 1.792) = 0.037    (15.35)
Thus, from these equivalent perspectives, the conclusion is that the experimental evidence does not support the ACME scientists' premature declaration of a breakthrough; the observed excess 28 secs, in fact, appears to be significant at the α = 0.05 significance level.
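A hedged Python rendition of this upper-tailed test, using the summary statistics given in the example (scipy.stats assumed available):

import math
from scipy import stats

# H0: mu <= 1000 vs Ha: mu > 1000, with sigma = 125 known
n, xbar, mu0, sigma = 64, 1028.0, 1000.0, 125.0
z = (xbar - mu0) / (sigma / math.sqrt(n))    # 1.792
z_crit = stats.norm.ppf(0.95)                # ~1.645 (book rounds to 1.65)
p = stats.norm.sf(z)                         # upper-tail p-value; ~0.037
print(f"z = {z:.3f}, critical value = {z_crit:.3f}, p = {p:.3f}")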
Using the procedure illustrated previously, the MINITAB results for this
problem are displayed as follows:
One-Sample Z
Test of mu = 1000 vs > 1000
The assumed standard deviation = 125

 N    Mean  SE Mean  95% Lower Bound     Z      P
64  1028.0     15.6           1002.3  1.79  0.037
Observe that the z- and p-values agree with what we had obtained earlier; furthermore, the additional entries, SE Mean, for the standard error of the mean, 15.6, and the 95% lower bound on the estimate for the mean, 1002.3, link this hypothesis test to interval estimation. This connection will be explored more fully later in this section; for now, we note simply that the 95% lower bound on the estimate for the mean, 1002.3, lies entirely to the right of the hypothesized mean value of 1000. The implication is that, at the 95% confidence level, it is more likely that the true mean is higher than the value hypothesized; we are therefore more inclined to reject the null hypothesis in favor of the alternative, at the significance level 0.05.
15.3.2 σ Unknown: The One-Sample t-Test
When the population standard deviation, σ, is unknown, the sample standard deviation, s, will have to be substituted for it. In this case, one of two things can happen:
1. If the sample size is sufficiently large (for example, n > 30), s is usually considered to be a good enough approximation to σ, that the z-test can be applied, treating s as equal to σ.
2. When the sample size is small, substituting s for σ changes the test statistic and the corresponding test, as we now discuss.

For small sample sizes, when S is substituted for σ, the appropriate test statistic becomes

T = (X̄ − µ0) / (S/√n)    (15.36)

which, from our discussion of sampling distributions, is known to possess a Student's t-distribution, with ν = n − 1 degrees of freedom. This is the small sample size equivalent of Eq (15.20).
Once more, because of the test statistic, and the sampling distribution upon which the test is based, this test is known as a t-test. Therefore, the rejection conditions for H0 are as summarized in Table 15.3.
TABLE 15.3: Summary of H0 rejection conditions for the one-sample t-test

Testing Against    Reject H0 if: (general α)
Ha : µ < µ0        t < −tα(ν)
Ha : µ > µ0        t > tα(ν)
Ha : µ ≠ µ0        t < −tα/2(ν) or t > tα/2(ν)

(ν = n − 1)
In the first part of the example that follows, scores obtained by 10 trainees instructed with method A are to be compared against a target mean score of 75, where the concern is that the method A mean score falls short of this target; the hypotheses are therefore:

H0 : µA = 75
Ha : µA < 75    (15.37)

From the sample data, the sample average is obtained as x̄A = 69.0,
with a sample standard deviation, sA = 4.85; the specific T statistic value is thus obtained as:

t = (69.0 − 75.0) / (4.85/√10) = −3.91    (15.38)

Because this is a lower-tailed, one-sided test, the critical value, −t0.05(9), is obtained as −1.833 (using MINITAB's inverse cumulative probability feature, for the t-distribution with 9 degrees of freedom). The rejection region, RC, is therefore t < −1.833. Observe that the specific t-value for this test lies well within this rejection region; we therefore reject the null hypothesis in favor of the alternative, at the significance level 0.05.
Of course, we could also compute the p-value associated with this particular test statistic; and from the t-distribution with 9 degrees of freedom we obtain,

P(T(9) < −3.91) = 0.002    (15.39)
For method B, the question is simply whether or not the mean score differs from the target, so that the hypotheses are:

H0 : µB = 75
Ha : µB ≠ 75    (15.40)
From the supplied data, the sample average and standard deviation are obtained respectively as x̄B = 74.0, and sB = 5.40, so that the specific value for the T statistic is:

t = (74.0 − 75.0) / (5.40/√10) = −0.59    (15.41)

Since this is a two-tailed test, the critical values, t0.025(9) and its mirror image −t0.025(9), are obtained from MINITAB as 2.26 and −2.26, implying that the critical/rejection region, RC, in this case is t < −2.26 or t > 2.26. But the specific value for the t-statistic (−0.59) does not lie in this region; we therefore do not reject H0 at the significance level 0.05.
The associated p-value, obtained from a t-distribution with 9 degrees of freedom, is:

P(t(9) < −0.59 or t(9) > 0.59) = P(|t(9)| > 0.59) = 0.572    (15.42)

with the implication that we do not reject the null hypothesis, either on the basis of the p-value, or else at the 0.05 significance level, since p = 0.572 is larger than 0.05.
Thus, observe that with these two t-tests, we have established, at a significance level of 0.05, that the mean score obtained by trainees using method A is less than 75, while the mean score for trainees using method B is essentially equal to 75. We can, of course, infer from this that method B must be more effective. But there are more direct methods for carrying out tests to compare two means directly, which will be considered shortly.
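Both one-sample t-tests can also be reproduced from the summary statistics alone; a minimal Python sketch (scipy.stats assumed) for the Method A test is:

import math
from scipy import stats

# Lower-tailed one-sample t-test: H0: mu = 75 vs Ha: mu < 75
n, xbar, s, mu0 = 10, 69.0, 4.85, 75.0
t = (xbar - mu0) / (s / math.sqrt(n))     # Eq (15.36); ~ -3.91
t_crit = stats.t.ppf(0.05, df=n - 1)      # ~ -1.833
p = stats.t.cdf(t, df=n - 1)              # lower-tailed p-value; ~0.002
print(f"t = {t:.2f}, critical value = {t_crit:.3f}, p = {p:.3f}")

(With the raw scores in an array, scipy.stats.ttest_1samp(data, 75, alternative="less") yields the same result in recent versions of SciPy.)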
Using MINITAB
MINITAB can be used to carry out these t-tests directly (without having to
compute, by ourselves, rst the test statistic and then the critical region, etc).
After entering the data into separate columns, Method A and Method B
in a MINITAB worksheet, for the rst problem, the sequence Stat > Basic
Statistics > 1-Sample t from the MINITAB drop down menu opens a dialog box where one selects the column containing the data, (Method A);
and upon selecting the Perform hypothesis test option, one enters the appropriate value for the Hypothesized mean (75) and with the OPTIONS
button one selects the desired Alternative for Ha (less than) along with the
default condence level (95.0).
MINITAB provides three self-explanatory graphical options: Histogram
of data; Individual value plot; and Boxplot of data. Our discussion in
Chapter 12 about graphical plots for small sample data sets recommends that,
with n = 10 in this case, the box plot is more reasonable than the histogram
for this example.
The resulting MINITAB outputs are displayed as follows:
One-Sample T: Method A
Test of mu = 75 vs < 75

Variable   N   Mean  StDev  SE Mean  95% Upper Bound      T      P
Method A  10  69.00   4.85     1.53            71.81  -3.91  0.002
The box plot, along with the 95% confidence interval estimate and the hypothesized mean H0: µ = 75, is shown in Fig 15.6. The conclusion to reject the null hypothesis in favor of the alternative is clear.
In dealing with the second problem regarding Method B, we follow the
same procedure, selecting data in the Method B column, but this time, the
Alternative is selected as not equal. The MINITAB results are displayed
as follows:
One-Sample T: Method B
Test of mu = 75 vs not = 75

Variable   N   Mean  StDev  SE Mean          95% CI       T      P
Method B  10  74.00   5.40     1.71  (70.14, 77.86)   -0.59  0.572
The box plot, along with the 95% confidence interval for the mean and the hypothesized mean H0: µ = 75, is shown in Fig 15.7.
FIGURE 15.6: Box plot for Method A scores including the null hypothesis mean, H0: µ = 75, shown along with the sample average, x̄, and the 95% confidence interval based on the t-distribution with 9 degrees of freedom. Note how the upper bound of the 95% confidence interval lies to the left of, and does not touch, the postulated H0 value.
FIGURE 15.7: Box plot for Method B scores including the null hypothesis mean, H0: µ = 75, shown along with the sample average, x̄, and the 95% confidence interval based on the t-distribution with 9 degrees of freedom. Note how the 95% confidence interval includes the postulated H0 value.
15.3.3 Confidence Intervals and Hypothesis Tests
In general, when the hypothesized value lies outside the (1 − α) × 100% confidence interval, H0 is rejected at the corresponding significance level. Again, this is illustrated in part 1 of Example 15.4. The upper bound of the 95% confidence interval on the average method A score was obtained as 71.81, which is lower than the postulated average of 75, thereby triggering the rejection of H0 in favor of Ha, at the 0.05 significance level (see Fig 15.6). Conversely, when the hypothesized value, µ0, is to the left of this upper bound, we will fail to reject H0 at the 0.05 significance level.
15.4 Concerning Two Population Means

The concern here is the difference between the means of two distinct normal populations,

δ = µ1 − µ2    (15.43)

for which the null hypothesis is:

H0 : µ1 − µ2 = δ0    (15.44)
when the difference between the two population means is postulated as some value δ0, and the hypothesis is to be tested against the usual triplet of possible alternatives:

Lower-tailed Ha : µ1 − µ2 < δ0    (15.45)

Upper-tailed Ha : µ1 − µ2 > δ0    (15.46)

Two-tailed Ha : µ1 − µ2 ≠ δ0    (15.47)
15.4.1 Population Standard Deviations Known: The Two-Sample z-Test
TABLE 15.4: Summary of H0 rejection conditions for the two-sample z-test

Testing Against         Reject H0 if: (general α)     Reject H0 if: (α = 0.05)
Ha : µ1 − µ2 < δ0       z < −zα                       z < −1.65
Ha : µ1 − µ2 > δ0       z > zα                        z > 1.65
Ha : µ1 − µ2 ≠ δ0       z < −zα/2 or z > zα/2         z < −1.96 or z > 1.96
When the two population standard deviations, σ1 and σ2, are known, the appropriate test statistic is:

Z = [(X̄1 − X̄2) − δ0] / √(σ1²/n1 + σ2²/n2) ~ N(0, 1)    (15.48)
where n1 and n2 are the sizes of the samples drawn from populations 1 and 2 respectively. This fact arises from the result established in Chapter 14 for the sampling distribution of D̄ = X̄1 − X̄2 as N(δ, v²), with δ as defined in Eq (18.10), and

v² = σ1²/n1 + σ2²/n2    (15.49)
Tests based on this statistic are known as two-sample z-tests, and as with previous tests, the specific results for testing H0 : µ1 − µ2 = δ0 are summarized in Table 15.4.
Let us illustrate the application of this test with the following example.
Example 15.5: COMPARISON OF SPECIALTY AUXILIARY BACKUP LAB BATTERY LIFETIMES
A company that manufactures specialty batteries used as auxiliary backups for sensitive laboratory equipment in need of constant power supplies claims that its new prototype, brand A, has a longer lifetime (under constant use) than the industry-leading brand B, at the same cost. Using accepted industry protocol, a series of tests carried out in an independent laboratory produced the following results. For brand A: sample size, n1 = 40; average lifetime, x̄1 = 647 hrs; with a population standard deviation given as σ1 = 27 hrs. The corresponding results for brand B are n2 = 40; x̄2 = 638; σ2 = 31. Determine, at the 5% level, if there is a significant difference between the observed mean lifetimes.

Solution:
Observe that in this case, δ0 = 0, i.e., the null hypothesis is that the two means are equal; the alternative is that µ1 > µ2, so that the hypotheses are formulated as:

H0 : µ1 − µ2 = 0
Ha : µ1 − µ2 > 0    (15.50)
The specific test statistic obtained from the experimental data is:

z = [(647 − 638) − 0] / √(27²/40 + 31²/40) = 1.38    (15.51)
For this one-tailed test, the critical value, z0.05, is 1.65; and now, since the computed z-score is not greater than 1.65, we cannot reject the null hypothesis. There is therefore insufficient evidence to support the rejection of H0 in favor of Ha, at the 5% significance level.
Alternatively, we could compute the p-value and obtain:

p = P(Z > 1.38) = 1 − P(Z < 1.38) = 1 − 0.916 = 0.084    (15.52)

Once again, since this p-value is greater than 0.05, we cannot reject H0 in favor of Ha, at the 5% significance level. (However, observe that at the 0.1 significance level, we will reject H0 in favor of Ha, since the p-value is less than 0.1.)
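A Python sketch of this two-sample z-test, from the summary statistics of Example 15.5 (variable names are illustrative):

import math
from scipy import stats

n1, x1, s1 = 40, 647.0, 27.0    # brand A
n2, x2, s2 = 40, 638.0, 31.0    # brand B
z = ((x1 - x2) - 0.0) / math.sqrt(s1**2 / n1 + s2**2 / n2)  # Eq (15.51); ~1.38
p = stats.norm.sf(z)                                        # upper-tail; ~0.084
print(f"z = {z:.2f}, p = {p:.3f}")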
15.4.2 Population Standard Deviations Unknown
In most practical cases, it is rare that the two population standard deviations are known. Under these circumstances, we can identify three distinct cases requiring different approaches:
1. σ1 and σ2 unknown; large sample sizes n1 and n2;
2. Small sample sizes; σ1 and σ2 unknown, but equal (i.e., σ1 = σ2);
3. Small sample sizes; σ1 and σ2 unknown, and unequal (i.e., σ1 ≠ σ2).

As usual, under the first set of conditions, the sample standard deviations, s1 and s2, are considered to be sufficiently good approximations to the respective unknown population parameters; they are then used in place of σ1 and σ2 in carrying out the two-sample z-test as outlined above. Nothing more need be said about this case. We will concentrate on the remaining two cases where the sample sizes are considered to be small.
Equal standard deviations
When the two population standard deviations are considered equal, the appropriate test statistic is:

T = [(X̄1 − X̄2) − δ0] / √(Sp²/n1 + Sp²/n2) ~ t(ν)    (15.53)

i.e., its sampling distribution is a t-distribution with ν degrees of freedom, with

ν = n1 + n2 − 2    (15.54)

Here, Sp² is the pooled variance obtained by combining the two sample variances:

Sp² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2)    (15.55)
TABLE 15.5: Summary of H0 rejection conditions for the two-sample t-test

Testing Against         Reject H0 if: (general α)
Ha : µ1 − µ2 < δ0       t < −tα(ν)
Ha : µ1 − µ2 > δ0       t > tα(ν)
Ha : µ1 − µ2 ≠ δ0       t < −tα/2(ν) or t > tα/2(ν)

(ν = n1 + n2 − 2)
To compare the two training methods of Example 15.4 directly, the hypothesis that the two mean scores are equal is to be tested against the alternative that the method A mean is lower, i.e.:

H0 : µA − µB = 0
Ha : µA − µB < 0    (15.56)
From the sample data, we obtain all the quantities required for computing the test statistic: the sample means, x̄A = 69.0, x̄B = 74.0; the sample standard deviations, sA = 4.85, sB = 5.40; so that the estimated pooled standard deviation is obtained as:

sp = 5.13

with ν = 18. To test the observed difference (d = 69.0 − 74.0 = −5.0) against a hypothesized difference of δ0 = 0 (i.e., equality of the means), we obtain the t-statistic as:

t = −2.18,

which is compared with the critical value for a t-distribution with 18 degrees of freedom,

−t0.05(18) = −1.73.

And since t < −t0.05(18), we reject the null hypothesis in favor of the alternative, and conclude that, at the 5% significance level, the evidence in the data supports the claim that Method B is more effective.
Note also that the associated p-value, obtained from a t-distribution with 18 degrees of freedom, is:

P(t(18) < −2.18) = 0.021    (15.57)
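The pooled two-sample t-test is equally direct to reproduce from the summary statistics; a minimal Python sketch:

import math
from scipy import stats

nA, xA, sA = 10, 69.0, 4.85
nB, xB, sB = 10, 74.0, 5.40
nu = nA + nB - 2                                             # 18
sp = math.sqrt(((nA - 1) * sA**2 + (nB - 1) * sB**2) / nu)   # ~5.13
t = (xA - xB) / (sp * math.sqrt(1 / nA + 1 / nB))            # ~ -2.18
p = stats.t.cdf(t, df=nu)                                    # lower-tailed; ~0.021
print(f"sp = {sp:.2f}, t = {t:.2f}, p = {p:.3f}")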
Using MINITAB
This just-concluded example illustrates the mechanics of how to conduct a two-sample t-test manually; once the mechanics are understood, however, it is recommended to use computer programs such as MINITAB.
As noted before, once the data sets have been entered into separate columns Method A and Method B in a MINITAB worksheet (as was the case in the previous Example 15.4), the required sequence from the MINITAB drop-down menu is: Stat > Basic Statistics > 2-Sample t, which opens a dialog box with self-explanatory options. Once the location of the relevant data is identified, the Assume equal variance box is selected in this case, and with the OPTIONS button, one selects the Alternative for Ha (less than, if the hypotheses are set up as we have done above), along with the default confidence level (95.0); one enters the value for the hypothesized difference, δ0, in the Test difference box (0 in this case). The resulting MINITAB outputs for this problem are displayed as follows:
Two-Sample T-Test and CI: Method A, Method B
Two-sample T for Method A vs Method B

           N   Mean  StDev  SE Mean
Method A  10  69.00   4.85      1.5
Method B  10  74.00   5.40      1.7
Unequal standard deviations
When the two population standard deviations cannot be considered equal, the appropriate test statistic is:

T = [(X̄1 − X̄2) − δ0] / √(S1²/n1 + S2²/n2) ~ t(n̂12)    (15.59)

with n̂12 defined by the formidable-looking expression

n̂12 = (S1²/n1 + S2²/n2)² / [ (S1²/n1)²/(n1 + 1) + (S2²/n2)²/(n2 + 1) ] − 2    (15.60)
In general, a (1 − α) × 100% confidence interval estimate of the difference between the two means that does not contain the hypothesized difference corresponds to a hypothesis test in which H0 is rejected, at the significance level of α, in favor of the alternative that the computed difference is not equal to the hypothesized difference. Note that with a test of equality (in which case δ0, the hypothesized difference, is 0), rejection of H0 is tantamount to the (1 − α) × 100% confidence interval for the difference not containing 0. On the contrary, an estimated (1 − α) × 100% confidence interval that contains the hypothesized difference is equivalent to a two-sample test that must fail to reject H0.
The corresponding arguments for the upper-tailed and lower-tailed tests follow precisely as presented earlier. For an upper-tailed test (Ha : δ > δ0), a lower bound of the (1 − α) × 100% confidence interval estimate of the difference, δ, that is larger than the hypothesized difference, δ0, corresponds to a two-sample test in which H0 is rejected in favor of Ha, at the significance level of α. Conversely, a lower bound of the confidence interval estimate of the difference, δ, that is smaller than the hypothesized difference, δ0, corresponds to a test that will not reject H0. The reverse is the case for the lower-tailed test (Ha : δ < δ0): when the upper bound of the (1 − α) × 100% confidence interval estimate of δ is smaller than δ0, H0 is rejected in favor of Ha. When the upper bound of the (1 − α) × 100% confidence interval estimate of δ is larger than δ0, H0 is not rejected.
An Illustrative Example: The Yield Improvement Problem
The solution to the yield improvement problem first posed in Chapter 1, and revisited at the beginning of this chapter, will finally be completed in this illustrative example. In addition, the example also illustrates the use of MINITAB to carry out a two-sample t-test when the population variances are not equal.
The following questions remain to be resolved: Is YA > YB, and if so, is YA − YB > 2? Having already confirmed that the random variables, YA and YB, can be characterized reasonably well with Gaussian distributions, N(µA, σA²) and N(µB, σB²), respectively, the supplied data may then be considered as being from normal distributions with unequal population variances. We will answer these two questions by carrying out appropriate two-sample t-tests.
Although the answer to the first of the two questions requires testing for the equality of µA and µB against the alternative that µA > µB, let us begin by first testing against µA ≠ µB; this establishes that the two distributions' means are different. Later we will test against the alternative that µA > µB, and thereby go beyond the mere existence of a difference between the population means to establish which is larger. Finally, we proceed even one step further to establish not only which one is larger, but that it is larger by a value that exceeds a certain postulated value (in this case 2).
For the first test of basic equality, the hypothesized difference is clearly
δ0 = 0, so that:

H0 : µA − µB = 0
Ha : µA − µB ≠ 0    (15.61)
The procedure for using MINITAB is as follows: upon entering the data into separate YA and YB columns in a MINITAB worksheet, the required sequence from the MINITAB drop-down menu is: Stat > Basic Statistics > 2-Sample t. In the opened dialog box, one simply selects the Samples in different columns option, identifies the columns corresponding to each data set, but this time, the Assume equal variance box must not be selected. With the OPTIONS button one selects the Alternative for Ha as not equal, along with the default confidence level (95.0); in the Test difference box, one enters the value for the hypothesized difference, δ0; 0 in this case. The resulting MINITAB outputs for this problem are displayed as follows:
Two-Sample T-Test and CI: YA, YB
Two-sample T for YA vs YB

     N   Mean  StDev  SE Mean
YA  50  75.52   1.43     0.20
YB  50  72.47   2.76     0.39

Difference = mu (YA) - mu (YB)
Estimate for difference: 3.047
95% CI for difference: (2.169, 3.924)
T-Test of difference = 0 (vs not =): T-Value = 6.92  P-Value = 0.000  DF = 73
Several points are worth noting here:
1. The most important is the p-value, which is virtually zero; the implication is that at the 0.05 significance level, we must reject the null hypothesis in favor of the alternative: the two population means are in fact different, i.e., the observed difference between the populations is not zero. Note also that the t-statistic value is 6.92, a truly extreme value for a distribution that is symmetrical about the value 0, and for which the density value, f(t), essentially vanishes (i.e., f(t) ≈ 0) for values of the t variate exceeding 4. The p-value is obtained as P(|T| > 6.92).
2. The estimated sample difference is 3.047, with a 95% confidence interval, (2.169, 3.924); since this interval does not contain the hypothesized difference δ0 = 0, the implication is that the test will reject H0, as indeed we have concluded in point #1 above;
3. Finally, even though there were 50 data entries each for YA and YB, the number of degrees of freedom associated with this test is obtained as 73. (See the expressions in Eqs (15.59) and (15.60) above.)
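For reference, the unequal-variance test can be sketched in Python from the summary statistics, using the degrees-of-freedom approximation of Eq (15.60); note that MINITAB's own Welch formula differs slightly, hence its DF = 73:

import math
from scipy import stats

nA, xA, sA = 50, 75.52, 1.43
nB, xB, sB = 50, 72.47, 2.76
vA, vB = sA**2 / nA, sB**2 / nB
t = (xA - xB) / math.sqrt(vA + vB)                               # ~6.9
nu = (vA + vB)**2 / (vA**2 / (nA + 1) + vB**2 / (nB + 1)) - 2    # ~74
p = 2 * stats.t.sf(abs(t), df=nu)                                # ~0.0000
print(f"t = {t:.2f}, df = {nu:.1f}, p = {p:.2g}")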
This first test has therefore established that the means of the YA and YB populations are different, at the 5% significance level. Next, we wish to determine which of these two different means is larger. To do this, the hypotheses to be tested are:

H0 : µA − µB = 0
Ha : µA − µB > 0    (15.62)
The resulting outputs from MINITAB are identical to what is shown above for the first test, except that the 95% CI for difference line is replaced with 95% lower bound for difference: 2.313, and the T-Test of difference = 0 (vs not =) is replaced with T-Test of difference = 0 (vs >). The t-value, p-value and DF remain the same.
Again, with a p-value that is virtually zero, the conclusion is that, at the 5% significance level, the null hypothesis must be rejected in favor of the alternative, which, this time, is specifically that µA is greater than µB. Note that the value 2.313, computed from the data as the 95% lower bound for the difference, is considerably higher than the hypothesized value of 0; i.e., the hypothesized δ0 = 0 lies well to the left of this lower bound for the difference. This is consistent with rejecting the null hypothesis in favor of the alternative, at the 5% significance level.
With the final test, we wish to sharpen the postulated difference a bit further. This time, we assert that µA is not only greater than µB; the former is in fact greater than the latter by a value that exceeds 2. The hypotheses are set up in this case as follows:

H0 : µA − µB = 2
Ha : µA − µB > 2    (15.63)
This time, in the MINITAB options, the new hypothesized difference is indicated as 2 in the Test difference box. The MINITAB results are displayed as follows:
Two-Sample T-Test and CI: YA, YB
Two-sample T for YA vs YB

     N   Mean  StDev  SE Mean
YA  50  75.52   1.43     0.20
YB  50  72.47   2.76     0.39

Difference = mu (YA) - mu (YB)
Estimate for difference: 3.047
95% lower bound for difference: 2.313
T-Test of difference = 2 (vs >): T-Value = 2.38  P-Value = 0.010  DF = 73
Note that the t-value is now 2.38 (reflecting the new hypothesized value of
δ0 = 2), with the immediate consequence that the p-value is now 0.01; not surprisingly, everything else remains the same as in the first test. Thus, at the 0.05 significance level, we reject the null hypothesis in favor of the alternative. Note also that the 95% lower bound for the difference is larger than the hypothesized difference of 2.
The conclusion is therefore that, with 95% confidence (or alternatively at a significance level of 0.05), the mean yield obtainable from the challenger process A is at least 2 points larger than that obtainable by the incumbent process B.
15.4.3 Paired Differences

When the two data sets to be compared occur naturally in pairs, for example as "before" and "after" measurements on the same set of experimental units, the appropriate approach is to work directly with the differences:

Di = X1i − X2i; i = 1, 2, . . . , n    (15.64)

The following data set, showing the weights of 20 patients before and after participating in a weight-loss program, provides an illustration.
TABLE 15.6: Patient weights before and after a weight-loss program

Patient #          1    2    3    4    5    6    7    8    9   10
Before Wt (lbs)  272  319  253  325  236  233  300  260  268  276
After Wt (lbs)   263  313  251  312  227  227  290  251  262  263

Patient #         11   12   13   14   15   16   17   18   19   20
Before Wt (lbs)  215  245  248  364  301  203  197  217  210  223
After Wt (lbs)   206  235  237  350  288  195  193  216  202  214
The quantities required for the hypothesis test are: the sample average of the differences,

D̄ = (∑ᵢ₌₁ⁿ Dᵢ)/n    (15.66)

(which is unbiased for δ, the true mean difference), and the sample variance of the differences,

SD² = ∑ᵢ₌₁ⁿ (Dᵢ − D̄)² / (n − 1)    (15.67)

Under these circumstances, the null hypothesis is defined as

H0 : δ = δ0    (15.68)

to be tested against the usual alternatives:

Lower-tailed Ha : δ < δ0    (15.69)

Upper-tailed Ha : δ > δ0    (15.70)

Two-tailed Ha : δ ≠ δ0    (15.71)

The appropriate test statistic is:

T = (D̄ − δ0) / (SD/√n)    (15.72)

which possesses a t-distribution with ν = n − 1 degrees of freedom, leading to the rejection conditions summarized in Table 15.7.
TABLE 15.7: Summary of H0 rejection conditions for the paired t-test

Testing Against    Reject H0 if: (general α)
Ha : δ < δ0        t < −tα(ν)
Ha : δ > δ0        t > tα(ν)
Ha : δ ≠ δ0        t < −tα/2(ν) or t > tα/2(ν)

(ν = n − 1)
Solution:
This problem requires determining whether the mean difference between the "before" and "after" weights for the 20 patients is significantly different from zero. The null and alternative hypotheses are:

H0 : δ = 0
Ha : δ ≠ 0    (15.73)
FIGURE 15.8: Box plot of differences between the "Before" and "After" weights, including a 95% confidence interval for the mean difference, and the hypothesized H0 point, δ0 = 0.
Since the associated p-value is far below the significance level of 0.05, we reject the null hypothesis and conclude that the weight-loss program was effective. The average weight loss of 8.4 lbs is therefore significantly different from zero, at the 5% significance level.
A box plot of the differences between the "before" and "after" weights is shown in Fig 15.8, which displays graphically that the null hypothesis should be rejected in favor of the alternative. Note how far the hypothesized value of 0 is from the 95% confidence interval for the mean weight difference.
Interestingly, if the pairing is ignored and the "before" and "after" weights are instead treated as two independent samples and subjected to an ordinary two-sample t-test, with hypotheses as in Eq (15.74), the test fails to detect any significant difference.
It is important to understand the sources of the failure in this last example. First, a box plot of the two data sets, shown in Fig 15.9, graphically illustrates why the two-sample t-test is entirely unable to detect the very real, and very significant, difference between the "before" and "after" weights. The variability within the samples is so high that it swamps out the difference between each pair, which is actually significant. But the most important reason is illustrated in Fig 15.10, which shows a plot of the "before" and "after" weights for each patient versus patient number, from which it is absolutely clear that the two sets of weights are almost perfectly correlated. Paired data are often not independent. Observe from the data (and from this graph) that, without exception, every single "before" weight is higher than the corresponding "after" weight. The issue is therefore not whether there is a weight loss; it is a question of how much. For this group of patients, however, this difference cannot be detected in the midst of the large amount of variability within each group ("before" or "after").
These are the primary reasons why the two-sample t-test failed miserably in identifying a differential that is quite significant. (As an exercise, the reader should obtain a scatter plot of the "before" weights versus the "after" weights to provide further graphical evidence of just how correlated the two weights are.)
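The contrast between the two analyses is easy to demonstrate; a Python sketch using the Table 15.6 data:

import numpy as np
from scipy import stats

before = np.array([272, 319, 253, 325, 236, 233, 300, 260, 268, 276,
                   215, 245, 248, 364, 301, 203, 197, 217, 210, 223], float)
after = np.array([263, 313, 251, 312, 227, 227, 290, 251, 262, 263,
                  206, 235, 237, 350, 288, 195, 193, 216, 202, 214], float)

t_paired, p_paired = stats.ttest_rel(before, after)   # paired t-test
t_indep, p_indep = stats.ttest_ind(before, after)     # naive two-sample t-test
print(f"paired:     t = {t_paired:.2f}, p = {p_paired:.2g}")
print(f"two-sample: t = {t_indep:.2f}, p = {p_indep:.2g}")

The paired test yields an overwhelmingly significant result, while the naive two-sample test does not, precisely because the within-group variability swamps the per-patient differences.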
FIGURE 15.9: Box plot of the "Before" and "After" weights, including individual data means. Notice the wide range of each data set.
FIGURE 15.10: A plot of the "Before" and "After" weights for each patient. Note how one data sequence is almost perfectly correlated with the other; in addition, note the relatively large variability intrinsic in each data set compared to the difference within each pair.
15.5 Determining β, Power, and Sample Size
Determining β, the Type II error risk, and hence (1 − β), the power of any hypothesis test, depends on whether the test is one- or two-sided. The same is also true of the complementary problem: the determination of experimental sample sizes required to achieve a certain pre-specified power. We begin our discussion of such issues with the one-sided test, specifically the upper-tailed test, with the null hypothesis as in Eq (15.16) and the alternative in Eq (15.18). The results for the lower-tailed and two-sided tests, which follow similarly, will be given without detailed derivations.
15.5.1 β and Power
To determine β (and hence power) for the upper-tailed test, it is not sufficient merely to state that µ > µ0; instead, one must specify a particular value for the alternative mean, say µa, so that:

Ha : µ = µa > µ0    (15.75)

is the alternative hypothesis. The Type II error risk is therefore the probability of failing to reject the null hypothesis when in truth the data came from the alternative distribution with mean µa (where, for the upper-tailed test, µa > µ0).
The difference between this alternative and the postulated null hypothesis distribution mean,

δ = µa − µ0    (15.76)

is the margin by which the null hypothesis is falsified in comparison to the alternative. As one might expect, the magnitude of δ will be a factor in how easy or difficult it is for the test to detect, amidst all the variability in the data, a difference between H0 and Ha, and therefore correctly reject H0 when it is false. (Equivalently, the magnitude of δ will also factor into the risk of incorrectly failing to reject H0 in favor of a true Ha.)
As shown earlier, if H0 is true, then the distribution of the sample mean, X̄, is N(µ0, σ²/n), so that the test statistic, Z, in Eq (15.20), possesses a standard normal distribution; i.e.,

Z = (X̄ − µ0)/(σ/√n) ~ N(0, 1)    (15.77)
FIGURE 15.11: Null and alternative hypothesis distributions for an upper-tailed test based on n = 25 observations, with population standard deviation σ = 4, where the true alternative mean, µa, exceeds the hypothesized one by δ = 2.0. The figure shows a z-shift of (δ√n)/σ = 2.5; and with reference to H0, the critical value z0.05 = 1.65. The area under the H0 curve to the right of the point z = 1.65 is α = 0.05, the significance level; the area under the dashed Ha curve to the left of the point z = 1.65 is β.
If, on the other hand, Ha is true, with µ = µa, then X̄ ~ N(µa, σ²/n), and the same test statistic instead possesses the distribution:

Z ~ N(δ√n/σ, 1)    (15.78)

a unit-variance normal distribution whose mean has been shifted from 0 by the amount:

δ√n/σ    (15.79)

This is the z-shift shown in Fig 15.11:

z-shift = δ√n/σ    (15.80)

The β-risk is the probability that, in spite of this shift, the test statistic still falls below the critical value zα,
from where we obtain the expression for the power of the test as:

(1 − β) = 1 − P(z < zα − δ√n/σ)    (15.81)

with the β-risk itself given by:

β = P(z < zα − δ√n/σ)    (15.82)

Thus, for the illustrative example test given above, based on 25 observations, with σ = 4 and µa − µ0 = δ = 2.0, the β-risk and power are obtained as:

β = P(z < 1.65 − 2.5) = P(z < −0.85) = 0.198    (15.83)

Power = (1 − β) = 0.802    (15.84)
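These β and power computations are easily scripted; a minimal Python sketch for the upper-tailed z-test:

import math
from scipy import stats

n, sigma, delta, alpha = 25, 4.0, 2.0, 0.05
z_alpha = stats.norm.ppf(1 - alpha)          # ~1.645
z_shift = delta * math.sqrt(n) / sigma       # 2.5
beta = stats.norm.cdf(z_alpha - z_shift)     # ~0.196 (0.198 with 1.65 rounded)
print(f"beta = {beta:.3f}, power = {1 - beta:.3f}")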
15.5.2 Sample Size
In the same way in which zα was defined earlier, let zβ be the standard normal variate such that

P(z > zβ) = β    (15.85)

so that, by symmetry,

P(z < −zβ) = β    (15.86)

Then, from Eqs (15.82) and (15.86) we obtain:

−zβ = zα − δ√n/σ    (15.87)

or

zα + zβ = δ√n/σ    (15.88)
which relates the α- and β-risk variates to the three hypothesis test characteristics: δ, the hypothesized mean shift to be detected by the test (the "signal"); σ, the population standard deviation, a measure of the variability inherent in the data (the "noise"); and finally, n, the number of experimental observations to be used to carry out the hypothesis test (the sample size). (Note that these three terms comprise what we earlier referred to as the z-shift, the precise amount by which the standardized Ha distribution has been shifted away from the H0 distribution: see Fig 15.11.)
FIGURE 15.12: β and power values for the hypothesis test of Fig 15.11 with Ha: N(2.5, 1). Top: β = 0.198; Bottom: Power = (1 − β) = 0.802.
An alternative route to the same result is through x̄C, the critical value of the sample average itself, defined (for the upper-tailed test) such that:

zα = (x̄C − µ0)/(σ/√n)    (15.89)

so that:

x̄C = zα σ/√n + µ0    (15.90)

By definition of β, under Ha,

β = P(z < (x̄C − µa)/(σ/√n))    (15.91)

and from the definition of the zβ variate in Eq (15.86), we obtain:

−zβ = (x̄C − µa)/(σ/√n)    (15.92)

Substituting Eq (15.90) for x̄C, with δ = µa − µ0, gives −zβ = zα − δ√n/σ, or

zα + zβ = δ√n/σ    (15.93)

exactly as obtained in Eq (15.88).
3. The only way to reduce either risk simultaneously (which will require increasing the total sum of the risk variates) is by increasing the z-shift. This is achievable most directly by increasing n, the sample size, since neither σ, the population standard deviation, nor δ, the hypothesized mean shift to be detected by the test, is usually under the direct control of the experimenter.

This last point leads directly to the issue of determining how many experimental samples are required to attain a certain power, given basic test characteristics. This question is answered by solving Eq (15.88) explicitly for n to obtain:

n = [(zα + zβ)σ/δ]²    (15.94)

Thus, by specifying the desired α- and β-risks along with the test characteristics, δ, the hypothesized mean shift to be detected by the test, and σ, the population standard deviation, one can use Eq (15.94) to determine the sample size required to achieve the desired risk levels. In particular, it is customary to specify the risks as α = 0.05 and β = 0.10, in which case, zα = z0.05 = 1.645, and zβ = z0.10 = 1.28. Eq (15.94) then reduces to:

n = (2.925 σ/δ)²    (15.95)
For the illustrative example considered above, with σ = 4 and δ = 2, Eq (15.95) yields:

n = (2.925 × 4/2)² = 34.2, so that n = 35    (15.96)
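Eq (15.94) translates directly into a one-line computation; a Python sketch with the customary risk specifications:

import math
from scipy import stats

alpha, beta = 0.05, 0.10
delta, sigma = 2.0, 4.0
z_a = stats.norm.ppf(1 - alpha)                # 1.645
z_b = stats.norm.ppf(1 - beta)                 # 1.28
n = ((z_a + z_b) * sigma / delta) ** 2         # Eq (15.94); ~34.2
print(f"n = {n:.1f}; use n = {math.ceil(n)}")  # round up to 35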
Practical Considerations
In practice, prior to performing the actual hypothesis test, no one knows whether or not Ha is true compared to H0, much less the precise amount by which µa will exceed the postulated µ0 if Ha turns out to be true. The implication therefore is that δ is never known in an objective fashion a priori. In determining the power of a hypothesis test, therefore, δ is treated not as known but as a design parameter: the minimum difference we would like to detect, if such a difference exists. Thus, δ is to be considered properly as the magnitude of the smallest difference we wish to detect with the hypothesis test.
In a somewhat related vein, the population standard deviation, σ, is rarely known a priori in many practical cases. Under these circumstances, it has often been recommended to use educated guesses, or results from prior experiments under similar circumstances, to provide pragmatic surrogates for σ. We strongly recommend an alternative approach: casting the problem in terms of the signal-to-noise ratio (SNR):

SN = δ/σ    (15.97)

in which case the sample size expression becomes:

n = [(zα + zβ)/SN]² = (2.925/SN)²    (15.98)

with the last equality holding for α = 0.05 and β = 0.10.
(15.98)
598
Random Phenomena
TABLE 15.8:
in Fig 15.11 and Example 15.8, SN = 2/4 = 0.5; from this Table 15.8,
the required sample size, 35, is precisely as obtained in Example 15.8.
15.5.3 β and Power for Lower-Tailed and Two-Sided Tests
For the sake of clarity, the preceding discussion was specifically restricted to the upper-tailed test. Now that we have presented and illustrated the essential concepts, it is relatively straightforward to extend them to other types of tests without having to repeat the details.
First, because the sampling distribution for the test statistic employed for these hypothesis tests is symmetric, it is easy to see that with the lower-tailed alternative

Ha : µ = µa < µ0    (15.99)

this time, with

δ = µ0 − µa    (15.100)

the β-risk is:

β = P(z > −zα + δ√n/σ)    (15.101)

and the resulting sample size expression is
the equivalent of Eq (15.94). For the two-sided test, when P(z < −zα/2 − δ√n/σ) is negligible, the approximation

n ≈ [(zα/2 + zβ)σ/δ]²    (15.103)

holds.
15.5.4 Power and Sample Size Determination with MINITAB
Alpha = 0.05  Assumed standard deviation = 4

            Sample
Difference    Size     Power
         2      25  0.803765

This computed power value is what we had obtained earlier.
(2) When the power is specified and the sample size removed, the MINITAB result is:

Power and Sample Size
1-Sample Z Test
Testing mean = null (versus > null)
Calculating power for mean = null + difference
Alpha = 0.05  Assumed standard deviation = 4

            Sample  Target
Difference    Size   Power  Actual Power
         2      35     0.9      0.905440

This is exactly the same sample size value and the same actual power value we had obtained earlier.
(3) With n specified as 35 and the difference unspecified, the MINITAB result is:

Power and Sample Size
1-Sample Z Test
Testing mean = null (versus > null)
Calculating power for mean = null + difference
Alpha = 0.05  Assumed standard deviation = 4

Sample  Target
  Size   Power  Difference
    35     0.9     1.97861

The implication is that any difference greater than 1.98 can be detected at the desired power. A difference of 2.0 is therefore detectable at a power that is at least 0.9.
These results are all consistent with what we had obtained earlier.
15.6 Concerning Population Variances
TABLE 15.9: Summary of H0 rejection conditions for the χ²-test

Testing Against    Reject H0 if: (general α)
Ha : σ² < σ0²      c² < χ²₁₋α(n − 1)
Ha : σ² > σ0²      c² > χ²α(n − 1)
Ha : σ² ≠ σ0²      c² < χ²₁₋α/2(n − 1) or c² > χ²α/2(n − 1)
15.6.1 Single Variance

For testing a hypothesis concerning the variance of a single normal population, postulated under H0 as σ² = σ0², the appropriate test statistic is:

C² = (n − 1)S²/σ0²    (15.105)

which, when H0 is true, possesses a χ²(n − 1) distribution; the resulting rejection conditions are summarized in Table 15.9.
FIGURE 15.13: Rejection regions for one-sided tests of a single variance of a normal population, at a significance level of α = 0.05, based on n = 10 samples. The distribution is χ²(9). Top: for Ha: σ² > σ0², indicating rejection of H0 if c² > χ²α(9) = 16.9; Bottom: for Ha: σ² < σ0², indicating rejection of H0 if c² < χ²₁₋α(9) = 3.33.
For the process A yield data, with the postulated population standard deviation σA = 1.5, the hypotheses to be tested are:

H0 : σA² = 1.5²
Ha : σA² ≠ 1.5²    (15.106)

From the n = 50 data points, the sample variance is obtained as sA² = 2.05, so that the specific test statistic value is:

c² = (49 × 2.05)/2.25 = 44.63    (15.107)
The rejection region for this two-sided test, with α = 0.05, is shown in Fig 15.14, for a χ²(49) distribution. The boundaries of the rejection region are obtained from the usual cumulative probabilities; the left boundary is obtained by finding χ²₁₋α/2 such that

P(c² > χ²₁₋α/2(49)) = 0.975, or P(c² < χ²₁₋α/2(49)) = 0.025; i.e., χ²₁₋α/2 = 31.6    (15.108)

and the right boundary by finding χ²α/2 such that

P(c² > χ²α/2(49)) = 0.025, or P(c² < χ²α/2(49)) = 0.975; i.e., χ²α/2 = 70.2    (15.109)

Since the value obtained for c² above does not fall into this rejection region, we do not reject the null hypothesis.
As before, MINITAB could be used directly to carry out this test. The self-explanatory procedure follows along the same lines as those discussed extensively above.
The conclusion: at the 5% significance level, we cannot reject the null hypothesis concerning σA².
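A sketch of this two-sided single-variance test in Python:

from scipy import stats

n, s_sq, sigma0_sq, alpha = 50, 2.05, 1.5**2, 0.05
c2 = (n - 1) * s_sq / sigma0_sq                   # Eq (15.107); ~44.6
lo = stats.chi2.ppf(alpha / 2, df=n - 1)          # ~31.6
hi = stats.chi2.ppf(1 - alpha / 2, df=n - 1)      # ~70.2
reject = (c2 < lo) or (c2 > hi)
print(f"c2 = {c2:.2f}; reject H0: {reject}")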
15.6.2 Two Variances

For comparing the variances of two normal populations, the null hypothesis is:

H0 : σ1² = σ2²    (15.110)

and the appropriate test statistic is the ratio of the two sample variances:

F = S1²/S2²    (15.111)

which, when H0 is true, possesses an F(ν1, ν2) distribution, with ν1 = n1 − 1 and ν2 = n2 − 1 degrees of freedom.
FIGURE 15.14: Rejection regions for the two-sided test concerning the variance of the process A yield data, H0: σA² = 1.5², based on n = 50 samples, at a significance level of α = 0.05. The distribution is χ²(49), with the rejection region shaded; because the test statistic, c² = 44.63, falls outside of the rejection region, we do not reject H0.
In determining the rejection regions, the following property of the F-distribution is useful for obtaining lower-tail critical values:

F₁₋α(ν1, ν2) = 1/Fα(ν2, ν1)    (15.112)
TABLE 15.10: Summary of H0 rejection conditions for the F-test

Testing Against    Reject H0 if: (general α)
Ha : σ1² < σ2²     f < F₁₋α(ν1, ν2)
Ha : σ1² > σ2²     f > Fα(ν1, ν2)
Ha : σ1² ≠ σ2²     f < F₁₋α/2(ν1, ν2) or f > Fα/2(ν1, ν2)
For the process A and process B yield data, testing the equality of the two population variances involves the hypotheses:

H0 : σA² = σB²
Ha : σA² ≠ σB²    (15.113)
From the supplied data, we obtain sA² = 2.05 and sB² = 7.62, so that the specific value for the F-test statistic is obtained as:

f = 2.05/7.62 = 0.27    (15.114)

The rejection region for this two-sided F-test, with α = 0.05, is shown in Fig 15.15, for an F(49, 49) distribution, with boundaries at f = 0.567 to the left and 1.76 to the right, obtained as usual from cumulative probabilities. (Note that the value of f at one boundary is the reciprocal of the value at the other boundary.) Since the specific test value, 0.27, falls in the left side of the rejection region, we must therefore reject the null hypothesis in favor of the alternative that these two variances are unequal.
The self-explanatory procedure for carrying out the test in MINITAB generates results that include a p-value of 0.000, in agreement with the conclusion above to reject the null hypothesis at the 5% significance level.
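A Python sketch of this two-sided F-test:

from scipy import stats

n1, s1_sq = 50, 2.05
n2, s2_sq = 50, 7.62
f = s1_sq / s2_sq                                  # Eq (15.114); ~0.27
lo = stats.f.ppf(0.025, dfn=n1 - 1, dfd=n2 - 1)    # ~0.567
hi = stats.f.ppf(0.975, dfn=n1 - 1, dfd=n2 - 1)    # ~1.76
p = 2 * min(stats.f.cdf(f, n1 - 1, n2 - 1), stats.f.sf(f, n1 - 1, n2 - 1))
print(f"f = {f:.2f}; reject H0: {f < lo or f > hi}; p = {p:.4f}")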
The F-test is particularly useful for ascertaining whether or not the assumption of equality of variances is valid before performing a two-sample t-test. If the null hypothesis regarding the equality assumption is rejected, then one must not use the equal variance option of the test. If one is unable to reject the null hypothesis, one may proceed to use the equal variance option. As discussed in subsequent chapters, the F-test is also at the heart of ANOVA (ANalysis Of VAriance), a methodology that is central to much of statistical design of experiments and the systematic analysis of the resulting data, to statistical tests involving several means, and even to regression analysis.
Finally, we note that the F-test is quite sensitive to the normality assumption: if this assumption is invalid, the test results will be unreliable. Note that the assumption of normality is not about the mean of the data but about the raw data set itself. One must therefore be careful to ensure that this normality assumption is reasonable before carrying out an F-test. If the data are from non-normal distributions, most computer programs provide alternatives (based on non-parametric methods).

FIGURE 15.15: Rejection regions for the two-sided test of the equality of the variances of the process A and process B yield data, i.e., H0: σA² = σB², at a significance level of α = 0.05, based on n = 50 samples each. The distribution is F(49, 49), with the rejection region shaded; since the test statistic, f = 0.27, falls within the rejection region to the left, we reject H0 in favor of Ha.
15.7 Concerning Proportions
15.7.1 Single Population Proportion

Concerning the proportion, p, of a single binomial population, the null hypothesis,

H0 : p = p0    (15.115)

is to be tested against one of the alternatives:

Ha : p < p0    (15.116)

Ha : p > p0    (15.117)

and

Ha : p ≠ p0    (15.118)

To determine an appropriate test statistic and its sampling distribution, we need to recall several characteristics of the binomial random variable from Chapter 8. First, the estimator, P̂, defined as:

P̂ = X/n    (15.119)

the mean number of "successes," is unbiased for the binomial population parameter; the mean of the sampling distribution of P̂ is therefore p.
TABLE 15.11: Summary of H0 rejection conditions for the single-proportion z-test

Testing Against    Reject H0 if: (general α)     Reject H0 if: (α = 0.05)
Ha : p < p0        z < −zα                       z < −1.65
Ha : p > p0        z > zα                        z > 1.65
Ha : p ≠ p0        z < −zα/2 or z > zα/2         z < −1.96 or z > 1.96
Next, the variance of P̂ is σX²/n², where

σX² = npq = np(1 − p)    (15.120)

so that:

σ²(P̂) = p(1 − p)/n    (15.121)

For large sample sizes, therefore, the sampling distribution of P̂ is approximately normal (by the Central Limit Theorem), so that, when H0 is true, the statistic:

Z = (P̂ − p0) / √(p0(1 − p0)/n)    (15.123)

is a test statistic with precisely the same properties as those used for the standard z-test. The rejection conditions are identical to those shown in Table 15.2, which, when modified appropriately for the one-proportion test, is as shown in Table 15.11.
Since this test is predicated upon the sample being sufficiently large, it is important to ensure that this is indeed the case. A generally agreed upon objective criterion for ascertaining the validity of this approximation is that the interval

I0 = p0 ± 3√(p0(1 − p0)/n)    (15.124)

does not include 0 or 1. The next example illustrates these concepts.
In one such example, a sample of n = 100 items yielded x = 75 "successes," for a sample proportion of p̂ = 0.75; the postulated population proportion, p0 = 0.8, is to be tested against the alternative that it is not 0.8, i.e.:

H0 : p = 0.8
Ha : p ≠ 0.8    (15.125)

The specific test statistic value, from Eq (15.123), is:

z = (0.75 − 0.8) / √((0.8 × 0.2)/100) = −1.25    (15.126)

Since this value does not lie in the two-sided rejection region for α = 0.05, we do not reject the null hypothesis.
Exact Tests
Even though it is customary to invoke the normal approximation in dealing with tests for single proportions, this is in fact not necessary. The reason is quite simple: if X ~ Bi(n, p), then the distribution of P̂ = X/n follows directly from the Bi(n, p) distribution. This fact can be used to compute the probability that P̂ = p0, or any other value, providing the means for determining the boundaries of the various rejection regions (given desired tail area probabilities), just as with the standard normal distribution, or any other standardized test distribution. Computer programs such as MINITAB provide options for obtaining exact p-values for the single proportion test that are based on exact binomial distributions.
When MINITAB is used to carry out the test in Example 15.13 above, this time without invoking the normal approximation option, the result is as follows:
Test and CI for One Proportion
Test of p = 0.8 vs p not = 0.8

                                                Exact
Sample   X    N  Sample p              95% CI  P-Value
     1  75  100  0.750000  (0.653448, 0.831220)  0.260
The 95% confidence interval, which is now based on a binomial distribution, not a normal approximation, is slightly different; the p-value is also now slightly different, but the conclusion remains the same.
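Both the normal-approximation test and the exact binomial test are available in Python; a sketch (scipy.stats.binomtest requires SciPy 1.7 or later):

import math
from scipy import stats

x, n, p0 = 75, 100, 0.8
z = (x / n - p0) / math.sqrt(p0 * (1 - p0) / n)   # Eq (15.126); -1.25
p_approx = 2 * stats.norm.sf(abs(z))              # ~0.21
exact = stats.binomtest(x, n, p0)                 # exact test; p ~0.26
print(f"z = {z:.2f}, approx p = {p_approx:.3f}, exact p = {exact.pvalue:.3f}")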
15.7.2 Two Population Proportions

Comparing the proportions, p1 and p2, of two distinct binomial populations rests on the difference between the two sample proportions:

D̂ = P̂1 − P̂2    (15.127)
where P̂1 = X1/n1 and P̂2 = X2/n2 are, respectively, the random proportions of "successes" obtained from population 1 and population 2, based on samples of respective sizes n1 and n2. For example, P̂1 could be the fraction of defective chips in a sample of n1 chips manufactured at one facility whose true proportion of defectives is p1, while P̂2 is the defective fraction contained in a sample from a different facility. The difference between the two population proportions is postulated as some value δ0 that need not be zero.
As usual, the hypothesis is to be tested against the possible alternatives:

Lower-tailed Ha : p1 − p2 < δ0    (15.128)

Upper-tailed Ha : p1 − p2 > δ0    (15.129)

Two-tailed Ha : p1 − p2 ≠ δ0    (15.130)
The mean and variance of the sampling distribution of D̂ are:

µ(D̂) = p1 − p2    (15.132)

σ²(D̂) = p1q1/n1 + p2q2/n2    (15.133)

(with q1 = 1 − p1 and q2 = 1 − p2). But now, if the sample sizes n1 and n2 are large, then it can be shown that

D̂ ~ N(µ(D̂), σ²(D̂))    (15.134)

again allowing us to invoke the normal approximation (for large sample sizes). This immediately implies that the following is an appropriate test statistic to use for this two-proportion test:

Z = [(P̂1 − P̂2) − δ0] / √(p1q1/n1 + p2q2/n2) ~ N(0, 1)    (15.135)

In practice, the population proportions appearing in the denominator are replaced by their sample estimates:

p̂1 = x1/n1; and p̂2 = x2/n2    (15.136)
Finally, since this test statistic possesses a standard normal distribution, the rejection regions are precisely the same as those in Table 15.4.
In the special case when δ0 = 0, which is equivalent to a test of equality of the proportions, the most important consequence is that if the null hypothesis is true, then p1 = p2 = p, which is then estimated by the pooled proportion:

p̂ = (x1 + x2)/(n1 + n2)    (15.137)

with q̂ = 1 − p̂, so that the appropriate test statistic becomes:

Z = (P̂1 − P̂2) / √[p̂q̂(1/n1 + 1/n2)]    (15.139)
Example 15.14: REGIONAL PREFERENCE FOR PEPSI
To confirm persistent rumors that the preference for PEPSI on engineering college campuses is higher in the Northeast of the United States than on comparable campuses in the Southeast, a survey was carried out on 125 engineering students chosen at random on the MIT campus in Cambridge, MA, and the same number of engineering students selected at random at Georgia Tech in Atlanta, GA. Each student was asked to indicate a preference for PEPSI versus other soft drinks, with the following results: 44 of the 125 at MIT indicated a preference for PEPSI versus 26 at GA Tech. At the 5% level, determine whether the Northeast proportion, p̂1 = 0.352, is essentially the same as the Southeast proportion, p̂2 = 0.208, against the alternative that they are different.

Solution:
The hypotheses to be tested are:

H0 : p1 − p2 = 0
Ha : p1 − p2 ≠ 0    (15.140)

and from the given data, the test statistic computed from Eq (15.139) is z = 2.54. Since this number is greater than 1.96, and therefore lies in the rejection region of the two-sided test, we reject the null hypothesis in favor of the alternative. Using MINITAB to carry out this test, selecting the "use pooled estimate of p for test" option, produces the following result:
Test and CI for Two Proportions

Sample   X    N  Sample p
     1  44  125  0.352000
     2  26  125  0.208000
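The pooled two-proportion z-test of this example can be sketched in Python as follows:

import math
from scipy import stats

x1, n1 = 44, 125    # MIT
x2, n2 = 26, 125    # Georgia Tech
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                               # Eq (15.137)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se                                           # Eq (15.139); ~2.54
p_val = 2 * stats.norm.sf(abs(z))                            # ~0.011
print(f"z = {z:.2f}, p = {p_val:.3f}")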
15.8 Concerning Non-Gaussian Populations
The discussion in the previous section has opened up the issue of testing hypotheses about non-Gaussian populations, and has provided a strategy for handling such problems in general. The central issue is finding an appropriate test statistic and its sampling distribution, as was done for the binomial distribution. This cause is advanced greatly by the relationship between interval estimates and hypothesis tests (discussed earlier in Section 15.3.3) and by the discussion at the end of Chapter 14 on interval estimates for non-Gaussian distributions.
15.8.1 Large-Sample Hypothesis Tests
First, if the statistical hypothesis is about the mean of a non-Gaussian population, so long as the sample size, n, used to compute the sample average, X̄, is reasonably large (e.g., n > 30 or so), then, regardless of the underlying distribution, we know that the statistic Z = (X̄ − µ)/σ(X̄) possesses an approximate standard normal distribution, an approximation that improves as n → ∞. Thus, hypotheses about the means of non-Gaussian populations that are based on large sample sizes are essentially the same as z-tests.
Example 15.15: HYPOTHESIS TEST ON THE MEAN OF INCLUSIONS DATA
If the data in Table 1.2 is considered a random sample of 60 observations of the number of inclusions found on glass sheets produced in the manufacturing process discussed in Chapter 1, test, at the 5% significance level, the hypothesis that this data came from a Poisson population with mean λ = 1, against the alternative that λ is not 1.

Solution:
The hypotheses to be tested are:

H0 : λ = 1
Ha : λ ≠ 1    (15.141)

While the data is from a Poisson population, the sample size is large; hence, the test statistic:

Z = (X̄ − λ0) / (σ/√60)    (15.142)

where σ is the standard deviation of the raw data (so that σ/√60 is the standard deviation of the sample average), essentially has a standard normal distribution.
From the supplied data, we obtain the sample average as x̄ = 1.02, with the sample standard deviation, s = 1.1, which, because of the large sample, will be considered a reasonable approximation of σ. The test statistic is therefore obtained as z = 0.141. Since this value is not in the two-sided rejection region |z| > 1.96 for α = 0.05, we do not reject the null hypothesis. We therefore conclude that there is no evidence to contradict the statement that X ~ P(1), i.e., the inclusions data is from a Poisson population with mean number of inclusions λ = 1.
It is now important to recall the results of Example 14.13, where the 95% confidence interval estimate for the mean of the inclusions data was obtained as:

λ = 1.02 ± 1.96(1.1/√60) = 1.02 ± 0.28    (15.143)

i.e., 0.74 < λ < 1.30. Note that this interval contains the hypothesized value λ = 1.0, indicating that we cannot reject the null hypothesis.
We can now use this result to answer the following question raised in Chapter 1 as a result of the potentially disturbing data obtained from the quality control lab, apparently indicating too many glass sheets with too many inclusions: if the process was designed to produce glass sheets with a mean number of inclusions λ = 1 per m², is there evidence in this sample data that the process has changed, that the number of observed inclusions is significantly different from what one can reasonably expect from the process when operating as designed?
From the results of this example, the answer is No: at the 5% significance level, there is no evidence that the process has deviated from its design target.
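A sketch of this large-sample test in Python, from the summary statistics:

import math
from scipy import stats

n, xbar, s, lam0 = 60, 1.02, 1.1, 1.0
z = (xbar - lam0) / (s / math.sqrt(n))            # Eq (15.142); ~0.141
p = 2 * stats.norm.sf(abs(z))                     # ~0.89
half = 1.96 * s / math.sqrt(n)
print(f"z = {z:.3f}, p = {p:.2f}, 95% CI = ({xbar - half:.2f}, {xbar + half:.2f})")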
15.8.2 Small-Sample Hypothesis Tests
When the sample size on which the sample average is based is small, or when we are dealing with aspects of the population other than the mean (say, the variance), we are left with only one option: go back to first principles, derive the sampling distribution for the appropriate statistic, and use it to carry out the required test. One can use the sampling distribution to determine the α × 100% rejection regions or, for the complementary approach, the (1 − α) × 100% confidence interval estimates for the appropriate parameter.
For tests involving single parameters, it makes no difference which of these two approaches we choose; for tests involving two parameters, however, it is more straightforward to compute confidence intervals for the parameters in question and then use these for the hypothesis test. The reason is that for tests involving two parameters, confidence intervals can be computed directly from the individual sampling distributions; on the other hand, computing rejection regions for the difference between these two parameters technically requires an additional step of deriving yet another sampling distribution for the difference. And the sampling distribution of the difference between two random variables may not always be easy to derive. Having discussed earlier in this chapter the equivalence between confidence intervals and hypothesis tests, we now note that for non-Gaussian problems, one might as well just base the hypothesis tests on (1 − α) × 100% confidence intervals and avoid the additional hassle of having to derive distributions for differences. Let us illustrate this concept with a problem involving the exponential random variable discussed in Chapter 14.
In Example 14.3, we presented a problem involving an exponential random variable, the waiting time (in days) until the occurrence of a recordable safety incident in a certain company's manufacturing site. The safety performance data for the first and second years were presented, from which point estimates of the unknown population parameter, β, were determined from the sample averages, x̄1 = 30.1 days for Year 1, and x̄2 = 32.9 days for Year 2; the sample size in each case is n = 10, which is considered small.
To test the two-sided hypothesis that these two safety performance parameters (Year 1 versus Year 2) are the same, versus the alternative that they are significantly different (at the 5% significance level), we proceed as follows: we first obtain the sampling distribution for X̄1 and X̄2, given that X ~ E(β); we then use these to obtain 95% confidence interval estimates for the population means βi for Year i; if these intervals overlap, then at the 5% significance level, we cannot reject the null hypothesis that these means are the same; if the intervals do not overlap, we reject the null hypothesis.
Much of this, of course, was already accomplished in Example 14.14: we showed that X̄i has a gamma distribution; more specifically, X̄i/βi ~ γ(n, 1/n), from where we obtain 95% confidence interval estimates for βi from sample data. In particular, for n = 10, we obtained from the Gamma(10, 0.1) distribution that:

P(0.48 < X̄/β < 1.71) = 0.95    (15.144)

so that the 95% confidence intervals, x̄/1.71 < β < x̄/0.48, are:

17.6 < β1 < 62.7    (15.145)

19.2 < β2 < 68.5    (15.146)

Because these two intervals overlap substantially, we cannot reject, at the 5% significance level, the null hypothesis that the two safety performance parameters are the same.
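The quantiles 0.48 and 1.71 and the resulting intervals can be verified with a short computation. A sketch, assuming scipy is available (small differences in the last digit arise from rounding the quantiles):

```python
# Exact 95% confidence intervals for the mean of an exponential population,
# via the sampling distribution Xbar/beta ~ Gamma(n, 1/n) used above.
from scipy import stats

n = 10
g = stats.gamma(a=n, scale=1.0 / n)   # Gamma(10, 0.1)
lo, hi = g.ppf(0.025), g.ppf(0.975)   # approx. 0.48 and 1.71

for year, xbar in [(1, 30.1), (2, 32.9)]:
    print(f"Year {year}: {xbar/hi:.1f} < beta < {xbar/lo:.1f}")
```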
15.9 Likelihood Ratio Tests
In its broadest sense, a likelihood ratio (LR) test is a technique for assessing how well a simpler, restricted version of a probability model compares to its more complex, unrestricted version in explaining observed data. Within the context of this current chapter, however, the discussion here will be limited to testing hypotheses about the parameters, θ, of a population characterized by the pdf f(x, θ). Even though based on fundamentally different premises, some of the most popular tests considered above (the z- and t-tests, for example) are equivalent to LR tests under recognizable conditions.
15.9.1 General Principles

Let X be a random variable with the pdf f(x, θ), where the population parameter vector θ ∈ Θ; i.e., Θ represents the set of possible values that the parameter vector θ can take. Given a random sample, X1, X2, ..., Xn, estimation theory, as discussed in Chapter 14, is concerned with using such sample information to determine reasonable estimates for θ. In particular, we recall that the maximum likelihood (ML) principle requires choosing the estimate, θ̂_ML, as the value of θ that maximizes the likelihood function:

L(θ) = f(x1, θ)f(x2, θ) ··· f(xn, θ)    (15.147)

the joint pdf of the random sample, treated as a function of the unknown population parameter.

The same random sample and the same ML principle can be used to test the null hypothesis

H0: θ ∈ Θ0    (15.148)
where Θ0 ⊂ Θ. For example, for a Gaussian N(μ, σ²) population with known standard deviation σ = 1.5, a test of H0: μ = 75 involves the full parameter set:

Θ = {(θ1, θ2) : −∞ < θ1 = μ < ∞; θ2 = σ² = 1.5²}    (15.149)

since the variance is given and the only unknown parameter is the mean; Θ0, the restricted parameter set range over which H0 is conjectured to be valid, is defined as:

Θ0 = {(θ1, θ2) : θ1 = μ0 = 75; θ2 = σ² = 1.5²}    (15.150)

The alternative hypothesis is:

Ha: θ ∈ Θa    (15.151)

again stated in a general fashion in which the parameter set, Θa, is (a) disjoint from Θ0, and (b) also complementary to it, in the sense that:

Θ = Θ0 ∪ Θa    (15.152)

For the specific example given above,

Θa = {(θ1, θ2) : θ1 = μ ≠ 75; θ2 = σ² = 1.5²}    (15.153)

Note that the union of this set with Θ0 in Eq (15.150) is the full parameter set range, Θ, in Eq (15.149).
Now, define the largest likelihood under H0 as

L*(Θ0) = max_{θ∈Θ0} L(θ)    (15.154)

and the largest likelihood over the entire parameter set as

L*(Θ) = max_{θ∈Θ0∪Θa} L(θ)    (15.155)

The likelihood ratio test statistic is then defined as:

Λ = L*(Θ0)/L*(Θ)    (15.156)

By definition, 0 < Λ < 1: a value of Λ close to 1 indicates that the data are explained nearly as well under the restriction imposed by H0 as without it, while a value close to zero argues against H0. The test therefore rejects H0 for small values of Λ; at the significance level α, the critical value λc is chosen such that P(Λ ≤ λc | H0) = α. Any value of Λ less than this critical value will trigger rejection of H0.
Likelihood ratio tests are very general; they can be used even for cases involving structurally different H0 and Ha probability distributions, or for random variables that are correlated. While the form of the pdf for Λ that is appropriate for each case may be quite complicated, in general, it is always possible to perform the required computations numerically using computer programs. Nevertheless, there are many special cases for which closed-form analytical expressions can be derived directly, either for f(λ | θ ∈ Θ0), the pdf of Λ itself, or else for the pdf of a monotonic function of Λ. See Pottmann et al. (2005)², for an application of the likelihood ratio test to an industrial sensor data analysis problem.
2 Pottmann, M., B. A. Ogunnaike, and J. S. Schwaber, (2005). "Development and Implementation of a High-Performance Sensor System for an Industrial Polymer Reactor," Ind. Eng. Chem. Res., 44, 2606-2620.
15.9.2 Special Cases
Gaussian Distribution with Known Variance

Consider first a random sample X1, X2, ..., Xn from a Gaussian N(μ, σ²) population with σ² known, for which the hypotheses are:

H0: μ = μ0;  Ha: μ ≠ μ0    (15.159)

In this case, the likelihood function is:

L(μ, σ²) = (1/(2πσ²))^{n/2} exp[−(1/(2σ²)) Σ_{i=1}^{n} (xi − μ)²]    (15.160)

The unrestricted maximum, L*(Θ), is obtained by introducing the ML estimator μ̂ = X̄ for μ:

L*(Θ) = (1/(2πσ²))^{n/2} exp[−(1/(2σ²)) Σ_{i=1}^{n} (xi − X̄)²]    (15.161)

Under H0, where the only admissible value of μ is μ0, the largest likelihood is:

L*(Θ0) = (1/(2πσ²))^{n/2} exp[−(1/(2σ²)) Σ_{i=1}^{n} (xi − μ0)²]    (15.162)

so that the likelihood ratio statistic is:

Λ = exp{−(1/(2σ²)) [Σ_{i=1}^{n} (xi − μ0)² − Σ_{i=1}^{n} (xi − X̄)²]}    (15.163)

Upon rewriting (xi − μ0)² as [(xi − X̄) + (X̄ − μ0)]², so that:

Σ_{i=1}^{n} (xi − μ0)² = Σ_{i=1}^{n} (xi − X̄)² + n(X̄ − μ0)²    (15.164)

the likelihood ratio statistic simplifies to:

Λ = exp[−n(X̄ − μ0)²/(2σ²)]    (15.165)
To proceed from here, we need the pdf for the random variable, Λ; but rather than confront this challenge directly, we observe that:

−2 ln Λ = [(X̄ − μ0)/(σ/√n)]² = Z²    (15.166)

where Z, of course, is the familiar z-test statistic,

Z = (X̄ − μ0)/(σ/√n)    (15.167)

with a standard normal distribution, N(0, 1). The random variable −2 ln Λ therefore has a χ²(1) distribution. From here it is now a straightforward exercise to obtain the rejection region in terms of, not Λ, but −2 ln Λ (or Z²). For a significance level of α = 0.05, we obtain from tail area probabilities of the χ²(1) distribution that

P(Z² ≥ 3.84) = 0.05    (15.168)

so that the rejection region is:

Z² > 3.84    (15.169)

Upon taking square roots, being careful to retain both positive as well as negative values, we obtain the familiar rejection conditions for the z-test:

√n(X̄ − μ0)/σ < −1.96  or  √n(X̄ − μ0)/σ > 1.96    (15.170)

The LR test under these conditions is therefore exactly the same as the z-test.
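The algebraic identity in Eq (15.166) can also be checked numerically. The following sketch (data simulated under H0, with the example values μ0 = 75 and σ = 1.5 used earlier in this section) confirms that −2 ln Λ and Z² coincide:

```python
# Numerical check that the likelihood ratio statistic of Eq (15.165)
# satisfies -2 ln(Lambda) = Z^2, as stated in Eq (15.166).
import numpy as np

rng = np.random.default_rng(1)
n, mu0, sigma = 25, 75.0, 1.5
x = rng.normal(mu0, sigma, size=n)      # sample drawn under H0

xbar = x.mean()
lam = np.exp(-n * (xbar - mu0) ** 2 / (2 * sigma ** 2))  # Eq (15.165)
z = (xbar - mu0) / (sigma / np.sqrt(n))                  # Eq (15.167)

print(np.isclose(-2 * np.log(lam), z ** 2))   # True
print(f"Reject H0 at alpha=0.05? {z**2 > 3.84}")
```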
Gaussian Distribution with Unknown Variance

For the same Gaussian population, when σ² is unknown, a test of H0: μ = μ0 against Ha: μ ≠ μ0 involves the restricted parameter set:

Θ0 = {(θ1, θ2) : θ1 = μ = μ0; θ2 = σ² > 0}    (15.171)

along with,

Θ = Θ0 ∪ Θa = {(θ1, θ2) : −∞ < θ1 = μ < ∞; θ2 = σ² > 0}    (15.172)

The likelihood function remains as in Eq (15.160), but this time both parameters are unknown, even though the hypothesis test is on μ alone. As a result, the function is maximized by the maximum likelihood estimators for both μ and σ², respectively, the sample average, X̄, and

σ̂² = Σ_{i=1}^{n} (xi − X̄)²/n

as obtained in Chapter 14.
The unrestricted maximum value, L*(Θ), in this case is obtained by introducing these ML estimators for the respective unknown parameters in Eq (15.160) and rearranging to obtain:

L*(Θ) = [n/(2π Σ_{i=1}^{n} (xi − X̄)²)]^{n/2} e^{−n/2}    (15.173)

When the parameters are restricted to Θ0, this time, the likelihood function is maximized, after substituting μ = μ0, by the MLE for σ², so that the largest likelihood value is obtained as:

L*(Θ0) = [n/(2π Σ_{i=1}^{n} (xi − μ0)²)]^{n/2} e^{−n/2}    (15.174)
Thus, the likelihood ratio statistic becomes:

Λ = [Σ_{i=1}^{n} (xi − X̄)² / Σ_{i=1}^{n} (xi − μ0)²]^{n/2}    (15.175)

And upon employing the sum-of-squares identity in Eq (15.164), and simplifying, we obtain:

Λ = [1 / (1 + n(X̄ − μ0)² / Σ_{i=1}^{n} (xi − X̄)²)]^{n/2}    (15.176)

If we now introduce the sample variance, S² = Σ_{i=1}^{n} (xi − X̄)²/(n − 1), this expression is easily rearranged to obtain:

Λ = [1 / (1 + (1/(n−1)) n(X̄ − μ0)²/S²)]^{n/2}    (15.177)
As before, to proceed from here, we need to obtain the pdf for the random variable, Λ. However, once again, we recognize a familiar statistic embedded in Eq (15.177), i.e.,

T² = [(X̄ − μ0)/(S/√n)]²    (15.178)

in terms of which Eq (15.177) becomes, with ν = n − 1:

Λ = [1/(1 + T²/ν)]^{n/2}    (15.179)
From here we observe that because Λ^{2/n} (and hence Λ) is a strictly monotonically decreasing function of T² in Eq (15.179), the rejection region Λ < λc, for which, say, P(Λ < λc) = α, is exactly equivalent to a rejection region T² > t²c, for which,

P(T² > t²c) = α    (15.180)

Once more, upon taking square roots, retaining both positive as well as negative values, we obtain the familiar rejection conditions for the t-test:

(X̄ − μ0)/(S/√n) < −t_{α/2}(ν)  or  (X̄ − μ0)/(S/√n) > t_{α/2}(ν)    (15.181)

The LR test under these conditions is therefore exactly the same as the t-test.
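Once again the equivalence can be verified numerically. In the sketch below (simulated data; the sample values are illustrative only), the decision based on Λ from Eq (15.179) always agrees with the usual two-sided t-test decision:

```python
# Check that Lambda from Eq (15.179) is a monotone function of T^2
# and therefore reproduces the one-sample t-test decision.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, mu0 = 15, 10.0
x = rng.normal(10.5, 2.0, size=n)

xbar, s = x.mean(), x.std(ddof=1)
t = (xbar - mu0) / (s / np.sqrt(n))                 # Eq (15.178)
lam = (1.0 / (1.0 + t**2 / (n - 1))) ** (n / 2)     # Eq (15.179)

t_crit = stats.t.ppf(0.975, df=n - 1)
lam_crit = (1.0 / (1.0 + t_crit**2 / (n - 1))) ** (n / 2)
print(f"reject via t: {abs(t) > t_crit}, reject via Lambda: {lam < lam_crit}")
```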
15.9.3 Asymptotic Distribution Results

For large sample sizes, n, the statistic −2 ln Λ has an approximate χ²(r) distribution, where r is the number of parameters restricted by H0, an approximation that improves as n → ∞. This asymptotic result is exactly equivalent to the large sample approximation to the sampling distribution of means of arbitrary populations. Note that in the second special case (Gaussian distribution with unknown variance), Θ contains two unknown parameters, μ and σ², while Θ0 contains only one unknown parameter, σ². The asymptotic distribution of −2 ln Λ will then also be χ²(1), in precisely the same sense in which t(ν) → N(0, 1).
15.10 Discussion
This chapter should not end without bringing to the reader's attention some of the criticisms of certain aspects of hypothesis testing. The primary issues have to do not so much with the mathematical foundations of the methodology as with the implementation and interpretation of the results in practice. Of several controversial issues, the following are three we wish to highlight:
1. Point null hypothesis and statistical-versus-practical significance: When the null hypothesis about a population parameter is that θ = θ0, where θ0 is a point on the real line, such a literal mathematical statement can almost always be proven false with computations carried to a sufficient number of decimal places. For example, if θ0 = 75.5, a large enough sample that generates x̄ = 75.52 (a routine possibility even when the population parameter is indeed 75.5) will lead to the rejection of H0, to two decimal places. However, in actual practice (engineering or science), is the distinction between the two real numbers 75.5 and 75.52 truly of importance? i.e., is the statement 75.5 ≠ 75.52, which is true in the strictest, literal mathematical sense, meaningful in practice? Sometimes yes, sometimes no; but the point is that such null hypotheses can almost always be falsified, raising the question: what then does rejecting H0 really mean?
2. Borderline p-values and variability: Even when the p-value is used to determine whether or not to reject H0, it is still customary to relate the computed p-value to some value of α, typically 0.05. But what happens for p = 0.06, or p = 0.04? Furthermore, an important fact that often goes unnoticed is that were we to repeat the experiment in question, the new data set will almost always lead to results that are different from those obtained earlier; and consequently the new p-value will also be different from that obtained earlier. One cannot therefore rule out the possibility of a borderline p-value "switching sides" purely as a result of intrinsic variability in the data (see the simulation sketch following this list).
3. Probabilistic interpretations: From a more technical perspective, if δ represents the observed discrepancy between the postulated population parameter and the value determined from data (a realization of the random variable, Δ), the p-value (or else the actual significance level of the test) is defined as P(Δ ≥ δ | H0); i.e., the probability of observing the computed difference or something more extreme if the null hypothesis is true. In fact, the probability we should be interested in is the reverse: P(H0 | Δ ≥ δ), i.e., the probability that the null hypothesis is true given the evidence in the data, which truly measures how much the observed data supports the proposed statement of H0. These two conditional probabilities are generally not the same.
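The point about borderline p-values in item 2 above is easily illustrated by simulation. The sketch below (all numerical values are hypothetical) repeats the same experiment ten times and shows the spread of the resulting p-values:

```python
# A small simulation of point 2: replicate experiments drawn from the
# same population produce noticeably different p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu_true, mu0, sigma, n = 75.8, 75.5, 1.5, 50

p_values = []
for _ in range(10):                       # ten replicate "experiments"
    x = rng.normal(mu_true, sigma, size=n)
    z = (x.mean() - mu0) / (sigma / np.sqrt(n))
    p_values.append(2 * (1 - stats.norm.cdf(abs(z))))

print(np.round(p_values, 3))  # some fall below 0.05, others above
```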
In light of these issues (and others we have not discussed here), how should one approach hypothesis testing in practice? First, statistical significance should not be the only factor in drawing conclusions from experimental results; the nature of the problem at hand should be taken into consideration as well. The yield from process A may in fact not be precisely 75.5% (after all, the probability that a random variable will take on a precise value on the real line is exactly zero), but 75.52% is sufficiently close that the difference is of no practical consequence. Secondly, one should be careful in basing the entire decision about experimental results on a single hypothesis test, especially with p-values at the border of the traditional α = 0.05. A single statistical hypothesis test of data obtained in a single study is just that: it can hardly be considered as having definitively "confirmed" something. Thirdly, decisions based on confidence intervals around the estimated population parameters tend to be less confusing and are more likely to provide the desired solution more directly.
Finally, the reader should be aware of the existence of other recently proposed alternatives to conventional hypothesis testing, e.g., Jones and Tukey (2000)³, or Killeen (2005)⁴. These techniques are designed to ameliorate some of the problems discussed above, but any discussions of them, even of the most cursory type, lie outside the intended scope of this chapter. Although not yet as popular as the classical techniques discussed here, they are worth exploring by the curious reader.
15.11 Summary and Conclusions

If the heart of statistics is inference, drawing conclusions about populations from information in a sample, then this chapter and Chapter 14 jointly constitute the heart of Part IV of this book. Following the procedures dis-

3 L. V. Jones, and J. W. Tukey, (2000): "A Sensible Formulation of the Significance Test," Psych. Methods, 5(4), 411-414.
4 P. R. Killeen, (2005): "An Alternative to Null-Hypothesis Significance Tests," Psychol. Sci., 16(5), 345-353.
REVIEW QUESTIONS

1. What is a statistical hypothesis?

2. What differentiates a simple hypothesis from a composite one?

3. What is H0, the null hypothesis, and what is Ha, the alternative hypothesis?

4. What is the difference between a two-sided and a one-sided hypothesis?

5. What is a test of a statistical hypothesis?

6. How is the US legal system illustrative of hypothesis testing?

7. What is a test statistic?

8. What is a critical/rejection region?

9. What is the definition of the significance level of a hypothesis test?

10. What are the types of errors to which hypothesis tests are susceptible, and what are their legal counterparts?

11. What is the α-risk, and what is the β-risk?

12. What is the power of a hypothesis test, and how is it related to the β-risk?

13. What is the sensitivity of a test as opposed to the specificity of a test?

14. How are the performance measures, sensitivity and specificity, related to the α-risk and the β-risk?

15. What is the p-value, and why is it referred to as the observed significance level?

16. What is the general procedure for carrying out hypothesis testing?

17. What test statistic is used for hypotheses concerning the single mean of a normal population when the variance is known?

18. What is a z-test?

19. What is an upper-tailed test as opposed to a lower-tailed test?

20. What is the one-sample z-test?
21. What test statistic is used for hypotheses concerning the single mean of a normal population when the variance is unknown?
39. What test statistic is used in the large sample approximation test concerning a single population proportion?

40. What is the objective criterion for ascertaining the validity of the large sample assumption in tests concerning a single population proportion?

41. What is involved in exact tests concerning a single population proportion?

42. What test statistic is used for hypotheses concerning two population proportions?

43. What is the central issue in testing hypotheses about non-Gaussian populations?

44. How does sample size influence how hypotheses about non-Gaussian populations are tested?

45. What options are available when testing hypotheses about non-Gaussian populations with small samples?

46. What are likelihood ratio tests?

47. What is the likelihood ratio test statistic?

48. Why is the likelihood ratio parameter Λ such that 0 < Λ < 1? What does a Λ value close to zero indicate? And what does a Λ value close to 1 indicate?

49. Under what condition does the likelihood ratio test become identical to the familiar z-test?

50. Under what condition does the likelihood ratio test become identical to the familiar t-test?

51. What is the asymptotic distribution result for the likelihood ratio statistic?

52. What are some criticisms of hypothesis testing highlighted in this chapter?

53. In light of some of the criticisms discussed in this chapter, what recommendations have been proposed for approaching hypothesis testing in practice?
EXERCISES
Section 15.2
15.1 The target Mooney viscosity of the elastomer produced in a commercial process is 44.0; if the average Mooney viscosity of product samples acquired from the process hourly and analyzed in the quality control laboratory exceeds or falls below this target, the process is deemed "out of control" and in need of corrective control action. Formulate the decision-making about the process performance as a hypothesis test, stating the null and the alternative hypotheses.

15.2 A manufacturer of energy-saving light bulbs wants to establish that the lifetime of its new brand exceeds the specification of 1,000 hours. State the appropriate null and alternative hypotheses.

15.3 A pharmaceutical company wishes to show that its newly developed acne medication reduces teenage acne by an average of 55% in the first week of usage. What are the null and alternative hypotheses?

15.4 The owner of a fleet of taxi cabs wants to determine if there is a difference in the lifetime of two different brands of car batteries used in the fleet of cabs. State the appropriate null and alternative hypotheses.

15.5 The safety coordinator of a manufacturing facility wishes to demonstrate that the mean time (in days) between safety incidents has deteriorated from the traditional 30 days. What are the appropriate null and alternative hypotheses?
15.6 Consider X1, X2, ..., Xn, a random sample from a normal population with a postulated mean μ0 but known variance;
(i) If a test is based on the following criterion:

P[(X̄ − μ0)/(σ/√n) > 1.65] = α

what is (a) the type of hypothesis being tested; (b) the test statistic; and (c) the significance level, α?
(ii) If instead, the variance is unknown, and the criterion is changed to:

P[(X̄ − μ0)/(s/√n) > 2.0] = 0.051

what is n?
15.7 Given a sample of size n = 15 from a normal distribution with unknown mean and variance, a t-test statistic of 2.0 was determined for a one-sided, upper-tailed test. Determine the associated p-value. From a different sample of unknown size drawn from the same normal population, the rejection region for a test at the significance level of α = 0.05 was determined as t > 1.70. What is n?
15.8 Consider a random sample, X1, X2, ..., Xn, from an exponential population, E(β). It is known that the sample mean, X̄, possesses a gamma γ(n, β/n) distribution. A hypothesis test regarding the population parameter β is to be determined using the following criterion:

P(CL < X̄ < CR) = 0.95

For this problem, what is (i) the type of hypothesis being tested; (ii) the test statistic; (iii) the significance level; and (iv) if n = 20, the rejection region?
Section 15.3

15.9 A random sample of size n = 16 drawn from a normal population with hypothesized mean μ0 = 50 and known variance, σ² = 25, produced a sample average x̄ = 47.5.
(i) Compute the appropriate test statistic.
(ii) If the alternative hypothesis is Ha: μ < μ0, at the α = 0.1 significance level, determine the rejection region. Should the null hypothesis H0: μ = μ0 be rejected?
15.10 Refer to Exercise 15.9. Determine the 95% confidence interval estimate for the population mean and compare it with the hypothesized mean. What does this imply about whether or not the null hypothesis H0: μ = μ0 should be rejected at the α = 0.05 level? Determine the p-value associated with this test.
15.11 Refer to Exercise 15.9. If instead the population variance is unknown and the sample variance is obtained as s² = 39.06, should the null hypothesis H0: μ = μ0 be rejected at the α = 0.1 level, and the α = 0.05 level?
15.12 The following random sample was obtained from a normal population with variance given as 1.00.

SN = {9.37, 8.86, 11.49, 9.57, 9.15, 9.10, 10.26, 9.87, 7.82, 10.47}

To test the hypothesis that the population mean is different from a postulated value of 10.00, (i) state the appropriate null and alternative hypotheses; (ii) determine the appropriate test statistic; (iii) determine the rejection region for a hypothesis test at the α = 0.05 significance level; (iv) determine the associated p-value. (v) What conclusion should be drawn from this test?
15.13 Refer to Exercise 15.12. Repeat for the case where the population variance is
unknown. Does this fact change the conclusion of the test?
15.14 A random sample of size n = 50 from a normal population with σ = 3.00 produced a sample mean of 80.05. At a significance level α = 0.05,
(i) Test the null hypothesis that the population mean μ0 = 75.00 against the alternative that μ > 75.00; interpret the result of the test.
(ii) Test the null hypothesis against the alternative μ ≠ 75.00. Interpret the result of this test and compare it to the test in (i). Why are the results different?
15.15 In carrying out a hypothesis test of H0: μ = 100 versus the alternative, Ha: μ > 100, given that the population variance is 1600, it has been recommended to reject H0 in favor of Ha if the mean of a random sample of size n = 100 exceeds 106. What is α, the significance level behind the statement?
Section 15.4

15.16 Random samples of 50 observations are each drawn from two independent normal populations, N(10.0, 2.5²) and N(12.0, 3.0²); X̄ represents the sample mean from the first population, and Ȳ is the sample mean from the second population.
(i) Determine the sampling distributions for X̄ and Ȳ;
(ii) Determine the mean and variance of the sampling distribution for Ȳ − X̄;
(iii) Determine the z-statistic associated with actual sample averages obtained as x̄ = 10.9 and ȳ = 11.8. Use this to test H0: μY = μX against Ha: μY > μX.
(iv) Since we know from the supplied population information that μY > μX, interpret the results of the test in (iii).
15.17 Two samples of sizes n1 = 20 and n2 = 15 taken from two independent normal populations with known standard deviations, σ1 = 3.5 and σ2 = 4.2, produced sample averages, x̄1 = 15.5 and x̄2 = 13.8. At the α = 0.05 significance level, test the null hypothesis that the means are equal against the alternative that they are not. Interpret the result. Repeat this test for Ha: μ1 > μ2; interpret the result.
15.18 The data in the table below is a random sample of 15 observations each from two normal populations with unknown means and variances. Test the null hypothesis that the two population means are equal against the alternative that μY > μX. First assume that the two population variances are equal. Interpret your results. Repeat the test without assuming equal variances. Is there a difference in the conclusions?

Sample    X       Y
1        12.03   13.74
2        13.01   13.59
3         9.75   10.75
4        11.03   12.95
5         5.81    7.12
6         9.28   11.38
7         7.63    8.69
8         5.70    6.39
9        11.75   12.01
10        6.28    7.15
11       12.53   13.47
12       10.22   11.57
13        7.17    8.81
14       11.36   13.10
15        9.16   11.32
15.19 Refer to Exercise 15.18. (i) On the same graph, plot the data for X and for Y against sample number. Comment on any feature that might indicate whether the two samples can be treated as independent or not.
(ii) Treat the samples as 15 paired observations and test the null hypothesis that the two population means are equal against the alternative that μY > μX. Interpret your result and compare it with the results of Exercise 15.18.
15.20 The data below are random samples from two independent lognormal distributions; specifically, XL1 ~ L(0, 0.25) and XL2 ~ L(0.25, 0.25).

XL1        XL2
0.81693    1.61889
0.96201    1.15897
1.03327    1.17163
0.84046    1.09065
1.06731    1.27686
1.34118    0.91838
0.77619    1.45123
1.14027    1.47800
1.27021    2.16068
1.69466    1.46116
(i) For the time being, ignore the fact that the sample size is too small to make the normal approximation valid for the sampling distribution of the sample means. At the α = 0.05 significance level, carry out a two-sample t-test concerning the equality of the means of these two populations, against the alternative that they are not equal. Interpret your results in light of the fact that we know that the two populations are not equal.
(ii) Fortunately, a logarithmic transformation of lognormally distributed data yields normally distributed data; as a result, let Y1 = ln XL1 and Y2 = ln XL2 and repeat (i) for the log-transformed Y1 and Y2 data. Interpret your results.
(iii) Comment on the implication of these results on the inappropriate use of the normal approximation as well as the use of α = 0.05 in a dogmatic fashion.
15.21 The sample averages X̄ = 38.8 and Ȳ = 42.4 were obtained from random samples taken from two independent populations of respective sizes nx = 120 and ny = 90. The corresponding sample variances were obtained as s²x = 20; s²y = 35. At the α = 0.05 significance level, test the hypothesis that the population mean μY is greater than μX. Interpret your result. How will the result change if instead, the hypothesis to be tested is that the two population means are different?
Section 15.5

15.22 A random sample of size n = 100 from a normal population with unknown mean and variance is to be used to test the null hypothesis, H0: μ = 12, versus the alternative, Ha: μ ≠ 12. The observed sample standard deviation is s = 0.5.
(i) Determine the rejection region for α = 0.1, 0.05 and α = 0.01.
(ii) If the true population mean has shifted to μ = 11.9, determine the value of β corresponding to each of the rejection regions obtained in (i) and hence the power of each test. Comment on the effect that lowering α has on the corresponding values of β and power.
15.23 In the following, given α = 0.05 and σ = 1.0, determine the missing hypothesis test characteristic parameter for a two-sided, 1-sample z-test:
(i) Power = 0.9, sample size = 40; and Power = 0.9, sample size = 20.
(ii) Power = 0.75, sample size = 40; and Power = 0.75, sample size = 20.
(iii) Power = 0.9, hypothesized mean shift to be detected = 0.5, and Power = 0.9, hypothesized mean shift to be detected = 0.75.
(iv) Power = 0.75, hypothesized mean shift to be detected = 0.5, and Power = 0.75, hypothesized mean shift to be detected = 0.75.
(v) Hypothesized mean shift to be detected = 0.5, sample size = 40; and Hypothe-
Show that the relative sensitivity of the sample size, n, to the signal-to-noise ratio, SN = δ/σ, is

∂ ln n / ∂ ln SN = −2

thus establishing that a 1% increase in the signal-to-noise ratio SN translates to an (instantaneous) incremental reduction of 2% in sample size requirements. Comment on ways by which one might increase signal-to-noise ratio in practical problems.
Section 15.6
15.28 The variance of a sample of size n = 20 drawn from a normal population with mean 100 was obtained as s² = 9.5. At the α = 0.05 significance level, test the

2.84    0.67    1.66    0.12
0.41    2.72    3.26    6.51
3.75    5.22    1.78    4.05
2.16   16.65    1.31    1.52

(i) On the basis of the exact sampling distribution of the sample mean, X̄, determine a 95% confidence interval estimate of the population parameter, β.
(ii) Test the hypothesis H0: β = 4 versus the alternative Ha: β ≠ 4. What is the p-value associated with this test?
(iii) Repeat the test in (ii) using the normal approximation (which is not necessarily valid in this case). Compare this test result with the one in (ii).
15.42 Refer to Exercise 15.41. Using the normal approximation, test the hypothesis that the sample variance is 16 versus the alternative that it is not, at the α = 0.05 significance level.

C² = (n − 1)S²/σ0²

should be used, where S² is the sample variance. Hence, establish that the likelihood ratio test for the variance of a single normal population is identical to the result obtained in Section 15.6.
15.45 Let X1, X2, ..., Xn be a random sample from an exponential population E(β). Establish that the likelihood ratio test of the hypothesis that β = β0 versus the alternative that β ≠ β0 will result in a rejection region obtained from the solution to the following inequality:

(x̄/β0) e^{−x̄/β0} ≤ k

where x̄ is the observed sample average, and k is a constant.
APPLICATION PROBLEMS
15.46 In a study to determine the performance of processing machines used to add raisins to trial-size Raisin Bran cereal boxes (see Example 12.3 in Chapter 12), 6 sample boxes are taken at random from each machine's production line and the number of raisins in each box counted. The results for machines 3 and 4 are shown below. Assume that these can be considered as random samples from a normal population.

Machine 3   Machine 4
13           7
 7           4
11           7
 9           7
12          12
18          18
(i) If the target average number of raisins dispensed per box is 10, by carrying out appropriate hypothesis tests, determine which of the two machines is operating according to target and which is not. State the p-value associated with each test.
(ii) Is there any significant difference between the mean number of raisins dispensed by these two machines? Support your answer adequately.
15.47 The data table below shows the wall thickness (in ins) of cast aluminum cylinder heads used in aircraft engine cooling jackets, taken from Mee (1990)⁵.

0.223   0.201   0.228   0.223   0.214   0.224
0.193   0.231   0.223   0.237   0.213   0.217
0.218   0.204   0.215   0.226   0.233   0.219
(HCB) model⁸ and their new (KG) model. The results are shown in the table below.

Viscosity, η (10⁻⁵ Pa·s)
Experimental   HCB           KG
Data           Predictions   Predictions
2.740          2.718         2.736
2.569          2.562         2.575
2.411          2.429         2.432
2.504          2.500         2.512
3.237          3.205         3.233
3.044          3.025         3.050
2.886          2.895         2.910
2.957          2.938         2.965
3.790          3.752         3.792
3.574          3.551         3.582
3.415          3.425         3.439
3.470          3.449         3.476
(i) Treated as paired data, perform an appropriate hypothesis test to compare the new KG model predictions with corresponding experimental results. Is there evidence to support the claim that this model provides "excellent agreement" with experimental data?
(ii) Treated as paired data, test whether there is any significant difference between the HCB model predictions and the new KG model predictions.
(iii) As in (i), perform a test to assess the performance of the classic HCB model predictions against experimental data. Can the HCB model be considered as also providing "excellent agreement" with experimental data?
15.50 The table below, from Lucas (1985)⁹, shows the number of accidents occurring per quarter (three months), over a 10-year period, at a DuPont company facility. The data set is divided into two periods: Period I for the first five-year period of the study; Period II, for the second five-year period.

Period I          Period II
5   5  10   8     3   4   2   0
4   5   7   3     1   3   2   2
2   8   6   9     7   7   1   4
5   6   5  10     1   2   2   1
6   3   3  10     4   4   4   4
(i) Perform appropriate tests to confirm or refute the hypothesis that the true population mean number of accidents in the first period is 6, while the same population parameter was halved in the second period.
(ii) Separately test the hypothesis that there is no significant difference between the mean number of accidents in each period. State any assumptions needed to answer these questions.
8 Hirschfelder J. O., C. F. Curtiss, and R. B. Bird (1964). Molecular Theory of Gases and Liquids. 2nd printing. J. Wiley, New York, NY.
9 Lucas J. M., (1985). "Counted Data CUSUMs," Technometrics, 27, 129-144.
15.51 A survey of alumni that graduated between 2000 and 2005 from the chemical engineering department of a University in the Mid-Atlantic region of the US involved 150 randomly selected individuals: 100 BS graduates and 50 MS graduates. (The PhD graduates participated in a different survey.) The survey showed, among other things, that 9.5% of BS graduates and 4.5% of MS graduates were unemployed for at least one year during this period.
(i) If the corresponding national unemployment averages for all BS and MS degree holders in all engineering disciplines over the same period are, respectively, 15.2% and 7.5%, perform appropriate hypothesis tests to determine whether or not the chemical engineering alumni of this University fare better in general than graduates with corresponding degrees in other engineering disciplines.
(ii) Does having an advanced degree make any difference in the unemployment status of the alumni of this University? Support your answer adequately.
(iii) In connection with (ii) above, if it is desired to determine any true difference between the unemployment status of this University's alumni to within 0.5% with 95% confidence, how many alumni would have to be sampled? State any assumptions clearly.
15.52 The data set in Problems 1.13 and 14.42, shown in the table below for ease of reference, is the time (in months) from receipt to publication of 85 papers published in the January 2004 issue of a leading chemical engineering research journal.

19.2    9.0   17.2    8.2    4.5   13.5   20.7    7.9   19.5    8.8
18.7    7.4    9.7   13.7    8.1    8.4   10.8   15.1    5.3   12.0
 3.0   18.5    5.8    6.8   14.5    3.3   11.1   16.4    7.3    7.4
 7.3    5.2   10.2    3.1    9.6   12.9   17.3    6.0   24.3   21.3
19.3    2.5    9.1    8.1    9.8   15.4   15.7    8.2    8.8    7.2
12.8    4.2    4.2    7.8    9.5    3.9    8.7    5.9    5.3    1.8
10.1   10.0   18.7    5.6    3.3    7.3   11.3    2.9    5.4   15.2
 8.0   11.7   17.2    4.0    3.8    7.4    5.3   10.6   15.2   11.5
 5.9   20.1   12.2   12.0    8.8
(i) Using an appropriate probability model for the population from which the data is a random sample, obtain a precise 95% confidence interval for the mean of this population; use this interval estimate to test the hypothesis by the Editor-in-Chief that the mean time-to-publication is 9 months, against the alternative that it is higher.
(ii) Considering n = 85 as a large enough sample size for a normal approximation for the distribution of the sample mean, repeat (i) carrying out an appropriate one-sample test. Compare your result with that in (i). How good is the normal approximation?
(iii) Use the normal approximation to test the hypothesis that the mean time to
0  1  2  1  0  0  0  0  0  1  1
0  2  0  2  0  2  0  0  0  0  0
1  0  0  0  2  0  0  0  1  1  1
0  1  0  0  0  0  0  1  1  0  1
(i) If the company proposes to use as a "safety performance interval" (SPI) the statistic

SI = X̄ ± 3√X̄

compute this interval from the supplied sample data.
(ii) The utility of the SPI is that any observation falling outside the upper bound is deemed to be indicative of a potential real increase in the number of safety incidents. Consider that over the most recent four-month period, the plant recorded 1, 3, 2, 3 safety incidents respectively. According to the SPI criterion, is there evidence that there has been a real increase in the number of incidents during any of the most recent four months?
15.54 Refer to Problem 15.53 and consider the supplied data as a random sample from a Poisson population with unknown mean λ.
(i) Assuming that a sample size of 48 is large enough for the normal approximation to be valid for the distribution of the sample mean, use the sample data to test the hypothesis that the population mean λ = 0.5 versus the alternative that it is not.
(ii) Use the theoretical value postulated for the population mean and compute P(X ≥ x0 | λ = 0.5) for x0 = 1, 2, 3, and hence determine the p-values associated with the individual hypotheses that each recent observation, 1, 2, and 3, belongs to the same Poisson population, P(0.5), against the alternative that the observations belong to a different population with a λ > 0.5.
15.55 In clinics where Assisted Reproductive Technologies such as in-vitro fertilization are used to help infertile couples conceive and bear children (see Chapter 11 for a case study), it is especially important to be able to determine the probability of a single embryo resulting in a live birth at the end of the treatment cycle. As shown in Chapter 11, determining this parameter, which is equivalent to a binomial probability of success, remains a challenge, however. A typical clinical study, the result of which may be used to determine this parameter for carefully selected cohort groups, is described below.

A cohort of 100 patients under the age of 35 years (the "Younger" group), and another cohort of the same size, but consisting of patients that are 35 years and older (the "Older" group), participated in a clinical study where each patient received five embryos in an in-vitro fertilization (IVF) treatment cycle. The results are shown in the table below: x is the number of live births per delivered pregnancy; yO and yY represent, respectively, how many in the older and younger group had the pregnancy outcome of x.
(i) At the α = 0.05 significance level, determine whether or not the single embryo probability of success parameter, p, is different for each cohort group.
(ii) At the α = 0.05 significance level, test the hypothesis that p = 0.3 for the older cohort group versus the alternative that it is less than 0.3. Interpret your result.
(iii) At the α = 0.05 significance level, test the hypothesis that p = 0.3 for the younger cohort group versus the alternative that it is greater than 0.3. Interpret your result.
(iv) If it is desired to be able to determine the probability of success parameter for the older cohort group to within 0.05 with 95% confidence, determine the size of the cohort group to use in the clinical study.
x: No. of live births in a delivered pregnancy; yO: total no. of older patients (out of 100) with pregnancy outcome x; yY: total no. of younger patients (out of 100) with pregnancy outcome x.

x    yO    yY
0    32     8
1    41    25
2    21    35
3     5    23
4     1     8
5     0     1
15.56 To characterize precisely how many sick days its employees take, a random sample of 50 employee files was selected and the following statistics were determined: x̄ = 9.50 days and s² = 42.25.
(i) Determine a 95% confidence interval on μ, the true population mean number of sick days taken per employee.
(ii) Does the evidence in the data support the hypothesis that the mean number of sick days taken by employees is less than 14.00 days?
(iii) What is the power of the test you conducted in (ii)? State any assumptions clearly.
(iv) The personnel director who ordered the study was a bit surprised at how large the computed sample variance turned out to be. However, the human resources statistician insisted that this is not necessarily larger than the typical industry value of σ² = 35. Assuming that the sample is from a normal population, carry out an appropriate test to confirm or refute this claim. What is the p-value associated with the test?
15.57 It is desired to characterize the precision of two instruments used to measure the density of a liquid stream in a refinery's distillation column. Ten measurements of a calibration sample with known density 0.85 gm/cc are shown in the table below.

Instrument 1   Instrument 2
Measurements   Measurements
0.864          0.850
0.858          0.916
0.855          0.861
0.764          0.866
0.791          0.874
0.827          0.901
0.849          0.836
0.818          0.953
0.747          0.733
0.846          0.836

Consider these data as random samples from two independent normal populations; carry out an appropriate test to confirm or refute the hypothesis that instrument 2 is less precise than instrument 1.
15.58 In producing the enzyme cyclodextrin glucosyltransferase in bacterial cultures via two different methods ("shaken" and "surface"), Ismail et al., (1996)¹⁰, obtained the data shown in the table below on the protein content (in mg/ml) obtained by each method.

Protein content (mg/ml)
Shaken   Surface
1.91     1.71
1.66     1.57
2.64     2.51
2.62     2.30
2.57     2.25
1.85     1.15

Is the variability in the protein content the same for both methods? State any assumptions you may need to make in answering this question.
15.59 The table below (see also Problem 14.41 in Chapter 14) shows the time in months between occurrences of safety violations for three operators, "A", "B", and "C", working in a toll manufacturing facility.

A      B      C
1.31   1.94   0.79
0.15   3.21   1.22
3.02   2.91   0.65
3.17   1.66   3.90
4.84   1.51   0.18
0.71   0.30   0.57
0.70   0.05   7.26
1.41   1.62   0.43
2.68   6.75   0.96
0.68   1.29   3.76
Since the random variable in question is exponentially distributed and the sample size of 10 is considerably smaller than is required for a normal approximation to be valid for the sampling distribution of the sample mean, testing hypotheses about the difference between the means of these populations requires a different approach. The precise 95% confidence interval estimates of the unknown population parameters (obtained from the sampling distribution of the mean of an exponential random variable (Problem 14.41)) can be used to investigate if the population means overlap. An alternative approach involves the distribution of the difference between two exponential random variables.

It can be shown (Exercise 9.3) that given two independent random variables, X1 and X2, with identical exponential E(β) distributions, the pdf of their difference,

Y = X1 − X2    (15.182)

is

f(y) = (1/(2β)) e^{−|y|/β};  −∞ < y < ∞    (15.183)

with mean 0 and variance 2β². It can be shown that, in part because of the symmetric
10 Ismail A. S., U. I. Sobieh, and A. F. Abdel-Fattah, (1996). "Biosynthesis of cyclodextrin glucosyltransferase and β-cyclodextrin by Bacillus macerans 314 and properties of the crude enzyme," The Chem. Eng. J., 61, 247-253.
                 Plant A   Plant B
Sample size, n   10        10
Sample average   28        33
Variance, s²     10.5      13.2

First ascertain whether or not the two population variances are the same. Then carry out an appropriate test of the equality of the mean production output that is commensurate with your findings regarding the variances. Interpret your results.
15.61 In a certain metal oxide ore refining process, several samples (between 6 and 12) are taken monthly from various sections of the huge fluidized bed reactor and analyzed to determine average monthly residual silica content. The table below shows the result of such analyses for a 6-month period.

Month   Number of Samples Analyzed   s, Sample Standard Deviation
Jan     12                           32.1
Feb      9                           21.0
Mar     10                           24.6
Apr     11                           17.8
May      7                           15.2
Jun     11                           17.2

The standard by which the ore refining operation is declared "normal" for any month requires a mean silica content μ = 63.7, and inherent variability, σ = 21.0; otherwise the operation is considered "abnormal." At a 5% significance level, identify those months during which the refinery operation would be considered abnormal. Support your conclusions adequately.
TABLE 15.12: Summary of selected hypothesis tests and their characteristics

Test              Parameter (Null Hypothesis, H0)      Estimator                             Test Statistic                                        H0 Rejection Condition
z-test            μ; (H0: μ = μ0)                      X̄ = Σ_{i=1}^{n} Xi/n                  Z = (X̄ − μ0)/(σ/√n)                                   Table 15.2
t-test            μ; (H0: μ = μ0)                      X̄ (S for unknown σ)                   T = (X̄ − μ0)/(S/√n)                                   Table 15.3
2-sample z-test   δ = μ1 − μ2; (H0: δ = δ0)            D̄ = X̄1 − X̄2                           Z = (D̄ − δ0)/√(σ1²/n1 + σ2²/n2)                       Table 15.4
2-sample t-test   δ = μ1 − μ2; (H0: δ = δ0)            D̄ = X̄1 − X̄2                           T = (D̄ − δ0)/(Sp √(1/n1 + 1/n2));                     Table 15.4
                                                                                             Sp² = [(n1−1)S1² + (n2−1)S2²]/(n1 + n2 − 2)
Paired t-test     δ = μ1 − μ2; (H0: δ = δ0) (Paired)   D̄ = Σ_{i=1}^{n} Di/n;                 T = (D̄ − δ0)/(S_D/√n);                                Table 15.5
                                                       (Di = X1i − X2i)                      S_D² = Σ_{i=1}^{n} (Di − D̄)²/(n − 1)
χ²-test           σ²; (H0: σ² = σ0²)                   S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1)    C² = (n − 1)S²/σ0²                                    Table 15.7
F-test            σ1²/σ2²; (H0: σ1² = σ2²)             S1²/S2²                               F = S1²/S2²                                           Table 15.10
Chapter 16
Regression Analysis
It is often the case in many practical problems that the variability observed in
a random variable, Y , consists of more than just the purely randomly varying
phenomena that have occupied our attention up till now. For this new class
of problems, an underlying functional relationship exists between Y and an
independent variable, x, (deliberately written in the lower case for reasons
that will soon become clear), with a purely random component superimposed
on this otherwise deterministic component. This chapter is devoted to dealing
with problems of this kind. The values observed for the random variable Y
depend on the values of the (deterministic) variable, x, and, were it not for
the presence of the purely random component, Y would have been perfectly
predictable given x. Regression analysis is concerned with obtaining, from
data, the best estimate of the relationship between Y and x.
Although apparently different from what we have dealt with up until now, we will see that regression analysis in fact builds directly upon many of the results obtained thus far, especially estimation and hypothesis testing.
16.1
Introductory Concepts
Consider the data in Table 16.1 showing the boiling point (in °C) of 8
hydrocarbons in a homologous series, along with n, the number of carbon
atoms in each molecule. A scatter plot of boiling point versus n is shown in
Fig 16.1, where we notice right away that as the number of carbon atoms in
this homologous series increases, so does the boiling point of the hydrocarbon compound. In fact, the implied relationship between these two variables
appears to be so strong that one is immediately inclined to conclude that it
must be possible to predict the boiling point of compounds in this series on
the basis of the number of carbon atoms. There is therefore no doubt that
there is some sort of a functional relationship between n and boiling point. If
determined correctly, such a relationship will provide, among other things,
a simple way to capture the extensive data on such physical properties of
compounds in this particular homologous series.
TABLE 16.1: Boiling points of a homologous series of hydrocarbons

Compound     Carbon Atoms, n   Boiling Point (°C)
Methane      1                 -162
Ethane       2                  -88
Propane      3                  -42
n-Butane     4                    1
n-Pentane    5                   36
n-Hexane     6                   69
n-Heptane    7                   98
n-Octane     8                  126
FIGURE 16.1: Boiling point of hydrocarbons in Table 16.1 as a function of the number of carbon atoms in the compound
16.1.1 Dependent and Independent Variables
Many cases such as the one illustrated above arise in science and engineering where the value taken by one variable appears to depend on the value taken by another. Not surprisingly, it is customary to refer to the variable whose value depends on the value of another as the dependent variable, while the other variable is known as the independent variable. It is often desired to capture the relationship between these two variables in some mathematical form. However, because of measurement errors and other sources of variability, this exercise requires the use of probabilistic and statistical techniques. Under these circumstances, the independent variable is considered as a fixed, deterministic quantity that is not subject to random variability. This is perfectly exemplified in n, the number of carbon atoms in the hydrocarbon compounds of Table 16.1; it is a known quantity not subject to random variability. The dependent variable, on the other hand, is the random variable, subject to a wide variety of potential sources of random variability, including, but not limited to, measurement uncertainties. The dependent variable is therefore represented as the random variable, Y, while the independent variable is represented as the deterministic variable, x, written in the lower case to underscore its deterministic nature.
The variability observed in the random variable, Y, is typically considered to consist of two distinct components, i.e., for each observation, Yi, i = 1, 2, ..., n:

Yi = g(xi; θ) + εi    (16.1)

where g(xi; θ) is the deterministic component, a functional relationship, with θ as a set of unknown parameters, and εi is the random component. The deterministic mathematical relationship between these two variables is a model of how the independent x (also known as the "predictor") affects the predictable part of the dependent Y, sometimes known as the "response."
In some cases, the functional form of g(xi) is known from fundamental scientific principles. For example, if Y is the distance (in cm) traveled in time ti secs by a particle launched with an initial velocity, u (cm/sec), and traveling at a constant acceleration, a (cm/sec²), then we know that

g(ti; u, a) = u ti + (1/2) a ti²    (16.2)
16.1.2 The Principle of Least Squares

Consider first the simplest case in which, in the absence of any x-dependence, each observation may be written as:

Yi = μ + εi    (16.3)

where the observed random variability is due to the random component εi. Furthermore, let the variance of Y be σ². Then from Eq (16.3), we obtain:

E[Yi] = μ + E[εi]    (16.4)
      = μ    (16.5)

since μ is a constant. Thus, the fact that Y has a distribution (unspecified) with mean μ and variance σ² implies that in Eq (16.3), the random error term, εi, has zero mean and variance σ².

To estimate μ from the given random sample, it seems reasonable to choose a value that is as "close" as possible to all the observed data. This concept may be represented mathematically as:

min_μ S(μ) = Σ_{i=1}^{n} (Yi − μ)²    (16.6)

Setting dS/dμ = 0 yields:

−2 Σ_{i=1}^{n} (Yi − μ) = 0    (16.7)

which is solved to give the least-squares estimate,

μ̂ = (Σ_{i=1}^{n} Yi)/n = Ȳ    (16.8)

the familiar sample average. If, instead, the observations do not all carry equal weight, each Yi may be weighted by Wi to give the weighted sum-of-squares:

S(μ) = Σ_{i=1}^{n} Wi²(Yi − μ)²    (16.9)
whose minimization yields:

μ̂ = Σ_{i=1}^{n} ωi Yi    (16.11)

where

ωi = Wi² / Σ_{i=1}^{n} Wi²    (16.12)

Note that 0 < ωi < 1. The result in Eq (16.11) is therefore an appropriately weighted average, a generalization of Eq (16.8) where ωi = 1/n. This variation on the least-squares approach is known, appropriately, as weighted least-squares; we shall encounter it later in this chapter.
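As a quick illustration, the following sketch (with made-up observations and weights) evaluates Eqs (16.11) and (16.12) directly:

```python
# Weighted least-squares estimate of a constant mean, Eqs (16.11)-(16.12):
# a weighted average with weights proportional to Wi^2.
import numpy as np

y = np.array([10.2, 9.8, 10.5, 9.9])   # observations (hypothetical)
W = np.array([1.0, 1.0, 0.5, 2.0])     # per-observation weights (hypothetical)

omega = W**2 / np.sum(W**2)            # Eq (16.12); note sum(omega) == 1
mu_hat = np.sum(omega * y)             # Eq (16.11)
print(f"weighted estimate: {mu_hat:.3f}; plain average: {y.mean():.3f}")
```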
16.2 Simple Linear Regression
16.2.1 One-Parameter Model
Consider the model:

Y = βx + ε    (16.13)

where the random error, ε, has zero mean and constant variance, σ². Then E(Y|x), the conditional expectation of Y given a specific value for x, is:

μ_{Y|x} = E(Y|x) = βx,    (16.14)

recognizable as the equation of a straight line with slope β and zero intercept. It is also known as the one-parameter regression model, a classic example of which is the famous Ohm's law in physics: the relationship between the voltage, V, across a resistor with unknown resistance, R, and the current, I, flowing through the resistive element, i.e.,

V = IR    (16.15)
The method of least squares requires minimizing the sum of squared deviations:

S(β) = Σ_{i=1}^{n} (yi − βxi)²    (16.16)

Setting dS/dβ = 0 yields:

−2 Σ_{i=1}^{n} xi(yi − βxi) = 0    (16.17)

which, when solved for β, produces the estimate:

β̂ = Σ_{i=1}^{n} xi yi / Σ_{i=1}^{n} xi²    (16.18)

This is the expression for the slope of the best (i.e., least-squares) straight line (with zero intercept) through the points (xi, yi).
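A direct computation of Eq (16.18) takes only a few lines. In the sketch below the data are hypothetical, mimicking an Ohm's law experiment with R ≈ 2:

```python
# One-parameter (zero-intercept) least-squares fit of Eq (16.18):
# slope = sum(x*y) / sum(x^2). Data values are made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # e.g., current I (hypothetical)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])    # e.g., voltage V (hypothetical)

beta_hat = np.sum(x * y) / np.sum(x * x)   # Eq (16.18)
print(f"estimated slope: {beta_hat:.3f}")  # close to the 'true' R = 2
```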
16.2.2 Two-Parameter Model

Consider now the model:

Y = β0 + β1 x + ε    (16.19)

so that the conditional expectation of Y given x is:

μ_{Y|x} = E(Y|x) = β0 + β1 x    (16.21)

In this particular case, regression analysis is primarily concerned with obtaining the best estimates for θ = (β0, β1); characterizing the random sequence εi; and making inference about the parameter estimates, θ̂ = (β̂0, β̂1).
FIGURE 16.2: The true regression line, μ_{Y|x} = β0 + β1 x, and the zero mean random error εi
Primary Model Assumption

In this case, the true but unknown regression line is represented by Eq (16.21), with data scattered around it. The fact that E(ε) = 0 indicates that the data scatters evenly around the true line; more precisely, the data varies randomly around a mean value that is a function of x defined by the true but unknown regression line in Eq (16.21). This is illustrated in Figure 16.2.

It is typical to assume that each εi, the random component of the model, is mutually independent of the others and follows a Gaussian distribution with zero mean and variance σ², i.e., εi ~ N(0, σ²). The implication in this particular case is therefore that each data point, (xi, yi), comes from a Gaussian distribution whose mean is dependent on the value of x, and falls on the true regression line, as illustrated in Fig 16.3. Equivalently, the true regression line passes through the mean of the series of Gaussian distributions having the same variance. The two main assumptions underlying regression analysis may now be summarized as follows:

1. εi forms an independent random sequence, with zero mean and variance σ² that is constant for all x;

2. εi ~ N(0, σ²), so that Yi ~ N(β0 + β1 x, σ²)
Ordinary Least Squares (OLS) Estimates

Obtaining the least-squares estimates of the intercept, β0, and slope, β1, from data (xi, yi) involves minimizing the sum-of-squares function,

S(β0, β1) = Σ_{i=1}^{n} [yi − (β1 xi + β0)]²    (16.22)
FIGURE 16.3: The Gaussian assumption regarding variability around the true regression line, giving rise to εi ~ N(0, σ²): the 6 points represent the data at x1, x2, ..., x6; the solid straight line is the true regression line, which passes through the mean of the sequence of the indicated Gaussian distributions
where the usual first derivatives of the calculus approach yield:

∂S/∂β0 = −2 Σ_{i=1}^{n} [yi − (β1 xi + β0)] = 0    (16.23)

∂S/∂β1 = −2 Σ_{i=1}^{n} xi [yi − (β1 xi + β0)] = 0    (16.24)

which rearrange to give the normal equations:

β1 Σ_{i=1}^{n} xi + β0 n = Σ_{i=1}^{n} yi    (16.25)

β1 Σ_{i=1}^{n} xi² + β0 Σ_{i=1}^{n} xi = Σ_{i=1}^{n} xi yi    (16.26)

These equations may also be obtained directly from the model itself: summing Eq (16.19) over all n observations yields

Σ_{i=1}^{n} yi = β1 Σ_{i=1}^{n} xi + β0 n    (16.27)

where the last term involving εi has vanished upon the assumption that n is sufficiently large so that, because E(εi) = 0, the sum will be close to zero (a point worth keeping in mind to remind the reader that the results of solving the normal equations provide estimates, not precise values).
Also, multiplying the model equation by xi and summing yields:

Σ_{i=1}^{n} yi xi = β1 Σ_{i=1}^{n} xi² + β0 Σ_{i=1}^{n} xi    (16.28)

where, once again, the last term involving εi has vanished because of independence with xi and the assumption, once again, that n is sufficiently large that the sum will be close to zero. Note that these two equations are identical to the normal equations; more importantly, as derived by summation from the original model, they are the sample equivalents of the following expectations:

E(Y) = β1 E(x) + β0    (16.29)

E(Yx) = β1 E(x²) + β0 E(x)    (16.30)

which should help put the emergence of the normal equations into perspective.
Returning to the task of computing least squares estimates of the two model parameters, let us define the following terms:

Sxx = Σ_{i=1}^{n} (xi − x̄)²    (16.31)

Syy = Σ_{i=1}^{n} (yi − ȳ)²    (16.32)

Sxy = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ)    (16.33)

where ȳ = (Σ_{i=1}^{n} yi)/n and x̄ = (Σ_{i=1}^{n} xi)/n represent the usual averages. When expanded out and consolidated, these equations yield:

nSxx = n Σ_{i=1}^{n} xi² − (Σ_{i=1}^{n} xi)²    (16.34)

nSyy = n Σ_{i=1}^{n} yi² − (Σ_{i=1}^{n} yi)²    (16.35)

nSxy = n Σ_{i=1}^{n} xi yi − (Σ_{i=1}^{n} xi)(Σ_{i=1}^{n} yi)    (16.36)

In terms of these quantities, the solution of the normal equations gives the least-squares estimates as:

β̂1 = Sxy/Sxx    (16.37)

β̂0 = ȳ − β̂1 x̄    (16.38)
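The sketch below evaluates Eqs (16.31), (16.33), (16.37) and (16.38) for the boiling point data of Table 16.1:

```python
# Least-squares slope and intercept via Eqs (16.31)-(16.38), illustrated
# with the boiling-point data of Table 16.1.
import numpy as np

n_atoms = np.arange(1, 9)                              # x
bp = np.array([-162, -88, -42, 1, 36, 69, 98, 126])    # y, deg C

xbar, ybar = n_atoms.mean(), bp.mean()
Sxx = np.sum((n_atoms - xbar) ** 2)                    # Eq (16.31)
Sxy = np.sum((n_atoms - xbar) * (bp - ybar))           # Eq (16.33)

b1 = Sxy / Sxx                                         # Eq (16.37)
b0 = ybar - b1 * xbar                                  # Eq (16.38)
print(f"slope = {b1:.2f} deg C per carbon; intercept = {b0:.2f} deg C")
```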
Nowadays, the computations implied in this derivation are no longer carried out by hand, of course, but by computer programs; the foregoing discussion is therefore intended to acquaint the reader with the principles and mechanics underlying the numbers produced by statistical software packages.

Maximum Likelihood Estimates

Under the Gaussian assumption, the regression equation, written in the more general form,

Y = η(x, θ) + ε,    (16.39)

implies that the observations Y1, Y2, ..., Yn come from a Gaussian distribution with mean η and variance σ²; i.e., Y ~ N(η(x, θ), σ²). If the data can be considered as a random sample from this distribution, then the method of maximum likelihood presented in Chapter 14 may be used to estimate η(x, θ) and σ² in precisely the same manner in which estimates of the N(μ, σ²) population parameters were determined in Section 14.3.2. The only difference this time is that the population mean, η(x, θ), is no longer constant, but a function of x. It can be shown (see Exercise 16.5) that when the variance σ² is constant, the maximum likelihood estimate for β in the one-parameter model,

η(x, β) = βx    (16.40)

is identical to the least-squares estimate in Eq (16.18),

β̂ = Σ_{i=1}^{n} xi yi / Σ_{i=1}^{n} xi²    (16.41)

while the corresponding maximum likelihood estimate of the variance is:

σ̂² = (1/n) Σ_{i=1}^{n} (yi − β̂xi)²    (16.42)
TABLE 16.2: Density (in gm/cc) and weight percent of ethanol in ethanol-water mixture

Density (g/cc)   Wt % Ethanol
0.99823           0
0.98938           5
0.98187          10
0.97514          15
0.96864          20
0.96168          25
0.95382          30
0.94494          35

For a given xi, the fitted value ŷi = β̂0 + β̂1 xi provides an estimate (or prediction) of the true but unknown value of the observation yi (unknown because of the additional random effect, εi). If we now define as ei the error between the actual observation and the estimated value, i.e.,

ei = yi − ŷi,    (16.43)

this term is known as the residual error, or simply the residual; it is our best estimate of the unknown εi, just as ŷ = β̂0 + β̂1 x is our best estimate of the true regression line μ_{Y|x} = E(Y|x) = β1 x + β0.

As discussed shortly (Section 16.2.10), the nature of the sequence of residuals provides a great deal of information about how well the model represents the observations.
Example 16.1: DENSITY OF ETHANOL-WATER MIXTURE
An experimental investigation into how the density of an ethanol-water
mixture varies with weight percent of ethanol in the mixture yielded
the result shown in Table 16.2. Postulate a linear two-parameter model
as in Eq (16.19), and use the supplied data to obtain least-squares estimates of the slope and intercept, and also the residuals. Plot the data
versus the model and comment on the t.
Solution:
Given this data set, just about any software package, from Excel to MATLAB and MINITAB, will produce the following estimates:
$$\hat{\theta}_1 = -0.001471; \quad \hat{\theta}_0 = 0.9975 \quad (16.44)$$
so that the fitted regression line is
$$\hat{y} = 0.9975 - 0.001471x \quad (16.45)$$
The model fit to the data is shown in Fig 16.4.
TABLE 16.3: Estimated values, ŷ, and residuals, e, for the density data of Example 16.1.
FIGURE 16.4: The fitted straight line to the Density versus Ethanol Weight % data (Density = 0.9975 - 0.001471 Wt%Ethanol; S = 0.0008774; R-Sq = 99.8%; R-Sq(adj) = 99.8%). The additional terms included in the graph, S, R-Sq and R-Sq(adj), are discussed later.
For the given values of x, the estimated values, ŷ, and the residuals, e, are shown in Table 16.3. Visually, the model seems to fit quite well.
This model allows us to predict the solution density for any given weight percent of ethanol within the experimental data range but not actually part of the data. For example, for x = 7.5, Eq (16.45) estimates ŷ = 0.98647. How the residuals are analyzed is discussed in Section 16.2.10.
Expressions such as the one obtained in this example, Eq (16.45), are sometimes known as calibration curves. Such curves are used to calibrate measurement devices such as thermocouples, where the raw instrument output (say, millivolts) is converted to the actual desired measurement (say, temperature in °C) based on expressions such as the one obtained here. Such expressions are typically generated from standardized experiments where data on instrument output are gathered for various objects with known temperature.
16.2.3 Properties of the Estimators
When experiments are repeated for the same fixed values $x_i$, as a typical consequence of random variation, the corresponding value observed for $Y_i$ will differ each time. The resulting estimates provided in Eqs (16.37) and (16.38) will therefore also change slightly each time. In typical fashion, therefore, the specific parameter estimates $\hat{\theta}_0$ and $\hat{\theta}_1$ are properly considered as realizations of the respective estimators, random variables that depend on the random sample $Y_1, Y_2, \ldots, Y_n$. It is desirable to investigate the theoretical properties of these estimators, defined by:
$$\hat{\theta}_1 = \frac{S_{xY}}{S_{xx}} \quad (16.46)$$
$$\hat{\theta}_0 = \bar{Y} - \hat{\theta}_1\bar{x} \quad (16.47)$$
Let us begin with the expected values of these estimators. From here, we observe that
$$E(\hat{\theta}_1) = E\left(\frac{S_{xY}}{S_{xx}}\right) \quad (16.48)$$
which, from the definitions given above, becomes:
$$E(\hat{\theta}_1) = \frac{1}{S_{xx}}E\left[\sum_{i=1}^{n}Y_i(x_i - \bar{x})\right] \quad (16.49)$$
(because $\sum_{i=1}^{n}\bar{Y}(x_i - \bar{x}) = 0$, since $\bar{Y}$ is a constant); and upon introducing Eq (16.19) for $Y_i$, we obtain:
$$E(\hat{\theta}_1) = \frac{1}{S_{xx}}E\left[\sum_{i=1}^{n}(\theta_1 x_i + \theta_0 + \epsilon_i)(x_i - \bar{x})\right] \quad (16.50)$$
A term-by-term expansion and subsequent simplification results in
$$E(\hat{\theta}_1) = \frac{1}{S_{xx}}E\left[\theta_1\sum_{i=1}^{n}x_i(x_i - \bar{x})\right] \quad (16.51)$$
because $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$ and $E\left[\sum_{i=1}^{n}\epsilon_i(x_i - \bar{x})\right] = 0$ since $E(\epsilon_i) = 0$. Hence, Eq (16.51) simplifies to
$$E(\hat{\theta}_1) = \frac{1}{S_{xx}}\,\theta_1 S_{xx} = \theta_1 \quad (16.52)$$
so that $\hat{\theta}_1$ is unbiased for $\theta_1$; in the same manner, it can be shown that
$$E(\hat{\theta}_0) = \theta_0 \quad (16.53)$$
Following a similar procedure, the variances of these estimators can be shown to be:
$$Var(\hat{\theta}_1) = \sigma^2_{\hat{\theta}_1} = \frac{\sigma^2}{S_{xx}} \quad (16.55)$$
$$Var(\hat{\theta}_0) = \sigma^2_{\hat{\theta}_0} = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right) \quad (16.56)$$
where $\sigma^2$ is the variance of the random error component, $\epsilon$. Consequently, the standard error of each estimate, the positive square root of the variance, is given by:
$$SE(\hat{\theta}_1) = \frac{\sigma}{\sqrt{S_{xx}}} \quad (16.57)$$
$$SE(\hat{\theta}_0) = \sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}} \quad (16.58)$$

16.2.4 Confidence Intervals
As with all estimation problems, the point estimates obtained above for the regression parameters, $\theta_0$ and $\theta_1$, are by themselves insufficient for making decisions about the true but unknown values; we must add a measure of how precise these estimates are. Obtaining interval estimates is one option, and such interval estimates are determined for regression parameters by essentially the same procedure as that presented in Chapter 14 for population parameters. This, of course, requires sampling distributions.
Slope and Intercept Parameters
Under the Gaussian distributional assumption for $\epsilon$, with the implication that the sample $Y_1, Y_2, \ldots, Y_n$ possesses the distribution $N(\theta_0 + \theta_1 x, \sigma^2)$, and from the results obtained above about the characteristics of the estimates, it can be shown that the random variables $\hat{\theta}_1$ and $\hat{\theta}_0$, respectively the slope and the intercept, are distributed as $\hat{\theta}_1 \sim N(\theta_1, \sigma^2_{\hat{\theta}_1})$ and $\hat{\theta}_0 \sim N(\theta_0, \sigma^2_{\hat{\theta}_0})$, with the variances as shown in Eqs (16.55) and (16.56), provided the data variance, $\sigma^2$, is known. However, this variance is not known and must be estimated from data. This is done as follows for this particular problem.

Consider the residual errors, $e_i$, our best estimates of $\epsilon_i$; define the residual sum of squares:
$$SS_E = \sum_{i=1}^{n}e_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \quad (16.59)$$
$$= \sum_{i=1}^{n}\left[y_i - (\hat{\theta}_1 x_i + \hat{\theta}_0)\right]^2 \quad (16.60)$$
$$= S_{yy} - \hat{\theta}_1 S_{xy} \quad (16.61)$$
It can be shown that
$$E(SS_E) = (n-2)\sigma^2 \quad (16.62)$$
so that the mean residual sum of squares,
$$s_e^2 = \frac{SS_E}{n-2} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2} \quad (16.63)$$
is an unbiased estimate of $\sigma^2$.
Now, as with previous statistical inference problems concerning normal populations with unknown $\sigma$, by substituting $s_e^2$, the mean residual sum-of-squares, for $\sigma^2$, we have the following results: the statistics $T_1$ and $T_0$, defined as
$$T_1 = \frac{\hat{\theta}_1 - \theta_1}{s_e/\sqrt{S_{xx}}} \quad (16.64)$$
and
$$T_0 = \frac{\hat{\theta}_0 - \theta_0}{s_e\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}}} \quad (16.65)$$
each possess a $t(n-2)$ distribution. As a result, the $(1-\alpha)\times 100\%$ confidence intervals for the slope and intercept are, respectively:
$$\theta_1 = \hat{\theta}_1 \pm t_{\alpha/2}(n-2)\,\frac{s_e}{\sqrt{S_{xx}}} \quad (16.66)$$
$$\theta_0 = \hat{\theta}_0 \pm t_{\alpha/2}(n-2)\,s_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}} \quad (16.67)$$
For the density and weight percent ethanol problem of Example 16.1, the 95% confidence intervals for the two parameters work out to:
$$\theta_1 = -0.001471 \pm 0.00006607 \quad (16.68)$$
$$\theta_0 = 0.9975 \pm 0.001385 \quad (16.69)$$
Regression Line

The actual regression line fit (see for example Fig 16.4), an estimate of the true but unknown regression line, is obtained by introducing into Eq (16.21) the estimates for the slope and intercept parameters, to give
$$\hat{\mu}_{Y|x} = \hat{\theta}_1 x + \hat{\theta}_0 \quad (16.70)$$
At any specific value of x, the quantity
$$\hat{\mu}_{Y|x} = \hat{\theta}_1 x + \hat{\theta}_0 \quad (16.71)$$
is the estimate of the actual response of Y at this point (akin to the sample average estimate of a true but unknown population mean).
In the same manner in which we obtained confidence intervals for sample averages, we can also obtain a confidence interval for $\hat{\mu}_{Y|x}$. It can be shown from Eq (16.71) (and Eq (16.56)) that the associated variance is:
$$Var(\hat{\mu}_{Y|x}) = \sigma^2\left[\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}\right] \quad (16.72)$$
and, because of the normality of the random variables $\hat{\theta}_0$ and $\hat{\theta}_1$, if $\sigma$ is known, $\hat{\mu}_{Y|x}$ has a normal distribution with mean $(\theta_1 x + \theta_0)$ and the variance shown in Eq (16.72). With $\sigma$ unknown, substituting $s_e$ for it, as in the previous section, leads to the result that the specific statistic,
$$t_{RL} = \frac{\hat{\mu}_{Y|x} - \mu_{Y|x}}{s_e\sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{S_{xx}}}} \quad (16.73)$$
FIGURE 16.5: The fitted regression line to the Density versus Ethanol Weight % data (solid line), along with the 95% confidence interval (dashed lines; S = 0.0008774; R-Sq = 99.8%; R-Sq(adj) = 99.8%). The confidence interval is narrowest at $x = \bar{x}$ and widens for values further away from $\bar{x}$.
possesses a $t(n-2)$ distribution; the $(1-\alpha)\times 100\%$ confidence interval for the regression line is therefore:
$$\mu_{Y|x} = (\hat{\theta}_1 x + \hat{\theta}_0) \pm t_{\alpha/2}(n-2)\,s_e\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}} \quad (16.74)$$
When this confidence interval is computed for all values of x of interest, the result is a confidence interval around the entire regression line. Again, as most statistical analysis software packages have the capability to compute and plot this confidence interval along with the regression line, the primary objective of this discussion is to provide the reader with a fundamental understanding of the theoretical bases for these computer outputs. For example, the 95% confidence interval for the Density-Wt% Ethanol problem in Examples 16.1 and 16.2 is shown in Fig 16.5.
By virtue of the $(x - \bar{x})^2$ term in Eq (16.74), a signature characteristic of these confidence intervals is that they are narrowest when $x = \bar{x}$ and widen for values further away from $\bar{x}$.
16.2.5 Hypothesis Testing
not be zero). This translates to the following hypotheses regarding the slope parameter:
$$H_0: \theta_1 = 0$$
$$H_a: \theta_1 \neq 0 \quad (16.75)$$
And from the preceding discussion regarding confidence intervals, the appropriate test statistic for this test, from Eq (16.64), is:
$$t_1 = \frac{\hat{\theta}_1}{s_e/\sqrt{S_{xx}}} \quad (16.76)$$
since the postulated value for the unknown $\theta_1$ is 0; and the decision to reject or not reject $H_0$ follows the standard two-sided t-test criteria; i.e., at the significance level $\alpha$, $H_0$ is rejected when
$$t_1 < -t_{\alpha/2}(n-2) \quad \text{or} \quad t_1 > t_{\alpha/2}(n-2) \quad (16.77)$$
A similar test can be carried out for the intercept parameter, with the hypotheses
$$H_0: \theta_0 = 0; \quad H_a: \theta_0 \neq 0 \quad (16.78)$$
and the test statistic
$$t_0 = \frac{\hat{\theta}_0}{s_e\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}}} \quad (16.79)$$
with the same two-sided rejection criteria at the significance level $\alpha$:
$$t_0 < -t_{\alpha/2}(n-2) \quad \text{or} \quad t_0 > t_{\alpha/2}(n-2) \quad (16.80)$$
TABLE 16.4: Cranial circumference (cm) and finger length (cm) measurements for 16 individuals.
(16.81)
and the model fit to the data is shown in Fig 16.6. (Again, we defer until the appropriate place any comment on the terms included in the last line of the MINITAB output.)
It is important to note how, rather than clustering tightly around the regression line, the data show instead a significant amount of scatter, which, at least visually, calls into question the postulated dependence of cranial circumference on finger length. This question is settled concretely by the computed T statistics for the model parameters and the
FIGURE 16.6: The fitted straight line to the Cranial circumference versus Finger length data (S = 2.49655; R-Sq = 4.7%; R-Sq(adj) = 0.0%). Note how the data points are widely scattered around the fitted regression line. (The additional terms included in the graph, S, R-Sq and R-Sq(adj), are discussed later.)
associated p-values. The p-value of 0.025 associated with the constant (intercept parameter, $\theta_0$) indicates that we must reject the null hypothesis that $\theta_0 = 0$ in favor of the alternative that the estimated value, 43.0, is significantly different from zero, at the 5% significance level. On the other hand, the corresponding p-value associated with $\theta_1$, the coefficient of x, the finger length (i.e., the regression line slope), is 0.422, indicating that there is no evidence to reject the null hypothesis. Thus, at the 5% significance level, $\theta_1$ is not significantly different from zero, and we therefore conclude that there is no discernible relationship between cranial circumference and finger length.

Thus, the implication of the significance of the constant term, and the non-significance of the coefficient of the finger length, is two-fold: (i) that cranial circumference does not depend on finger length (at least for the 16 individuals in this study), so that the observed variability is purely random, with no systematic component that can be explained by finger length; and consequently, (ii) that the cranial circumference is best characterized for this population of individuals by the mean value (43.0 cm), a value that is significantly different from zero (as one would certainly expect!)
16.2.6 Prediction Intervals

When the regression model is used to predict a new response at a given value $x^*$ of the predictor, the actual future observation will be
$$Y(x^*) = \theta_1 x^* + \theta_0 + \epsilon \quad (16.82)$$
while the model prediction is
$$\hat{Y}(x^*) = \hat{\theta}_1 x^* + \hat{\theta}_0 \quad (16.83)$$
The prediction error is therefore
$$E_p = Y(x^*) - \hat{Y}(x^*) \quad (16.84)$$
which, under the normality assumption, possesses the distribution $N(0, \sigma^2_{E_p})$, with the variance obtained from Eq (16.84) as
$$\sigma^2_{E_p} = Var[Y(x^*)] + Var[\hat{Y}(x^*)]$$
$$= \sigma^2 + \sigma^2\left[\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right] \quad (16.85)$$
$$= \sigma^2\left[1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right] \quad (16.86)$$
As with the confidence intervals around the regression line, these prediction intervals are narrowest when $x = \bar{x}$ and widen for values further away from $\bar{x}$, but they are consistently wider than the confidence intervals.
Example 16.4: HIGHWAY GASOLINE MILEAGE AND ENGINE CAPACITY FOR TWO-SEATER AUTOMOBILES
From the data shown in Table 12.5 of Chapter 12 on gasoline mileage for a collection of two-seater cars, postulate a linear two-parameter model as in Eq (16.19) for highway mileage (y) as a function of the engine capacity, x; obtain least-squares estimates of the parameters for all the cars, leaving out the Chevrolet Corvette and the Dodge Viper data (these cars were identified in that chapter as different from the others in the class because of the material used for their bodies). Show a plot of the fitted regression line, the 95% confidence interval and the 95% prediction interval.
Solution:
Using MINITAB for this problem produces the following results:
Regression Analysis: MPGHighway versus EngCapacity
The regression equation is
FIGURE 16.7: The fitted straight line to the Highway MPG versus Engine Capacity data of Table 12.5, leaving out the two inconsistent data points (MPGHighway = 33.15 - 2.739 EngCapacity; S = 1.95167; R-Sq = 86.8%; R-Sq(adj) = 86.0%), along with the 95% confidence interval (long dashed line) and the 95% prediction interval (short dashed line). (Again, the additional terms included in the graph, S, R-Sq and R-Sq(adj), are discussed later.)
MPGHighway = 33.15 - 2.739 EngCapacity        (16.87)
and, since the p-values associated with each parameter are both zero to three decimal places, we conclude that these parameters are significant. The implication is that for every liter increase in engine capacity, the average two-seater car is expected to lose about 2 3/4 miles per gallon on the highway. (As before, we defer until later any comment on the terms in the last line of the MINITAB output.)

The model fit to the data is shown in Fig 16.7, along with the required 95% confidence interval (CI) and the 95% prediction interval (PI). Note how much wider the PI is than the CI at every value of x.
16.2.7 Overall Model Effectiveness
Beyond hypothesis tests to determine the significance of individual estimated parameters, other techniques exist for assessing the overall effectiveness of the regression model, based on measures of how much of the total variability in the data has been captured (or explained) by the model.
Orthogonal Decomposition of Variability
The total variability present in the data, represented by $\sum_{i=1}^{n}(y_i - \bar{y})^2$ and defined as $S_{yy}$ in Eq (16.32), may be rearranged as follows, merely by adding and subtracting $\hat{y}_i$:
$$S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}\left[(y_i - \hat{y}_i) - (\bar{y} - \hat{y}_i)\right]^2 \quad (16.88)$$
Upon expanding and simplifying (see Exercise 16.9), one obtains the very important expression:
$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
$$\text{or } S_{yy} = SS_R + SS_E \quad (16.89)$$
where we have recalled that the second term on the RHS of the equation is the residual error sum of squares defined in Eq (16.59), and have introduced the term $SS_R$ to represent the regression sum of squares,
$$SS_R = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \quad (16.90)$$
If we now define the vectors $\mathbf{y}$, $\hat{\mathbf{y}}$ and $\mathbf{e}$ in the usual fashion, Eq (16.89) has the vector representation
$$(\mathbf{y} - \bar{y}\mathbf{1}) = (\hat{\mathbf{y}} - \bar{y}\mathbf{1}) + \mathbf{e} \quad (16.91)$$
with the corresponding sums of squares:
$$\|\mathbf{y} - \bar{y}\mathbf{1}\|^2 = S_{yy} \quad (16.92)$$
$$\|\hat{\mathbf{y}} - \bar{y}\mathbf{1}\|^2 = SS_R \quad (16.93)$$
$$\|\mathbf{e}\|^2 = SS_E \quad (16.94)$$
with the very important implication that, as a result of the vector representation in Eq (16.91), the expression in Eq (16.89) is an orthogonal decomposition of the data variability vector, reminiscent of Pythagoras' Theorem. (If the vector sum in Eq (16.91) holds simultaneously with the corresponding sums-of-squares expression in Eq (16.89), then the vector $(\hat{\mathbf{y}} - \bar{y}\mathbf{1})$ must be orthogonal to the vector $\mathbf{e}$.)
Eq (16.89) is in fact known as the analysis of variance (ANOVA) identity, and it plays a central role in statistical inference that transcends the restricted role observed here in regression analysis. We shall have cause to revisit this subject in our discussion of the design of experiments in upcoming chapters. For now, we use it to assess the effectiveness of the overall regression model (as a single entity purporting to represent the information contained in the data), first in the form of the coefficient of determination, and later as the basis for an F-test of significance. This latter exercise will constitute a preview of an upcoming, more general discussion of ANOVA.
R², The Coefficient of Determination

Let us now consider the ratio defined as:
$$R^2 = \frac{SS_R}{S_{yy}} \quad (16.95)$$
which represents the proportion of the total data variability (around the mean $\bar{y}$) that has been captured by the regression model; its complement,
$$1 - R^2 = SS_E/S_{yy} \quad (16.96)$$
fully later). A somewhat more judicious assessment of model adequacy requires adjusting the value of $R^2$ to reflect the number of parameters that have been used by the model to capture the variability.

By recasting the expression in Eq (16.95) in the equivalent form:
$$R^2 = 1 - \frac{SS_E}{S_{yy}} \quad (16.97)$$
rather than base the metric on the indicated absolute sums of squares, consider using the mean sums of squares instead. In other words, instead of the total residual error sum of squares, $SS_E$, we employ the mean residual error sum of squares, $SS_E/(n-p)$, where p is the number of parameters in the model and n is the total number of experimental data points; also, instead of the total data sum of squares, $S_{yy}$, we employ the data variance, $S_{yy}/(n-1)$. The resulting quantity, known as $R^2_{adj}$, is defined as:
$$R^2_{adj} = 1 - \frac{SS_E/(n-p)}{S_{yy}/(n-1)} \quad (16.98)$$
To set the stage for the F-test, recall the ANOVA identity,
$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \quad \text{i.e., } S_{yy} = SS_R + SS_E \quad (16.99)$$
along with the expected values of the two sums of squares:
$$E(SS_E) = (n-2)\sigma^2$$
$$E(SS_R) = \theta_1^2 S_{xx} + \sigma^2 \quad (16.100)$$
TABLE 16.5: ANOVA table for testing significance of the regression

Source       Sum of Squares   DF      Mean Square           F
Regression   SS_R             1       MS_R = SS_R/1         F = MS_R/MS_E
Error        SS_E             n - 2   MS_E = SS_E/(n - 2)
Total        S_yy             n - 1
In testing $H_0: \theta_1 = 0$, we compute the test statistic
$$F = \frac{MS_R}{MS_E} \quad (16.102)$$
which, under $H_0$, possesses an $F(1, n-2)$ distribution, along with the associated p-value,
$$p = P\left(F(1, n-2) > F\right) \quad (16.103)$$
and reject or fail to reject the null hypothesis on the basis of the actual p-value. These results are typically presented in what is referred to as an ANOVA Table, as shown in Table 16.5. They are used to carry out F-tests for the significance of the entire regression model as a single entity: if the resulting p-value is low, we reject the null hypothesis and conclude that the regression is significant, i.e., the relationship implied by the regression model is meaningful. Alternatively, if the p-value exceeds a pre-specified threshold (say, 0.05), we fail to reject the null hypothesis and conclude that the regression model is not significant, and that the implied relationship is purely random.

All computer programs that perform regression analysis produce such ANOVA tables. For example, the MINITAB output for Example 16.4 above (involving the regression model relating engine capacity to the highway mpg rating) includes the following ANOVA table.
Analysis of Variance
Source          DF       SS       MS       F        P
Regression       1    402.17   402.17   105.58   0.000
Residual Error  16     60.94     3.81
Total           17    463.11
The indicated p-value of 0.000 implies that we must reject the null hypothesis and conclude that the regression model is significant. On the other hand, the ANOVA table produced by MINITAB for the cranial circumference versus finger length regression problem of Example 16.3 is as shown below:
Analysis of Variance
Source          DF       SS      MS      F       P
Regression       1     4.259   4.259   0.68   0.422
Residual Error  14    87.259   6.233
Total           15    91.517
In this case, the p-value associated with the F-test is so high (0.422) that we fail to reject the null hypothesis, and we conclude that the regression is not significant.
Of course, these conclusions agree perfectly with our earlier conclusions concerning each of these problems.

In general, we tend to de-emphasize these ANOVA-based F-tests for significance of the regression. This is for the simple reason that they are coarse tests of the overall regression model, adding little or nothing to the individual t-tests presented earlier for each parameter. These individual parameter tests are preferred because they are finer-grained. From this point on, we will no longer refer to these ANOVA tests of significance for regression. Nevertheless, these same concepts take center stage in Chapter 19, where they are central to the analysis of designed experiments.
16.2.8 Relation to the Correlation Coefficient

Recall that the correlation coefficient between two random variables X and Y is defined as
$$\rho = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X\sigma_Y} \quad (16.104)$$
The sample version, obtained from data, known as the Pearson product-moment correlation coefficient, is:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \quad (16.105)$$
If we now recall the expressions in Eqs (16.31)-(16.33), we immediately obtain, in the context of regression analysis:
$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} \quad (16.106)$$
And now, in terms of the slope parameter estimate, we obtain from Eq (16.37), first that
$$r = \hat{\theta}_1\sqrt{\frac{S_{xx}}{S_{yy}}} \quad (16.107)$$
an expression we shall return to shortly. For now, let us return to the expression for $R^2$ in Eq (16.97); if we introduce Eq (16.61) for $SS_E$, we obtain
$$R^2 = 1 - \frac{S_{yy} - \hat{\theta}_1 S_{xy}}{S_{yy}} = \frac{\hat{\theta}_1 S_{xy}}{S_{yy}} \quad (16.108)$$
and, upon introducing Eq (16.37) for $\hat{\theta}_1$,
$$R^2 = \frac{(S_{xy})^2}{S_{xx}S_{yy}} \quad (16.109)$$
which, when compared with Eq (16.106), establishes the important result that $R^2$, the coefficient of determination, is the square of the sample correlation coefficient, r; i.e.,
$$R^2 = r^2 \quad (16.110)$$
16.2.9 Mean-Centered Model

In terms of the correlation coefficient, the fitted two-parameter model may be written in the mean-centered form
$$\frac{(\hat{y} - \bar{y})}{s_y} = r\,\frac{(x - \bar{x})}{s_x} \quad (16.114)$$
where $s_y = \sqrt{S_{yy}/(n-1)}$ and $s_x = \sqrt{S_{xx}/(n-1)}$ are, respectively, sample estimates of the data standard deviation for y and for x. Alternatively, Eq (16.114) could equivalently be written as
$$(\hat{y} - \bar{y}) = \pm\sqrt{R^2}\,\sqrt{\frac{S_{yy}}{S_{xx}}}\,(x - \bar{x}) \quad (16.115)$$
This equation provides the clearest indication of the impact of $R^2$ on how strongly the mean-centered value of the predictor, x, is connected by the model to, and hence can be used to estimate, the mean-centered response. Observe that in the density and weight percent ethanol example, with $R^2 = 0.998$, the connection between the predictor and the response estimate is particularly strong; with the cranial circumference-finger length example, the connection is extremely weak, and the best estimate of the response (cranial circumference) for any value of the predictor (finger length) is essentially the mean value, $\bar{y}$.
16.2.10 Residual Analysis
TABLE 16.6: Thermal conductivity measurements at various temperatures for a metal

k (W/m·°C)   Temperature (°C)
93.228        100
92.563        150
99.409        200
101.590       250
111.535       300
115.874       350
119.390       400
126.615       450
many cases, however, the available data are usually modest. Regardless of the data size, plots of the residuals themselves versus the fitted value, $\hat{y}$, or versus data order, or versus x, are not only capable of indicating model adequacy; they also provide clues about the nature of the implied model inadequacy.
It is often recommended that residual plots be based not on the residuals themselves, but on the standardized residuals,
$$e_i^* = \frac{e_i}{s_e} \quad (16.116)$$
k = 79.6 + 0.102 Temperature

Predictor        Coef       SE Coef      T       P
Constant        79.555      2.192      36.29   0.000
Temperature      0.101710   0.007359   13.82   0.000
(16.117)
The p-values associated with each parameter are zero, implying that both parameters are significantly different from zero. The estimate of the data standard deviation is shown as S; and the $R^2$ and $R^2_{adj}$ values indicate that the model captures a reasonable amount of the variability in the data.
The actual model fit to the data is shown in the top panel of Fig 16.8, while the standardized residuals versus the fitted value, $\hat{y}_i$, are shown in the bottom panel. With only 8 data points, there are not enough residuals for a histogram plot. Nevertheless, upon visual examination of the residual plots, there appears to be no discernible pattern, nor is there any reason to believe that the residuals are anything but purely random. Note that no standardized residual value exceeds 2. The model is therefore considered to provide a reasonable representation of how the thermal conductivity of this metal varies with temperature.
FIGURE 16.8: Top: Fitted straight line to the Thermal conductivity (k) versus Temperature (°C) data in Table 16.6 (S = 2.38470; R-Sq = 97.0%; R-Sq(adj) = 96.4%); Bottom: standardized residuals versus fitted value, ŷi.
Therefore, as before, the fitted regression line equation is
$$\hat{y} = 39.45x - 172.8 \quad (16.118)$$
Such a model is a bit more complicated, but the residual structure seems
to suggest that the additional term is warranted. How to obtain a model
of this kind is discussed shortly.
16.3 Intrinsically Linear Regression

16.3.1 Linearity in Regression
FIGURE 16.9: Modeling the dependence of the boiling points (BP) of hydrocarbon compounds in Table 16.1 on the number of carbon atoms in the compound (S = 17.0142; R-Sq = 97.4%; R-Sq(adj) = 97.0%). Top: Fitted straight line of BP versus n, the number of carbon atoms; Bottom: standardized residuals versus fitted value ŷi. Notice the distinctive quadratic structure left over in the residuals, exposing the linear model's over-estimation at the extremes and under-estimation in the middle.
Consider, for example, the exponential model for how the concentration of a reactive species decays with time,
$$C(t) = \theta_0 e^{\theta_1 t} \quad (16.122)$$
The term linear in linear regression refers to linearity in the unknown parameters, not in the independent variable; whether a problem is linear or nonlinear is determined from the sensitivity of the model function $\eta(x, \theta)$ to each parameter,
$$S_i = \frac{\partial\eta(x, \theta)}{\partial\theta_i} \quad (16.123)$$
If $S_i$ is free of $\theta_i$, the model is linear in that parameter; otherwise the regression problem is nonlinear. As an example, classify each of the following models as presenting a linear or a nonlinear regression problem:
$$(1)\quad Y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \epsilon \quad (16.124)$$
$$(2)\quad Y = \theta_1\theta_2^x + \epsilon \quad (16.125)$$
$$(3)\quad Y = \theta_1 e^{x_1} + \theta_2\ln x_2 + \frac{\theta_3}{x_3} + \epsilon \quad (16.126)$$
Solution:
Model (1) presents a linear regression problem because each of the sensitivities, $S_0 = 1$; $S_1 = x$; $S_2 = x^2$; and $S_3 = x^3$, is free of the unknown parameter on which it is based, i.e., $S_i$ is not a function of $\theta_i$ for $i = 0, 1, 2, 3$. In fact, all the sensitivities are entirely free of all parameters.

Model (2), on the other hand, presents a nonlinear regression problem: $S_1 = \theta_2^x$ does not depend on $\theta_1$, but
$$S_2 = \theta_1 x\,\theta_2^{x-1} \quad (16.127)$$
depends on $\theta_2$. Thus, while this model is linear in $\theta_1$ (because the sensitivity to $\theta_1$ does not depend on $\theta_1$), it is nonlinear in $\theta_2$; it therefore presents a nonlinear regression problem.

Model (3) presents a linear regression problem: $S_1 = e^{x_1}$, $S_2 = \ln x_2$ and $S_3 = 1/x_3$ are all entirely free of unknown parameters.
16.3.2 Variable Transformations
Taking natural logarithms of both sides of Eq (16.122) yields
$$\ln C(t) = \ln\theta_0 + \theta_1 t \quad (16.128)$$
In this case, observe that if we now let $Y = \ln C(t)$, and let $\theta_0^* = \ln\theta_0$, then Eq (16.128) represents a linear regression model.
Such cases abound in chemical engineering. For example, the equilibrium vapor mole fraction, y, as a function of the liquid mole fraction, x, of a compound with relative volatility $\alpha$, is given by the expression:
$$y = \frac{\alpha x}{1 + (\alpha - 1)x} \quad (16.129)$$
which, upon taking reciprocals of both sides, becomes linear in the transformed variables $1/y$ and $1/x$:
$$\frac{1}{y} = \frac{1}{\alpha}\,\frac{1}{x} + \frac{\alpha - 1}{\alpha} \quad (16.130)$$
reason that, in virtually all cases, if the error term is additive, then even the obvious transformations are no longer possible. For example, if each actual concentration measurement, $C(t_i)$, observed at time $t_i$, has associated with it the additive error term $\epsilon_i$, then Eq (16.122) must be rewritten as
$$C(t_i) = \theta_0 e^{\theta_1 t_i} + \epsilon_i \quad (16.131)$$
and taking logarithms no longer produces a linear model. For the transformed equation
$$\ln C(t_i) = \ln\theta_0 + \theta_1 t_i + \epsilon_i \quad (16.132)$$
to hold, the error must enter the original model as $C(t_i) = \theta_0 e^{\theta_1 t_i}e^{\epsilon_i}$. But this implies that the error is multiplicative in the original variable. It is important, before taking such a step, to take time to consider whether such a multiplicative error structure is reasonable or not.
Thus, in employing transformations to deal with these so-called intrinsically linear models, the most important issue lies in determining the proper error structure. Such transformations should be used with care; alternatively, the parameter estimates obtained from such an exercise should be considered as approximations that may require further refinement using more advanced nonlinear regression techniques. Notwithstanding, many engineering problems involving models of this kind have benefited from the sort of linearizing transformations discussed here.
16.4 Multiple Linear Regression

When the response variable Y depends on not one but m predictor variables, $x_1, x_2, \ldots, x_m$, the simple linear regression model generalizes to
$$Y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_m x_m + \epsilon \quad (16.133)$$

16.4.1 Least Squares Estimation

As in the simple case, the method of least squares seeks the parameter values that minimize the sum-of-squares function
$$S(\boldsymbol{\theta}) = \sum_{i=1}^{n}\left[y_i - (\theta_0 + \theta_1 x_{1i} + \theta_2 x_{2i} + \cdots + \theta_m x_{mi})\right]^2 \quad (16.137)$$
The technique calls for taking derivatives with respect to each parameter, setting the derivative to zero and solving the resultant equations for the unknown
parameters, i.e.,
$$\frac{\partial S}{\partial\theta_0} = -2\sum_{i=1}^{n}\left[y_i - (\theta_0 + \theta_1 x_{1i} + \theta_2 x_{2i} + \cdots + \theta_m x_{mi})\right] = 0 \quad (16.138)$$
and, for $1 \le j \le m$,
$$\frac{\partial S}{\partial\theta_j} = -2\sum_{i=1}^{n}x_{ji}\left[y_i - (\theta_0 + \theta_1 x_{1i} + \theta_2 x_{2i} + \cdots + \theta_m x_{mi})\right] = 0 \quad (16.139)$$
These expressions rearrange to give the general linear regression normal equations:
$$\sum_{i=1}^{n}y_i = \theta_0 n + \theta_1\sum_{i=1}^{n}x_{1i} + \theta_2\sum_{i=1}^{n}x_{2i} + \cdots + \theta_m\sum_{i=1}^{n}x_{mi}$$
$$\sum_{i=1}^{n}y_i x_{1i} = \theta_0\sum_{i=1}^{n}x_{1i} + \theta_1\sum_{i=1}^{n}x_{1i}^2 + \theta_2\sum_{i=1}^{n}x_{2i}x_{1i} + \cdots + \theta_m\sum_{i=1}^{n}x_{mi}x_{1i}$$
$$\vdots$$
$$\sum_{i=1}^{n}y_i x_{ji} = \theta_0\sum_{i=1}^{n}x_{ji} + \theta_1\sum_{i=1}^{n}x_{1i}x_{ji} + \theta_2\sum_{i=1}^{n}x_{2i}x_{ji} + \cdots + \theta_m\sum_{i=1}^{n}x_{mi}x_{ji}$$
16.4.2 Matrix Methods
In matrix form, the n observations of the multiple linear regression model may be written as
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1m} \\ 1 & x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix}\begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_m \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \quad (16.140)$$
or, more compactly,
$$\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \boldsymbol{\epsilon} \quad (16.141)$$
the solution of which is the least-squares estimate:
$$\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \quad (16.145)$$
Introducing Eq (16.141) for y gives
$$\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta} + \boldsymbol{\epsilon}) = \boldsymbol{\theta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon} \quad (16.146)$$
We may now use this expression to obtain the mean and variance of these estimates as follows. First, by taking expectations, we obtain:
$$E(\hat{\boldsymbol{\theta}}) = \boldsymbol{\theta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E(\boldsymbol{\epsilon}) = \boldsymbol{\theta} \quad (16.147)$$
because X is known and $E(\boldsymbol{\epsilon}) = \mathbf{0}$. Thus, the least-squares estimate $\hat{\boldsymbol{\theta}}$ as given in Eq (16.145) is unbiased for $\boldsymbol{\theta}$. As for the covariance of the estimates: first, by definition, $E(\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T)$ is the random error covariance matrix; then, from the assumption that each $\epsilon_i$ is independent and identically distributed, with the same variance $\sigma^2$, we have that
$$E(\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T) = \sigma^2\mathbf{I} \quad (16.148)$$
where $\sigma^2$ is the variance associated with each random error, and I is the identity matrix. As a result,
$$Var(\hat{\boldsymbol{\theta}}) = E\left[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^T\right] \quad (16.149)$$
$$Var(\hat{\boldsymbol{\theta}}) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E(\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T)\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}$$
$$= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\sigma^2\mathbf{I}\,\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \quad (16.150)$$
$$= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} \quad (16.151)$$
(16.151)
= X
y
(16.152)
e=yy
(16.153)
(16.154)
eT e
np
(16.155)
is an unbiased estimate of .
Thus, following the typical normality assumption on the random error component of the regression model, we now conclude that the least-squares estimate vector, $\hat{\boldsymbol{\theta}}$, has a multivariate normal distribution, $MVN(\boldsymbol{\theta}, \boldsymbol{\Sigma})$, with the covariance matrix as given in Eq (16.151). This fact is used to test hypotheses regarding the significance or otherwise of the parameters in precisely the same manner as before. The t-statistic arises directly from substituting the data estimate $s_e$ for $\sigma$ in Eq (16.151).
Thus, when cast in matrix form, the multiple regression problem is seen to be merely a higher-dimensional form of the earlier simple linear regression problem; the model equations are structurally similar:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \boldsymbol{\epsilon}$$
$$y = \theta x + \epsilon$$
as are the corresponding least-squares estimates:
$$\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \quad \text{or} \quad \hat{\theta} = \frac{\sum_{i=1}^{n}x_i y_i}{\sum_{i=1}^{n}x_i^2} \quad (16.156)$$
The computations for multiple regression problems become rapidly more complex, but all the results obtained earlier for the simpler regression problem transfer virtually intact, including hypothesis tests of significance for the parameters and the values of the coefficient of determination, $R^2$ (and its adjusted variant, $R^2_{adj}$), for assessing the model's adequacy in capturing the data information. Fortunately, these computations are routinely carried out very conveniently by computer software packages. Nevertheless, the reader is reminded that the availability of these computer programs has relieved us only of the computational burden; the task of understanding what these computations are based on remains very much an important responsibility of the practitioner.
Residuals Analysis
The residuals in multiple linear regression are given by Eq (16.153) above.
Obtaining the standardized version of the residuals in this case requires the introduction of a new matrix, H, the so-called hat matrix. If we introduce the least-squares estimate into the expression for the vector of model estimates in Eq (16.152), we obtain:
$$\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{H}\mathbf{y} \quad (16.157)$$
where
$$\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \quad (16.158)$$
(16.158)
is called the hat matrix because it relates the actual observations vector,
. This matrix has some unique characteristics:
y, to the vector of model ts, y
for example, it is an idempotent matrix, meaning that
HH = H
(16.159)
(16.160)
FIGURE 16.10: Surface plot of Yield versus Temp and Pressure for the data of Table 16.7.
The standardized residuals are then obtained as
$$e_i^* = \frac{e_i}{s_e\sqrt{1 - h_{ii}}} \quad (16.161)$$
where $h_{ii}$ is the i-th diagonal element of H. These standardized residuals are the exact equivalents of the ones shown in Eq (16.116) for the simple linear regression case. The next example illustrates an application of these results.
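A short numpy sketch of the hat matrix and the standardized residuals (again our own illustration, written directly from Eqs (16.157)-(16.161), building on the ols function above):

def standardized_residuals(X, y):
    """Standardized residuals from the hat matrix, Eqs (16.157)-(16.161)."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix, Eq (16.158)
    e = y - H @ y                          # residuals, since y_hat = H y
    n, p = X.shape
    se = np.sqrt((e @ e) / (n - p))
    h = np.diag(H)                         # leverages h_ii
    return e / (se * np.sqrt(1.0 - h))     # Eq (16.161)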
TABLE 16.7: Laboratory experimental data on Yield obtained from a catalytic process at various temperatures and pressures

Yield (%)   Temp (°C)   Pressure (Atm)
86.8284      85          1.25
87.4136      90          1.25
86.2096      95          1.25
87.8780     100          1.25
86.9892      85          1.50
86.8632      90          1.50
86.8389      95          1.50
88.0432     100          1.50
86.8420      85          1.75
89.3775      90          1.75
87.6432      95          1.75
90.0723     100          1.75
88.8353      85          2.00
88.4265      90          2.00
90.1930      95          2.00
89.0571     100          2.00
85.9974      85          1.25
86.1209      90          1.25
85.8819      95          1.25
88.4381     100          1.25
87.8307      85          1.50
89.2073      90          1.50
87.2984      95          1.50
88.5071     100          1.50
90.1824      85          1.75
86.8078      90          1.75
89.1249      95          1.75
88.7684     100          1.75
88.2137      85          2.00
88.2571      90          2.00
89.9551      95          2.00
90.8301     100          2.00
S = 0.941538 R-Sq = 55.1 % R-Sq(adj) = 52.0%
Thus, the fitted regression equation is, in this case,
$$\hat{y} = 75.9 + 0.0757x_1 + 3.21x_2 \quad (16.162)$$
16.4.3 Some Important Special Cases

Weighted Least Squares

When the observations do not all carry equal weight (for example, when the error variances differ from observation to observation), a weighting matrix W may be introduced; the weighted sum-of-squares criterion
$$S_W(\boldsymbol{\theta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T\mathbf{W}^T\mathbf{W}(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) \quad (16.164)$$
is minimized by
$$\hat{\boldsymbol{\theta}}_{WLS} = (\mathbf{X}^T\mathbf{W}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}^T\mathbf{W}\mathbf{y} \quad (16.165)$$
FIGURE 16.11: Catalytic process yield data of Table 16.7. Top: Fitted plane of Yield as a function of Temperature and Pressure; Bottom: standardized residuals versus fitted value ŷi. Nothing appears unusual about these residuals.
Constrained Least Squares

Sometimes the parameters are known to satisfy a set of linear equality constraints of the form
$$\mathbf{L}\boldsymbol{\theta} = \mathbf{v} \quad (16.170)$$
where L is a known matrix and v a known vector. When subject to such constraints, obtaining the least squares estimate of the parameters in the model Eq (16.141) now requires attaching these constraint equations to the original problem of minimizing the sum-of-squares function $S(\boldsymbol{\theta})$. It can be shown that when the standard tools of Lagrange multipliers are used to solve this constrained optimization problem, the solution is:
$$\hat{\boldsymbol{\theta}}_{CLS} = \hat{\boldsymbol{\theta}} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{L}^T\left[\mathbf{L}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{L}^T\right]^{-1}(\mathbf{v} - \mathbf{L}\hat{\boldsymbol{\theta}}) \quad (16.171)$$
the constrained least squares (CLS) estimate, where $\hat{\boldsymbol{\theta}}$ is the normal, unconstrained least-squares estimate in Eq (16.145).
If we define a gain matrix:
$$\boldsymbol{\Gamma} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{L}^T\left[\mathbf{L}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{L}^T\right]^{-1} \quad (16.172)$$
then, Eq (16.171) may be rearranged to:
$$\hat{\boldsymbol{\theta}}_{CLS} = \hat{\boldsymbol{\theta}} + \boldsymbol{\Gamma}(\mathbf{v} - \mathbf{L}\hat{\boldsymbol{\theta}}) \quad (16.173)$$
$$\text{or} \quad \hat{\boldsymbol{\theta}}_{CLS} = \boldsymbol{\Gamma}\mathbf{v} + (\mathbf{I} - \boldsymbol{\Gamma}\mathbf{L})\hat{\boldsymbol{\theta}} \quad (16.174)$$
where the former (as in Eq (16.171)) emphasizes how the constraints provide a correction to the unconstrained estimate, and the latter emphasizes that $\hat{\boldsymbol{\theta}}_{CLS}$ is a compromise between the unconstrained estimates and the constraints.
Ridge Regression

The ordinary least squares estimate, $\hat{\boldsymbol{\theta}}$, given in Eq (16.145), will be unacceptable for ill-conditioned problems for which $\mathbf{X}^T\mathbf{X}$ is nearly singular, typically because the determinant $|\mathbf{X}^T\mathbf{X}| \approx 0$. This will occur, for example, when there is near-linear dependence among some of the predictor variables, $x_i$, or when some of the $x_i$ variables are orders of magnitude different from others and the problem has not been re-scaled accordingly. The problem created by ill-conditioning manifests in the form of overly inflated values for the elements of the matrix inverse $(\mathbf{X}^T\mathbf{X})^{-1}$, as a result of the vanishingly small determinant. Consequently, the norm of the estimate vector, $\hat{\boldsymbol{\theta}}$, will be too large, and the uncertainty associated with the estimates (see Eq (16.151)) will be unacceptably large.
One solution is to augment the original model equation as follows:
$$\begin{bmatrix}\mathbf{y} \\ \mathbf{0}\end{bmatrix} = \begin{bmatrix}\mathbf{X} \\ k\mathbf{I}\end{bmatrix}\boldsymbol{\theta} + \begin{bmatrix}\boldsymbol{\epsilon} \\ \boldsymbol{\epsilon}^*\end{bmatrix} \quad (16.175)$$
or,
$$\tilde{\mathbf{y}} = \tilde{\mathbf{X}}\boldsymbol{\theta} + \tilde{\boldsymbol{\epsilon}} \quad (16.176)$$
the ordinary least-squares solution of which is the ridge regression estimate:
$$\hat{\boldsymbol{\theta}}_{RR} = (\mathbf{X}^T\mathbf{X} + k^2\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} \quad (16.177)$$
16.4.4 Recursive Least Squares

Problem Formulation
A case often arises in engineering where the experimental data used to estimate parameters in the model in Eq (16.141) are available sequentially. Suppose that, after accumulating a set of n observations, $y_i; i = 1, 2, \ldots, n$, and subsequently using this n-dimensional data vector, $\mathbf{y}_n$, we obtain the parameter estimates as:
$$\hat{\boldsymbol{\theta}}_n = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}_n \quad (16.178)$$
and that a new, single observation $y_{n+1}$ then becomes available. This new data can be combined with past information to obtain an updated estimate that reflects the additional information about the parameters contained in the new data, information represented by:
$$y_{n+1} = \mathbf{x}_{n+1}^T\boldsymbol{\theta} + \epsilon_{n+1} \quad (16.179)$$
Appending this to the original data set gives
$$\begin{bmatrix}\mathbf{y}_n \\ y_{n+1}\end{bmatrix} = \begin{bmatrix}\mathbf{X} \\ \mathbf{x}_{n+1}^T\end{bmatrix}\boldsymbol{\theta} + \begin{bmatrix}\boldsymbol{\epsilon}_n \\ \epsilon_{n+1}\end{bmatrix} \quad (16.180)$$
or, more compactly,
$$\mathbf{y}_{n+1} = \mathbf{X}_{n+1}\boldsymbol{\theta} + \boldsymbol{\epsilon}_{n+1} \quad (16.181)$$
Again, in principle, we can repeat this exercise each time a new observation
becomes available. However, such a strategy requires that we recompute the
2 Hoerl, A.E., and R.W. Kennard, (1970). "Ridge regression: Biased estimation for nonorthogonal problems," Technometrics, 12, 55-67.
3 Hoerl, A.E., and R.W. Kennard, (1970). "Ridge regression: Applications to nonorthogonal problems," Technometrics, 12, 69-82.
4 Marquardt, D.W., (1970). "Generalized inverses, Ridge regression, Biased linear estimation, and Nonlinear estimation," Technometrics, 12, 591-612.
estimates from scratch every time, as if for the first time. While it is true that the indicated computational burden is routinely borne nowadays by computers, the fact that the information is available recursively raises a fundamental question: instead of having to recompute the estimate $\hat{\boldsymbol{\theta}}_{n+1}$ all over again as in Eq (16.182) every time new information is available, is it possible to determine it by judiciously updating $\hat{\boldsymbol{\theta}}_n$ directly with the new information? The answer is provided by the recursive least-squares technique, whereby $\hat{\boldsymbol{\theta}}_{n+1}$ is obtained by updating $\hat{\boldsymbol{\theta}}_n$.
Now, let us define
$$\mathbf{P}_{n+1} = \left[\mathbf{X}^T\mathbf{X} + \mathbf{x}_{n+1}\mathbf{x}_{n+1}^T\right]^{-1} \quad (16.183)$$
so that
$$\mathbf{P}_{n+1}^{-1} = \mathbf{X}^T\mathbf{X} + \mathbf{x}_{n+1}\mathbf{x}_{n+1}^T \quad (16.186)$$
In terms of these quantities, the least-squares estimate based on all n + 1 observations may be written as
$$\hat{\boldsymbol{\theta}}_{n+1} = \mathbf{P}_{n+1}\mathbf{X}^T\mathbf{y}_n + \mathbf{P}_{n+1}\mathbf{x}_{n+1}y_{n+1} \quad (16.188)$$
from which, after some algebra, the recursive update of $\hat{\boldsymbol{\theta}}_n$ emerges.
16.5 Polynomial Regression

16.5.1 General Considerations

The polynomial regression model is the special case of multiple linear regression in which a single predictor, x, appears as successive integer powers:
$$Y = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_m x^m + \epsilon \quad (16.191)$$
Because this model is linear in the parameters, all the results obtained
earlier for the more general problem transfer directly, and there is not much to add for this restricted class of problems. However, in terms of practical application, there are some peculiarities unique to polynomial regression analysis.

In many practical problems, the starting point in polynomial regression is often a low-order linear model; when residual analysis indicates that the simple model is inadequate, the model complexity is then increased, typically by adding the next higher power of x, until the model is deemed adequate. But one must be careful: fitting an mth-order polynomial to m + 1 data points (e.g., fitting a straight line to 2 points) will produce a perfect $R^2 = 1$, but the parameter estimates will be unreliable. The primary pitfall to avoid in such an exercise is therefore overfitting, whereby the polynomial model is of an order higher than can be realistically supported by the data. Under such circumstances, the improvement in $R^2$ must be cross-checked against the corresponding $R^2_{adj}$ value.
The next examples illustrate the application of polynomial regression.
Example 16.9: BOILING POINT OF HYDROCARBONS: REVISITED
In Example 16.6, a linear two-parameter model was postulated for the relationship between the number of carbon atoms in the hydrocarbon compounds listed in Table 16.1 and the respective boiling points. Upon evaluation, however, the model was found to be inadequate; specifically, the residuals indicated the potential for a left-over quadratic component. Postulate the quadratic model,
$$Y = \theta_0 + \theta_1 x + \theta_2 x^2 + \epsilon \quad (16.192)$$
and use the data to obtain least-squares estimates of the three parameters.

Solution:
Note how the estimates for $\theta_0$ and $\theta_1$ are now different from the respective values obtained when these were the only two parameters in the model. This is a natural consequence of adding a new component to the model; the responsibility for capturing the variability in the data is now being shared by three parameters instead of two, and the best estimates of the model parameter set will change accordingly.

Before inspecting the model fit and the residuals, we note first that the three parameters in this case also are all significantly different from zero (the p-values are zero for the constant term and the linear term, and 0.002 for the quadratic coefficient). As expected, there is an improvement in $R^2$ for this more complicated model (99.7% versus 97.4% for the simpler linear model); furthermore, this improvement was also accompanied by a commensurate improvement in $R^2_{adj}$ (99.6% versus 97.0% for the simpler model). Thus, the improved model performance was not achieved at the expense of overfitting, indicating that the added quadratic term is truly warranted. The error standard deviation also shows an almost three-fold improvement, from S = 17.0142 for the linear model to S = 6.33734, again indicating that more of the variability in the data has been captured by the more complicated model.

The model fit to the data, shown in the top panel of Fig 16.12, indicates a much-improved fit compared with the one in the top panel of Fig 16.9. This is also consistent with everything we have noted so far. However, the plot of standardized residuals versus fitted values in the bottom panel of Fig 16.12 shows that there is still some left-over structure, the improved fit notwithstanding. The implication is that perhaps an additional cubic term might be necessary to capture the remaining structural information still visible in the residual plot. This suggests further revising the model as follows:
$$Y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \epsilon \quad (16.194)$$
The problem of fitting an adequate regression model to the data in Table 16.1 concludes with this next example.

Example 16.10: BOILING POINT OF HYDROCARBONS: PART III
As a follow-up to the analysis in the last example, fit the cubic equation
$$Y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \epsilon \quad (16.195)$$
to the data in Table 16.1, evaluate the model fit, and compare it to the fit obtained in Example 16.9.
Solution:
This time around, the MINITAB results are as follows:
Regression Analysis: Boiling Point versus n, n2, n3
The regression equation is
FIGURE 16.12: Modeling the dependence of the boiling points (BP) of hydrocarbon compounds in Table 16.1 on the number of carbon atoms in the compound (S = 6.33734; R-Sq = 99.7%; R-Sq(adj) = 99.6%). Top: Fitted quadratic curve of BP versus n, the number of carbon atoms; Bottom: standardized residuals versus fitted value ŷi. Despite the good fit, the visible systematic structure still left over in the residuals suggests adding one more term to the model.
Boiling Point = -244 + 93.2 n - 9.98 n2 + 0.515 n3

Predictor      Coef        SE Coef     T        P
Constant      -243.643     8.095     -30.10   0.000
n               93.197     7.325      12.72   0.000
n2              -9.978     1.837      -5.43   0.006
n3               0.5152    0.1348      3.82   0.019
(16.196)
Note that the estimates for $\theta_0$ and $\theta_1$ have changed once again, as has the estimate for $\theta_2$. Again, this is a natural consequence of adding the new parameter, $\theta_3$, to the model.

As a result of the p-values, we conclude once again that all four parameters in this model are significantly different from zero; the $R^2$ and $R^2_{adj}$ values are virtually perfect and identical, indicating that the expenditure of four parameters in this model is justified.

The error standard deviation has improved further by a factor of almost 2 (from S = 6.33734 for the quadratic model to S = 3.28531 for this cubic model), and the model fit to the data shows this improvement graphically in the top panel of Fig 16.13. This time, the residual plot in the bottom panel of Fig 16.13 shows no significant left-over structure. Therefore, in light of all the factors considered above, we conclude that the cubic fit in Eq 16.196 appears to provide an adequate fit to the data, and that this has been achieved without the expenditure of an excessive number of parameters.
16.5.2 Orthogonal Polynomial Regression
FIGURE 16.13: Modeling the dependence of the boiling points (BP) of hydrocarbon compounds in Table 16.1 on the number of carbon atoms in the compound (S = 3.28531; R-Sq = 99.9%; R-Sq(adj) = 99.9%). Top: Fitted cubic curve of BP versus n, the number of carbon atoms; Bottom: standardized residuals versus fitted value ŷi. There appears to be little or no systematic structure left in the residuals, suggesting that the cubic model provides an adequate description of the data.
Two polynomials $p_i(x)$ and $p_j(x)$ are orthogonal on a set of n discrete points $x_k$ if
$$\sum_{k=1}^{n}p_i(x_k)p_j(x_k) = 0; \quad i \neq j \quad (16.197)$$
For n equally spaced points on the interval [-1, 1],
$$x_k = -1 + \frac{2(k-1)}{n-1}; \quad k = 1, 2, \ldots, n \quad (16.198)$$
The set of Gram polynomials defined on [-1, 1] for $x_k$ as given above is:
$$p_0(x_k) = 1$$
$$p_1(x_k) = x_k$$
$$p_2(x_k) = x_k^2 - \frac{(n+1)}{3(n-1)}$$
$$p_3(x_k) = x_k^3 - \frac{(3n^2 - 7)}{5(n-1)^2}\,x_k$$
$$\vdots$$
$$p_{\gamma+1}(x_k) = x_k\,p_\gamma(x_k) - \frac{\gamma^2(n^2 - \gamma^2)}{(4\gamma^2 - 1)(n-1)^2}\,p_{\gamma-1}(x_k) \quad (16.199)$$
where each polynomial in the set is generated from the recurrence relation in Eq (16.199), given the initial two, $p_0(x_k) = 1$ and $p_1(x_k) = x_k$.
Example 16.11: ORTHOGONALITY OF GRAM POLYNOMIALS
Obtain the first four Gram polynomials determined at n = 5 equally spaced values, $x_k$, of the independent variable, x, on the interval $-1 \le x \le 1$. Show that these polynomials are mutually orthogonal.

Solution:
First, from Eq (16.198), the values $x_k$ at which the polynomials are to be determined in the interval $-1 \le x \le 1$ are:
$$x_1 = -1;\; x_2 = -0.5;\; x_3 = 0;\; x_4 = 0.5;\; x_5 = 1.$$
Next, let the 5-dimensional vector, $\mathbf{p}_i$; i = 0, 1, 2, 3, represent the values of the polynomial $p_i(x_k)$ determined at these five $x_k$ values, i.e.,
$$\mathbf{p}_i = \left[p_i(x_1)\;\; p_i(x_2)\;\; \cdots\;\; p_i(x_5)\right]^T \quad (16.200)$$
FIGURE 16.14: The Gram polynomials p0, p1, p2 and p3 evaluated at the five discrete points.
Then, from Eq (16.199),
$$\mathbf{p}_0 = \begin{bmatrix}1\\1\\1\\1\\1\end{bmatrix};\quad \mathbf{p}_1 = \begin{bmatrix}-1\\-0.5\\0\\0.5\\1\end{bmatrix};\quad \mathbf{p}_2 = \begin{bmatrix}0.50\\-0.25\\-0.50\\-0.25\\0.50\end{bmatrix};\quad \mathbf{p}_3 = \begin{bmatrix}-0.15\\0.30\\0.00\\-0.30\\0.15\end{bmatrix}$$
A plot of these values is shown in Fig 16.14, where we see that $p_0(x_k)$ is a constant, $p_1(x_k)$ is a straight line, $p_2(x_k)$ is a quadratic and $p_3(x_k)$ is a cubic, each one evaluated at the indicated discrete points.

To establish orthogonality, we compute the inner products, $\mathbf{p}_i^T\mathbf{p}_j$, for all combinations of $i \neq j$. First, we note that $\mathbf{p}_0^T\mathbf{p}_j$ is simply a sum of all the elements in each vector, which is uniformly zero in all cases, i.e.,
$$\mathbf{p}_0^T\mathbf{p}_j = \sum_{k=1}^{5}p_j(x_k) = 0; \quad j = 1, 2, 3 \quad (16.201)$$
Next, we obtain:
$$\mathbf{p}_1^T\mathbf{p}_2 = \sum_{k=1}^{5}p_1(x_k)p_2(x_k) = 0$$
$$\mathbf{p}_1^T\mathbf{p}_3 = \sum_{k=1}^{5}p_1(x_k)p_3(x_k) = 0$$
$$\mathbf{p}_2^T\mathbf{p}_3 = \sum_{k=1}^{5}p_2(x_k)p_3(x_k) = 0$$
establishing that the four polynomials are mutually orthogonal.
For completeness, the sums of squares, $\lambda_i^2 = \mathbf{p}_i^T\mathbf{p}_i$, are obtained below (note the monotonic decrease):
$$\lambda_0^2 = \mathbf{p}_0^T\mathbf{p}_0 = 5$$
$$\lambda_1^2 = \mathbf{p}_1^T\mathbf{p}_1 = 2.5$$
$$\lambda_2^2 = \mathbf{p}_2^T\mathbf{p}_2 = 0.875$$
$$\lambda_3^2 = \mathbf{p}_3^T\mathbf{p}_3 = 0.225$$
Application in Regression
Among many attractive properties possessed by orthogonal polynomials,
the following is the most relevant to the current discussion:
Orthogonal Basis Function Expansion: Any mth-order polynomial, U(x), can be expanded in terms of an orthogonal polynomial set, $p_0(x), p_1(x), \ldots, p_m(x)$, as the basis functions, i.e.,
$$U(x) = \sum_{i=0}^{m}\alpha_i p_i(x) \quad (16.202)$$
This result has some significant implications for polynomial regression involving the single independent variable, x. Observe that, as a consequence of this result, the original mth-order polynomial regression model in Eq (16.191) can be rewritten as
$$Y(x) = \alpha_0 p_0(x) + \alpha_1 p_1(x) + \alpha_2 p_2(x) + \cdots + \alpha_m p_m(x) + \epsilon \quad (16.203)$$
where we note that, given any specific set of orthogonal polynomial basis functions, the one-to-one relationship between the original parameters, $\theta_i$, and the new set, $\alpha_i$, is easily determined. Regression analysis is now concerned with estimating the new set of parameters, $\alpha_i$, instead of the old set, $\theta_i$, a task that is rendered dramatically easier because of the orthogonality of the basis set, $p_i(x)$, as we now show.

Suppose that the data $y_i; i = 1, 2, \ldots, n$, have been acquired using equally spaced values $x_k; k = 1, 2, \ldots, n$, in the range $[x_L, x_R]$ over which the orthogonal polynomial set, $p_i(x)$, is defined. In this case, from Eq (16.203), we will
have:
$$y(x_1) = \alpha_0 p_0(x_1) + \alpha_1 p_1(x_1) + \cdots + \alpha_m p_m(x_1) + \epsilon_1$$
$$y(x_2) = \alpha_0 p_0(x_2) + \alpha_1 p_1(x_2) + \cdots + \alpha_m p_m(x_2) + \epsilon_2$$
$$\vdots$$
$$y(x_n) = \alpha_0 p_0(x_n) + \alpha_1 p_1(x_n) + \cdots + \alpha_m p_m(x_n) + \epsilon_n \quad (16.204)$$
In matrix form, with P as the matrix whose columns are the vectors $\mathbf{p}_i$, the least-squares solution,
$$\hat{\boldsymbol{\alpha}} = (\mathbf{P}^T\mathbf{P})^{-1}\mathbf{P}^T\mathbf{y} \quad (16.205)$$
looks very much like what we have seen before, until we recall that, as a result of the orthogonality of the constituent elements of P, the matrix $\mathbf{P}^T\mathbf{P}$ is diagonal, with elements $\lambda_i^2$, because all the off-diagonal terms vanish identically (see Eq (16.197) and Example 16.11). As a result, the expression in Eq (16.205) is nothing but a collection of m + 1 isolated algebraic equations,
$$\hat{\alpha}_i = \frac{\sum_{k=1}^{n}p_i(x_k)y(x_k)}{\lambda_i^2} \quad (16.206)$$
where
$$\lambda_i^2 = \sum_{k=1}^{n}\left[p_i(x_k)\right]^2 \quad (16.207)$$
This approach has several additional advantages beyond the dramatically simplified computation:

1. Each parameter estimate, $\hat{\alpha}_i$, is independent of the others, and its value remains unaffected by the order chosen for the polynomial model. In other words, after obtaining the first m parameter estimates for an mth-order polynomial model, should we decide to increase the polynomial order to m + 1, the new parameter estimate, $\hat{\alpha}_{m+1}$, is simply obtained as
$$\hat{\alpha}_{m+1} = \frac{\sum_{k=1}^{n}p_{m+1}(x_k)y(x_k)}{\lambda_{m+1}^2} \quad (16.208)$$
using the very same data set $y(x_k)$, and only introducing $p_{m+1}(x_k)$, a precomputed vector of the (m+1)th polynomial. All the previously obtained values for $\hat{\alpha}_i; i = 1, 2, \ldots, m$, remain unchanged. This is very convenient indeed. Recall that this is not the case with regular polynomial regression (see Examples 16.9 and 16.10), where a change in the order of the polynomial model mandates a change in the values estimated for the new set of parameters.
2. From earlier results, we know that the variance associated with the estimates, $\hat{\alpha}_i$, is given by:
$$Var(\hat{\boldsymbol{\alpha}}) = (\mathbf{P}^T\mathbf{P})^{-1}\sigma^2 \quad (16.209)$$
so that
$$Var(\hat{\alpha}_i) = \frac{\sigma^2}{\lambda_i^2} \quad (16.210)$$
and since the value of the term $\lambda_i^2$, defined as in Eq (16.207), is determined strictly by the placement of the design points, $x_k$, where the data are obtained, Eq (16.210) indicates that this approach makes it possible to select experimental points so as to influence the variance of the estimated parameters favorably, with obvious implications for strategic design of experiments.

3. Finally, it can be shown that $\lambda_i^2$ decreases monotonically with i, indicating that the precision with which coefficients of higher-order polynomials are estimated worsens with increasing order. This is also true for regular polynomial regression, but it is not as obvious.
An example of how orthogonal polynomial regression has been used in engineering applications may be found in Kristinsson and Dumont, 1993⁵ and 1996⁶.
16.6 Summary and Conclusions

At its core, regression analysis is concerned with
the determination of the unknown parameters contained in the functional relationship (the regression model equation), given appropriate experimental data. The primary method discussed in this chapter for carrying out this task is the method of least squares. However, when regression analysis is cast as the probabilistic estimation problem that it truly is fundamentally, one can also employ the method of maximum likelihood to determine the unknown parameters; this, however, requires the explicit specification of a probability distribution for the observed random variability, something not explicitly required by the method of least squares. Still, under the normal distribution assumption, maximum likelihood estimates of the regression model parameters coincide precisely with least squares estimates (see Exercises 16.5 and 16.6).
In addition to the familiar, we have also presented some results for specialized problems, for example, when variances are not uniform across observations (weighted least squares); when the parameters are not independent but
are subject to (linear) constraints (constrained least squares); when the data
matrix is poorly conditioned perhaps because of collinearity (ridge regression); and when information is available sequentially (recursive least squares).
Space constraints compel us to limit the illustration and application of these
techniques to a handful of end-of-chapter exercises and application problems,
which are highly recommended to the reader.
It bears repeating, in conclusion, that since all the computations required for regression analysis are now routinely carried out with the aid of computers, it is all the more important to concentrate on understanding the principles behind these computations, so that computer-generated results can be interpreted appropriately. In particular, the well-informed engineer should understand the implications of the following on the problem at hand:

- the results of hypothesis tests on the significance of estimated parameters;
- the $R^2$ and $R^2_{adj}$ values as measures of how much of the information contained in the data has been adequately explained by the regression model, and with the expenditure of how many significant parameters;
REVIEW QUESTIONS
1. In regression analysis, what is an independent variable and what is a dependent
variable?
2. In regression analysis as discussed in this chapter, which variable is deterministic
and which is random?
3. In regression analysis, what is a predictor and what is a response variable?
4. Regression analysis is concerned with what tasks?
5. What is the principle of least squares?
6. In simple linear regression, what is a one-parameter model; what is a two-parameter model?
7. What are the two main assumptions underlying regression analysis?
8. In simple linear regression, what are the normal equations and how do they
arise?
9. In simple linear regression, under what conditions are the least squares estimates
identical to the maximum likelihood estimates?
10. In regression analysis, what are residuals?
11. What does it mean that OLS estimators are unbiased?
12. Why is the confidence interval around the regression line curved? Where is the interval narrowest?
13. What does hypothesis testing entail in linear regression? What is H0 and what
is Ha in this case?
14. What is the difference between using the regression line to estimate mean responses and using it to predict a new response?

15. Why are prediction intervals consistently wider than confidence intervals?
16. What is R2 and what is its role in regression?
17. What is $R^2_{adj}$ and what differentiates it from $R^2$?
19. In the context of simple linear regression, what is an F -test used for?
20. What is the connection between $R^2$, the coefficient of determination, and the correlation coefficient?
21. If a regression model represents a data set adequately, what should we expect
of the residuals?
22. What does residual analysis allow us to do?
23. What activities are involved in residual analysis?
24. What are standardized residuals?
25. Why is it recommended for residual plots to be based on standardized residuals?
26. The term linear in linear regression refers to what?
27. As far as regression is concerned, how does one determine whether the problem
is linear or nonlinear?
28. What is an intrinsically linear model?
29. In employing variable transformations to convert nonlinear regression problems
to linear ones, what important issue should be taken into consideration?
30. What is multiple linear regression?
31. What is the hat matrix and what is its role in multiple linear regression?
32. What are some reasons for using weights in regression problems?
33. What is constrained least squares and what class of problems require this approach?
34. What is ridge regression and under what condition is it recommended?
35. What is the principle behind recursive least squares?
36. What is polynomial regression?
37. What is special about orthogonal polynomial regression?
38. What is the orthogonal basis function expansion result and what are its implications for polynomial regression?
EXERCISES

16.1 Given the one-parameter model,
$$y_i = \theta x_i + \epsilon_i$$
where $\{y_i\}_{i=1}^{n}$ is the specific sample data set, and $\epsilon_i$, the random error component, has zero mean and variance $\sigma^2$, it was shown in Eq (19.49) that the least squares estimate of the parameter $\theta$ is
$$\hat{\theta} = \frac{\sum_{i=1}^{n}x_i y_i}{\sum_{i=1}^{n}x_i^2}$$
(i) Show that this estimate is unbiased for $\theta$, i.e., $E(\hat{\theta}) = \theta$.
(i) Determine the weighted least squares estimates of $\theta_0$ and $\theta_1$ that minimize
$$S_W(\theta_0, \theta_1) = \sum_{i=1}^{n}W_i\left[y_i - (\theta_0 + \theta_1 x_i)\right]^2$$
and compare these to the ordinary least squares estimates obtained in Eqs (16.38) and (16.37).
(ii) Show that $E(\hat{\theta}_0) = \theta_0$ and $E(\hat{\theta}_1) = \theta_1$.
(iii) Determine $Var(\hat{\theta}_0)$ and $Var(\hat{\theta}_1)$.
16.5 Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from a Gaussian distribution with mean $\eta(x, \theta)$ and variance $\sigma^2$; i.e., $Y \sim N(\eta(x, \theta), \sigma^2)$.
(i) For the one-parameter model where
$$\eta = \theta x$$
determine the maximum likelihood estimate of the parameter $\theta$ and compare it to the least squares estimate in Eq (19.49).
(ii) When the model is
$$\eta = \theta_0 + \theta_1 x$$
(the two-parameter model), determine the maximum likelihood estimates of the parameters $\theta_0$ and $\theta_1$; compare these to the corresponding least squares estimates in Eqs (16.38) and (16.37).
16.6 Let each individual observation, $Y_i$, be an independent sample from a Gaussian distribution with mean $\eta(x, \theta)$ and variance $\sigma_i^2$, i.e., $Y_i \sim N(\eta(x, \theta), \sigma_i^2)$; $i = 1, 2, \ldots, n$, where the variances are not necessarily equal.
(i) Determine the maximum likelihood estimate of the parameter $\theta$ in the one-parameter model,
$$\eta = \theta x$$
and show that it has the form of a weighted least squares estimate. What are the weights?
(ii) Determine the maximum likelihood estimates of the parameters $\theta_0$ and $\theta_1$ in the two-parameter model,
$$\eta = \theta_0 + \theta_1 x$$
Show that these are also similar to weighted least squares estimates. What are the weights in this case?
16.7 From the definitions of the estimators for the parameters in the two-parameter model given in Eqs (16.46) and (16.47), i.e.,
$$\hat{\theta}_1 = \frac{S_{xY}}{S_{xx}}; \quad \hat{\theta}_0 = \bar{Y} - \hat{\theta}_1\bar{x}$$
obtain expressions for the respective variances of each estimator, and hence establish the results given in Eqs (16.55) and (16.56).
16.8 A fairly common mistake in simple linear regression is the use of the one-parameter model (where the intercept is implicitly set to zero) in place of the more general two-parameter model, thereby losing the flexibility of estimating an intercept which may or may not be zero. When the true intercept in the data is non-zero, such a mistake will lead to an error in the estimated value of the single parameter, the slope $\theta$, because the least squares criterion has no option but to honor the implicit constraint forcing the intercept to be zero. This will compromise the estimate of the true slope. The resulting estimation error may be quantified explicitly as follows.

First, show that the relationship between $\hat{\theta}_1$, the estimated slope in the two-parameter model, and $\hat{\theta}$, the estimated slope in the one-parameter model, is:
$$\hat{\theta}_1 = \frac{\sum_{i=1}^{n}x_i^2}{\sum_{i=1}^{n}x_i^2 - n\bar{x}^2}\,\hat{\theta} - \frac{n\bar{x}^2}{\sum_{i=1}^{n}x_i^2 - n\bar{x}^2}\,\frac{\bar{y}}{\bar{x}}$$
so that the two slopes are the same if and only if
$$\hat{\theta} = \frac{\bar{y}}{\bar{x}} \quad (16.211)$$
which will be the case when the intercept is truly zero. Next, show from here that
$$\hat{\theta} = \lambda\,\frac{\bar{y}}{\bar{x}} + (1 - \lambda)\hat{\theta}_1 \quad (16.212)$$
with $\lambda = n\bar{x}^2/\sum_{i=1}^{n}x_i^2$, indicating clearly the least squares compromise: a weighted average of $\hat{\theta}_1$ (the true slope when the intercept is not zero) and $\bar{y}/\bar{x}$ (the true slope when the intercept is actually zero).

Finally, show that the estimation error, $e = \hat{\theta} - \hat{\theta}_1$, will be given by:
$$e = \hat{\theta} - \hat{\theta}_1 = \lambda\left(\frac{\bar{y}}{\bar{x}} - \hat{\theta}_1\right) \quad (16.213)$$
16.9 By defining as $S_{yy}$ the total variability present in the data, i.e., $\sum_{i=1}^{n}(y_i - \bar{y})^2$ (see Eq (16.32)), and by rearranging this as follows:
$$S_{yy} = \sum_{i=1}^{n}\left[(y_i - \hat{y}_i) - (\bar{y} - \hat{y}_i)\right]^2$$
expand and simplify this expression to establish the important result in Eq (16.89),
$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \quad \text{or } S_{yy} = SS_R + SS_E$$
16.10 Identify which of the following models presents a linear or nonlinear regression problem in estimating the unknown parameters, $\theta_i$:
$$(i)\quad Y = \theta_0 + \theta_2 x^3 + \epsilon$$
$$(ii)\quad Y = \theta_0 + \frac{\theta_1}{x} + \epsilon$$
$$(iii)\quad Y = \theta_0 + \theta_1 e^x + \theta_2\sin x + \theta_3 x + \epsilon$$
$$(iv)\quad Y = \theta_0 e^{\theta_2 x} + \epsilon$$
$$(v)\quad Y = \theta_0 x_1^{\theta_1}x_2^{\theta_2} + \epsilon$$
16.11 The following models, sampled from various branches of science and engineering, are nonlinear in the unknown parameters. Convert each into a linear regression model; indicate an explicit relationship between the original parameters and the transformed ones.
(i) Antoine's equation: vapor pressure, $P^{vap}$, as a function of temperature, T:
$$P^{vap} = e^{\beta_0 - \frac{\beta_1}{T + \beta_2}}$$
(ii) Cellular growth rate (exponential phase): N, the number of cells in the culture, as a function of time, t:
$$N = \beta_0 e^{\beta_1 t}$$
(iii) Kleiber's law of bioenergetics: resting energy expenditure, $Q_0$, as a function of an animal's mass, M:
$$Q_0 = \beta_0 M^{\beta_1}$$
(iv) Gilliland-Sherwood correlation: mass transfer in falling liquid films in terms of Sh, the Sherwood number, as a function of two other dimensionless numbers, Re, the Reynolds number, and Sc, the Schmidt number:
$$Sh = \beta_0 Re^{\beta_1} Sc^{\beta_2}$$
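As an illustration of the transformation for a model like (iii), the following hedged sketch (synthetic data; the parameter values are assumptions of mine, not from the text) fits Kleiber's law by ordinary linear regression on log-transformed variables:

```python
# Taking logs of Q0 = b0 * M**b1 gives ln Q0 = ln b0 + b1 * ln M,
# which is linear in the transformed parameters (ln b0, b1).
import numpy as np

rng = np.random.default_rng(3)
M = np.logspace(1, 6, 30)                          # synthetic body masses
Q0 = 0.02 * M**0.75 * rng.lognormal(0.0, 0.1, M.size)

b1, ln_b0 = np.polyfit(np.log(M), np.log(Q0), 1)   # straight-line fit in logs
print(np.exp(ln_b0), b1)                           # recovers b0 ~ 0.02, b1 ~ 0.75
```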
16.12 Establish that the hat matrix,
$$H = X \left( X^T X \right)^{-1} X^T$$
is idempotent. Establish that $(I - H)$ is also idempotent. As a result, establish that not only is $\hat{y} = Hy$, but
$$H\hat{y} = Hy$$
Similarly, not only is $e = (I - H)y$, but
$$(I - H)e = (I - H)y$$
where e is the residual vector defined in Eq (16.160).
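A quick numerical check of these properties (an illustration of mine with a random design matrix, not from the text):

```python
# Verify numerically that H and (I - H) are idempotent.
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])  # n=8, p=3 design
y = rng.normal(size=8)

H = X @ np.linalg.inv(X.T @ X) @ X.T
yhat = H @ y
e = (np.eye(8) - H) @ y

print(np.allclose(H @ H, H))                 # H is idempotent
print(np.allclose(H @ yhat, yhat))           # H*yhat = yhat = H*y
print(np.allclose((np.eye(8) - H) @ e, e))   # (I - H) is idempotent on e
```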
16.13 The angles between three celestial bodies that aligned to form a triangle in a plane at a particular point in time have been measured as $y_1 = 91°$, $y_2 = 58°$, and $y_3 = 33°$. Since the measurement device cannot determine these angles without error, the results do not add up to 180° as they should; but arbitrarily forcing the numbers to add up to 180° is ad hoc and undesirable. Formulate the problem instead as a constrained least squares problem,
$$y_i = \theta_i + \epsilon_i; \quad i = 1, 2, 3,$$
subject to the constraint:
$$\theta_1 + \theta_2 + \theta_3 = 180°$$
and determine the least squares estimates of these angles. Confirm that these estimates add up to 180°.
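One standard route to the solution, sketched below, is the method of Lagrange multipliers, under which the closure error is spread equally over the three angle estimates; this is offered as an illustration of mine and may differ in presentation from the text's intended derivation:

```python
# Constrained least squares with a single linear constraint:
# minimizing sum (y_i - theta_i)^2 subject to sum theta_i = 180 gives
# theta_i = y_i - (sum(y) - 180)/3.
import numpy as np

y = np.array([91.0, 58.0, 33.0])          # measured angles (degrees)
theta = y - (y.sum() - 180.0) / 3.0       # spread the closure error equally
print(theta, theta.sum())                 # estimates sum to exactly 180
```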
16.14 When the data matrix X in the multiple regression equation
$$y = X\beta + \epsilon$$
is poorly conditioned, it was recommended in the text that the ridge regression estimate
$$\hat{\beta}_{RR} = \left( X^T X + k^2 I \right)^{-1} X^T y$$
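A minimal sketch of this estimator on synthetic, nearly collinear data (my own illustration; the choice of the ridge parameter k² here is arbitrary):

```python
# Ridge regression: beta_RR = (X'X + k^2 I)^(-1) X'y stabilizes the estimate
# when X'X is nearly singular.
import numpy as np

rng = np.random.default_rng(5)
x1 = rng.normal(size=30)
X = np.column_stack([x1, x1 + 1e-4 * rng.normal(size=30)])  # nearly collinear
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=30)

k2 = 0.1                                               # assumed ridge parameter
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)           # unstable for this X
beta_rr = np.linalg.solve(X.T @ X + k2 * np.eye(2), X.T @ y)
print(beta_ols, beta_rr)
```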
y: -1.9029, -0.2984, 0.4047, 0.5572, 0.9662, 2.0312, 3.2286, 5.7220, 10.0952
APPLICATION PROBLEMS
16.18 A predictive model is sometimes evaluated by plotting its predictions directly against the corresponding experimental data in an (x, y) plot: if the model predicts the data adequately, a regression line fit should be a 45-degree line with slope 1 and intercept 0; the residuals should appear as a random sequence of numbers that are independent, normally distributed, and with zero mean and variance close to the measurement error variance. This technique is to be used to evaluate two models of multicomponent transport as follows.

In Kerkhof and Geboers (2005)⁷, the authors presented a new approach to modeling multicomponent transport that is purported to yield more accurate predictions than previously available models. The table below shows the experimentally determined viscosity (10⁻⁵ Pa·s) of 12 different gas mixtures and the corresponding values predicted by two models: (i) the classical Hirschfelder-Curtiss-Bird (HCB) model⁸, and (ii) their new (KG) model.

⁷Kerkhof, P.J.A.M., and M.A.M. Geboers, (2005). "Toward a unified theory of isotropic molecular transport phenomena," AIChE Journal, 51(1), 79-121.
⁸Hirschfelder, J.O., C.F. Curtiss, and R.B. Bird, (1964). Molecular Theory of Gases and Liquids, 2nd printing. J. Wiley, New York, NY.
Viscosity (10⁻⁵ Pa·s)
Experimental   HCB           KG
Data           Predictions   Predictions
2.740          2.718         2.736
2.569          2.562         2.575
2.411          2.429         2.432
2.504          2.500         2.512
3.237          3.205         3.233
3.044          3.025         3.050
2.886          2.895         2.910
2.957          2.938         2.965
3.790          3.752         3.792
3.574          3.551         3.582
3.415          3.425         3.439
3.470          3.449         3.476
(i) Treating the KG model prediction as the independent variable and the experimental data as the response variable, fit a two-parameter model and thoroughly evaluate the regression results: the parameter estimates, their significance, the R² and R²adj values, and the residuals. Plot the regression line along with a 95% confidence interval around the regression line. In light of this regression analysis, provide your opinion about the authors' claim that their model provides an excellent agreement with the data.
(ii) Repeat (i) for the HCB model. In light of your results here and in (i), comment on whether or not the KG model can truly be said to provide better predictions than the HCB model.
16.19 In an attempt to quantify a possible relationship between the amount of fire damage caused by residential fires and the distance from the residence to the closest fire station, the following data were acquired from a random sample of 12 recent fires.
Distance from     Fire Damage    Distance from     Fire Damage
Fire Station      y ($10³)       Fire Station      y ($10³)
x (miles)                        x (miles)
1.8               17.8           5.5               36.0
4.6               31.3           3.0               22.3
0.7               14.1           4.3               31.3
3.4               26.2           1.1               17.3
2.3               23.1           3.1               27.5
2.6               19.6           2.1               24.0
(i) Postulate an appropriate model, estimate the model parameters, and evaluate the model fit.
(ii) An insurance company wishes to use this model to estimate the expected fire damage to two new houses: house A, being built at a distance of 5 miles from the nearest fire station, and house B, 3 miles from the same fire station. Determine these estimates along with appropriate uncertainty intervals.
(iii) Is it safe to use this model to predict the fire damage to a house C that is being built 6 miles from the nearest fire station? Regardless of your answer, provide a prediction and an appropriate uncertainty interval.
16.20 Refer to Problem 16.19. Now consider that three new residential fires have occurred and the following additional data set has become available at the same time.
Distance from Fire Station, x (miles):   3.8    4.8    6.1
Fire damage, y ($10³):                  26.1   36.4   43.2
(i) Use recursive least squares to adjust the previously obtained set of parameter estimates in light of this new information. By how much have the parameters changed?
(ii) Recalculate the estimated expected fire damage values to houses A and B in light of the new data; compare these values to the corresponding values obtained in Exercise 16.19. Have these values changed by amounts that may be considered practically important?
(iii) With this new data, is it safe to use the updated model to predict the fire damage to house C? Predict this fire damage amount.
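A hedged sketch of one standard recursive least squares (RLS) update that could be used for part (i); the starting values below are placeholders standing in for the Problem 16.19 fit results (in particular, P should be the $(X^TX)^{-1}$ matrix from that fit, not the identity used here for illustration):

```python
# One common RLS formulation: fold in each new observation without refitting.
import numpy as np

def rls_update(beta, P, x_new, y_new):
    """One RLS step: x_new is the regressor row [1, x]; beta the current
    estimate; P the current (X'X)^(-1) matrix."""
    x = np.asarray(x_new, dtype=float)
    gain = P @ x / (1.0 + x @ P @ x)          # Kalman-style gain vector
    beta = beta + gain * (y_new - x @ beta)   # correct by the prediction error
    P = P - np.outer(gain, x @ P)             # update the inverse-moment matrix
    return beta, P

# Hypothetical starting values (beta0, P0 would come from the original fit):
beta, P = np.array([10.28, 4.92]), np.eye(2)
for xi, yi in [(3.8, 26.1), (4.8, 36.4), (6.1, 43.2)]:
    beta, P = rls_update(beta, P, [1.0, xi], yi)
print(beta)
```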
16.21 In Ogunnaike (2006)⁹, the data in the following table was used to characterize the extent of DNA damage, y, experienced by cells exposed to radiation of dose x (Gy), with the power-law relationship
$$y = \beta_0 x^{\beta_1}$$
Radiation Dose    Extent of
x (Gy)            DNA Damage, y
 0.00             0.05
 0.30             1.30
 2.50             2.10
10.0              3.10
cooling is dependent on the purity of the ingots. Ingots of known purity were forged
into pistons, and the average number of cracked pistons per batch was recorded in
the table below.
Purity, x                        0.94   0.95   0.96   0.97   0.98   0.99
Ave # of Cracked Pistons, y      4.8    4.6    3.9    3.3    2.7    2.0
Over the small range of purity, the dependence of y, the average number of cracked pistons per batch, on purity, x, may be assumed to be linear, i.e.,
$$y = \beta_1 x + \beta_0 \qquad (16.214)$$
Ingot                1   2   3   4   5
# Cracked Pistons    4   6   3   6   6
Knowing the average number of cracked pistons per batch expected for ingots of 96% purity via Eq (16.214) and the parameters found in part (i), determine if the steel mill's purity claim is reasonable, based on the sample of 5 ingots.
(iii) Repeat part (ii) assuming that 20 ingots from the mill were tested instead of 5, but that the mean and variance for the 20 ingots are the same as was calculated for the sample of 5 ingots in part (ii).
16.23 The table below, first introduced in Chapter 12, shows city and highway gasoline mileage ratings, in miles per gallon (mpg), for 20 types of two-seater automobiles, complete with engine characteristics: capacity (in liters) and number of cylinders.
(i) Obtain appropriate regression models relating the number of cylinders (as x) to highway gas mileage and to city gas mileage (as y). At the 95% confidence level, is there a difference between the parameters of these different models?
(ii) By analyzing the residuals, determine whether these models provide reasonable explanations of how the number of cylinders a car engine has affects the gas mileage, either in the city or on the highway.
 #   Car Type and Model         Eng Capacity   # Cylinders   City   Highway
                                (Liters)                     mpg    mpg
 1   Aston Martin V8 Vantage    4.3             8            13     20
 2   Audi R8                    4.2             8            13     19
 3   Audi TT Roadster           2.0             4            22     29
 4   BMW Z4 3.0i                3.0             6            19     28
 5   BMW Z4 Roadster            3.2             6            15     23
 6   Bugatti Veyron             8.0            16             8     14
 7   Cadillac XLR               4.4             8            14     21
 8   Chevrolet Corvette         7.0             8            15     24
 9   Dodge Viper                8.4            10            13     22
10   Ferrari 599 GTB            5.9            12            11     15
11   Honda S2000                2.2             4            18     25
12   Lamborghini Murcielago     6.5            12             8     13
13   Lotus Elise/Exige          1.8             4            21     27
14   Mazda MX5                  2.0             4            21     28
15   Mercedes Benz SL65 AMG     6.0            12            11     18
16   Nissan 350Z Roadster       3.5             6            17     24
17   Pontiac Solstice           2.4             4            19     25
18   Porsche Boxster-S          3.4             6            18     25
19   Porsche Cayman             2.7             6            19     28
20   Saturn SKY                 2.0             4            19     28
16.24 Refer to the data set in Problem 16.23. Repeat the analysis this time for
engine capacity. Examine the residuals and comment on any unusual observations.
16.25 The data in the table below shows the size of a random sample of 10 homes
(in square feet) located in the Mid-Atlantic region of the US, and the corresponding
amount of electricity used (KW-hr) monthly in each home.
From a scatter plot of the data, postulate an appropriate regression model and estimate the parameters. Comment on the significance of the estimated parameters. Investigate the residuals and comment on the model fit to the data. What do the model parameters signify about how the size of a home in this region of the US influences the amount of electricity used?
Home Size     Electricity
x (sq. ft)    Usage y (KW-hr)
1290          1182
1350          1172
1470          1264
1600          1493
1710          1571
1840          1711
1980          1804
2230          1840
2400          1956
2930          1954
(ii) Using the parameters estimated in (i), plot the data in terms of $\log(Sh/Sc^{\hat\beta_2})$ vs $\log Re$ along with your regression model and comment on the observed fit.
  Sh      Re      Sc
 43.7   10800   0.600
 21.5    5290   0.600
 42.9    7700   1.610
 19.8    2330   1.610
 24.2    3120   1.800
 88.0   14400   1.800
 93.0   16300   1.830
 70.6   13000   1.830
 32.3    4250   1.860
 56.0    8570   1.860
 51.6    6620   1.875
 50.7    8700   1.875
 26.1    2900   2.160
 41.3    4950   2.160
 92.8   14800   2.170
 54.2    7480   2.170
 65.5    9170   2.260
 38.2    4720   2.260
16.27 For efficient and profitable operation (especially during the summer months), electrical power companies need to predict, as precisely as possible, the peak power load (P), defined as the daily maximum amount of power required to meet demand. The inability to predict P accurately and to provide sufficient power to meet the indicated demand is responsible in part for many blackouts/brownouts.

The data shown in the table below is a random sample of 30 daily high temperatures (T, °F) and corresponding P (in megawatts) acquired between the months of May and August in a medium-sized city.

¹⁰Gilliland, E.R., and T.K. Sherwood, (1934). "Diffusion of vapors into air streams," Ind. Eng. Chem., 26, 516-523.
Temp (°F)   P (megawatts)   Temp (°F)   P (megawatts)
 95         140.7            79         106.2
 88         116.4            76         100.2
 84         113.4            87         114.7
106         178.2            92         135.1
 94         136.0            68          96.3
108         189.3            85         111.4
 90         132.0           100         143.6
100         151.9            74         103.9
 71          92.5            89         116.5
 96         131.7            86         105.1
 67          96.5            75          99.6
 98         150.1            70          97.7
 97         153.2            69          97.6
 67         101.6            82         107.3
 89         118.5           101         157.6
16.28 The table below¹¹ shows near-infrared reflectance measurements (x) and the corresponding protein content (y, %) for 24 wheat samples; the data are to be used to calibrate a near-infrared reflectance instrument for determining the protein content of wheat.

Observed        Protein      Observed        Protein
Reflectance x   Content % y  Reflectance x   Content % y
386              9.23        443             10.57
383              8.01        450             10.23
353             10.95        467             11.87
340             11.67        451              8.09
371             10.41        524             12.55
433              9.51        407              8.38
377              8.67        374              9.64
353              7.75        391             11.35
377              8.05        353              9.70
398             11.39        445             10.75
378              9.95        383             10.75
365              8.25        404             11.47

¹¹Fearn, T., (1983). "A misuse of ridge regression in the calibration of a near-infrared reflectance instrument," Applied Statistics, 32, 73-79.
Obtain an expression for the calibration line relating the protein content to the reflectance measurement. Determine the significance of the parameters. Plot the data, the model line, and the 95% confidence and prediction intervals. Comment objectively on how useful you expect the calibration line to be.
16.29 The following data set, obtained in an undergraduate fluid mechanics lab experiment, shows actual air flow rate measurements, determined at room temperature, 25°C, and 1 atmosphere of pressure, along with the corresponding rotameter readings.
Rotameter    Air Flow Rate
Reading x    y (cc/sec)
 20           15.5
 40           38.3
 60           50.2
 80           72.0
100          111.1
120          115.4
140          139.0
(i) Determine an appropriate equation that can be used to calibrate the rotameter and from which actual air flow rates can be determined for any given rotameter reading. From the significance of the parameters and the R² and R²adj values, is this a reliable expression to use as a calibration equation? From an appropriate analysis of the residuals, comment on how carefully the experimental data were determined.
(ii) Plot the data and the regression equation along with the 95% confidence interval and prediction interval bands.
(iii) Determine, along with 95% confidence intervals, the expected value of the air flow rates for rotameter readings of 70, 75, 85, 90, and 95.
16.30 The data shown in the table below, from Beck and Arnold (1977)¹², shows five samples of the thermal conductivity of a steel alloy as a function of temperature. The standard deviation, σᵢ, associated with each measurement varies as indicated.
Sample   Temperature   Thermal Conductivity   Standard Deviation
i        xᵢ (°C)       kᵢ (W/m-°C)            σᵢ
1        100           36.3                   0.2
2        200           36.3                   0.3
3        300           34.6                   0.5
4        400           32.9                   0.7
5        600           31.2                   1.0
Over the indicated temperature range, the thermal conductivity varies linearly with temperature; therefore the two-parameter model is deemed appropriate, i.e.,
$$k_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

¹²J.V. Beck and K.J. Arnold, (1977). Parameter Estimation in Engineering and Science, J. Wiley, NY, p. 209.
(i) Determine the weighted least squares estimates of the parameters β₀ and β₁, using as weights wᵢ = 1/σᵢ.
(ii) Determine ordinary least squares estimates of the same parameters using no weights. Compare these estimates.
(iii) Plot the data along with the two regression equations obtained above. Which one fits the data better?
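A short sketch (my own, using the table's data) contrasting parts (i) and (ii) with numpy.polyfit; note that polyfit's optional w multiplies each residual before squaring, so passing √wᵢ implements the criterion Σwᵢrᵢ² with wᵢ = 1/σᵢ:

```python
# Weighted vs ordinary least squares for Exercise 16.30.
import numpy as np

T = np.array([100.0, 200.0, 300.0, 400.0, 600.0])
k = np.array([36.3, 36.3, 34.6, 32.9, 31.2])
sigma = np.array([0.2, 0.3, 0.5, 0.7, 1.0])

# polyfit squares w*r, so w = sqrt(1/sigma) realizes weights w_i = 1/sigma_i.
b1_w, b0_w = np.polyfit(T, k, 1, w=1.0 / np.sqrt(sigma))
b1_o, b0_o = np.polyfit(T, k, 1)
print((b0_w, b1_w), (b0_o, b1_o))   # the weighted fit leans on the precise points
```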
16.31 The data table below is typical of standard tables of thermophysical properties of liquids and gases used widely in chemical engineering practice, especially in process simulation. This specific data set shows the temperature dependence of the heat capacity, Cp, of methylcyclohexane.
Temperature   Heat Capacity     Temperature   Heat Capacity
(Kelvin)      Cp (kJ/kg·K)      (Kelvin)      Cp (kJ/kg·K)
150           1.426             230           1.627
160           1.447             240           1.661
170           1.469             250           1.696
180           1.492             260           1.732
190           1.516             270           1.770
200           1.541             280           1.801
210           1.567             290           1.848
220           1.596             300           1.888
$$C_p = \beta_0 + \beta_1 T + \beta_2 T^2$$
Again, check the significance of the parameters, the residuals, and the R² and R²adj values. Finally, fit the cubic equation,
$$C_p = \beta_0 + \beta_1 T + \beta_2 T^2 + \beta_3 T^3$$
Once more, check the significance of the parameters, the residuals, and the R² and R²adj values.
Which of the three models is more appropriate to use as an empirical relationship representing how the heat capacity of methylcyclohexane changes with temperature? Plot the data and the regression curve of the selected model, along with the 95% confidence and prediction intervals.
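One way to organize the comparison is by adjusted R², sketched below using the usual definitions (an illustration of mine, not the text's worked solution):

```python
# Compare linear, quadratic, and cubic fits of Cp vs T by adjusted R^2.
import numpy as np

T = np.arange(150.0, 310.0, 10.0)           # the 16 table temperatures
Cp = np.array([1.426, 1.447, 1.469, 1.492, 1.516, 1.541, 1.567, 1.596,
               1.627, 1.661, 1.696, 1.732, 1.770, 1.801, 1.848, 1.888])

def adj_r2(y, yhat, p):
    """Adjusted R^2 for a model with p parameters (intercept included)."""
    n = y.size
    sse = np.sum((y - yhat) ** 2)
    syy = np.sum((y - y.mean()) ** 2)
    return 1.0 - (sse / (n - p)) / (syy / (n - 1))

for deg in (1, 2, 3):
    coeffs = np.polyfit(T, Cp, deg)
    print(deg, adj_r2(Cp, np.polyval(coeffs, T), deg + 1))
```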
16.32 The change in the bottoms temperature of a binary distillation column in response to a pulse input in the steam flow rate to the reboiler is represented by the following equation:
$$y = T - T_0 = \frac{AK}{\tau} e^{-t/\tau} \qquad (16.215)$$
where T₀ is the initial (steady state) temperature before the perturbation; A is the magnitude of the pulse input, idealized as a perfect delta function; t is time; and K and τ are, respectively, the process gain and time constant, unknown parameters of the process when the dynamics are approximated as a first-order system¹³. A process identification experiment performed to estimate K and τ yielded the data in the following table, starting from an initial temperature of 185°C, and using an impulse input of magnitude A = 10.
Time      Bottoms Temperature
t (min)   T (°C)
 0        189.02
 1        188.28
 2        187.66
 3        187.24
 5        186.54
10        185.46
15        185.20
Even though Eq (16.215) presents a nonlinear regression problem, it is possible, via an appropriate variable transformation, to convert it to a linear regression equation. Use an appropriate transformation and obtain an estimate of the process parameters from the provided data.
Prior to performing the experiment, an experienced plant operator stated that, historically, in the operating range in question, the process parameters have been characterized as K ≈ 2 and τ ≈ 5. How close are these "guesstimates" to the actual estimated values?
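One such transformation is to take logarithms, since ln y = ln(AK/τ) − t/τ is linear in t; the sketch below (my own illustration, using the table's data) applies it:

```python
# Linearize Eq (16.215): a straight-line fit of ln(y) vs t gives tau from the
# slope and K from the intercept (with A = 10 and T0 = 185).
import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0, 5.0, 10.0, 15.0])
T = np.array([189.02, 188.28, 187.66, 187.24, 186.54, 185.46, 185.20])
y = T - 185.0

slope, intercept = np.polyfit(t, np.log(y), 1)
tau = -1.0 / slope
K = tau * np.exp(intercept) / 10.0       # since intercept = ln(A*K/tau), A = 10
print(K, tau)                            # compare with the operator's K~2, tau~5
```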
16.33 Fit Antoine's equation,
$$P^{vap} = e^{\beta_0 - \frac{\beta_1}{T + \beta_2}}$$
to the data in the table below, which shows the temperature dependence of the vapor pressure of Toluene.
P^vap (mm Hg)   T (°C)
Toluene
   5             -4.4
  10              6.4
  20             18.4
  40             31.8
  60             40.3
 100             51.9
 200             69.5
 400             89.5
 760            110.6
1520            136.5
Use the fitted model to interpolate and obtain expected values for the vapor pressure of Toluene at the following temperatures: 0, 25, 50, 75, 100, and 125 (°C). Since using linear regression to fit the equation to the data will require a variable transformation, obtain 95% confidence intervals for these expected values first in the transformed variables and then convert to approximate confidence intervals in the original variables.

¹³B.A. Ogunnaike and W.H. Ray, (1994). Process Dynamics, Modeling and Control, Oxford University Press, NY, Chapter 5.
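One possible linearization for Exercise 16.33, offered as an assumption of mine rather than the text's prescribed route: multiplying ln P = β₀ − β₁/(T + β₂) through by (T + β₂) gives T ln P = β₀T − β₂ ln P + (β₀β₂ − β₁), which is linear in the regressors T and ln P:

```python
# Fit the rearranged Antoine equation by ordinary multiple linear regression.
import numpy as np

P = np.array([5.0, 10, 20, 40, 60, 100, 200, 400, 760, 1520])
T = np.array([-4.4, 6.4, 18.4, 31.8, 40.3, 51.9, 69.5, 89.5, 110.6, 136.5])

lnP = np.log(P)
A = np.column_stack([np.ones_like(T), T, lnP])        # regressors: 1, T, lnP
(a0, a1, a2), *_ = np.linalg.lstsq(A, T * lnP, rcond=None)

b0, b2 = a1, -a2                                      # recover Antoine constants
b1 = b0 * b2 - a0
print(b0, b1, b2)     # e-based, mm Hg / Celsius form of the constants
```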
16.34 In September 2007, two graduate students¹⁴ studying at the African Institute of Mathematical Sciences (AIMS) in Muizenberg, South Africa, took the following measurements of wingspan (the fingertip-to-fingertip length of outstretched hands) and height for 36 of their classmates.
Wingspan (cm)   Height (cm)   Wingspan (cm)   Height (cm)
182.50          171.00        165.50          158.00
167.50          161.50        193.00          189.50
175.00          170.00        198.00          183.00
163.00          164.00        181.50          181.00
186.50          180.00        154.00          157.00
168.50          162.50        168.00          165.00
166.50          158.00        174.00          166.50
156.00          157.00        180.00          172.00
153.00          156.00        173.00          171.50
170.50          162.00        188.00          179.00
164.50          157.50        188.00          176.00
170.50          165.50        180.00          178.00
173.00          164.00        160.00          163.00
189.00          182.00        200.00          184.00
179.50          174.00        177.00          180.00
174.50          165.00        179.00          169.00
186.00          175.00        197.00          183.00
192.00          188.00        168.50          165.00
theoretical value.
Species                     Body Mass M (g)   BMR Q₀ (Watts)
Camelus dromedarius             407000.00        229.18
Sus scrofa                      135000.00        104.15
Tragulus javanicus                1613.00          4.90
Ailurus fulgens                   5740.00          5.11
Arctitis binturong               14280.00         12.54
Canis latrans                    10000.00         14.98
Herpestes auropunctatus            611.00          2.27
Meles meles                      11050.00         16.80
Mustela frenata                    225.00          1.39
Anoura caudifer                     11.50          0.24
Chrotopterus auritus                96.10          0.80
Eptesicus fuscus                    16.90          0.11
Macroderma gigas                   148.00          0.78
Noctilio leporinus                  61.00          0.40
Myrmecophaga tridactyla          30600.00         14.65
Priodontes maximus               45190.00         17.05
Crocidura suaveolens                 7.50          0.12
Didelphis marsupialis             1329.00          3.44
Lasiorhinus latifrons            25000.00         14.08
Elephas maximus                3672000.00       2336.50
Chapter 17
Probability Model Validation
17.1 Introduction ......................................................... 732
17.2 Probability Plots .................................................... 732
     17.2.1 Basic Principles .............................................. 733
     17.2.2 Transformations and Specialized Graph Papers .................. 734
     17.2.3 Modern Probability Plots ...................................... 736
     17.2.4 Applications .................................................. 736
            Safety Data ................................................... 737
            Yield Data .................................................... 737
            Residual Analysis for Regression Model ........................ 737
            Others ........................................................ 737
17.3 Chi-Squared Goodness-of-fit Test ..................................... 739
     17.3.1 Basic Principles .............................................. 739
     17.3.2 Properties and Application .................................... 741
            Poisson Model Validation ...................................... 742
            Binomial Special Case ......................................... 743
17.4 Summary and Conclusions .............................................. 745
REVIEW QUESTIONS .......................................................... 746
EXERCISES ................................................................. 747
APPLICATION PROBLEMS ...................................................... 750
In his pithy statement, "All models are wrong but some are useful," the legendary George E. P. Box of Wisconsin was employing hyperbole to make a subtle but important point. The point, well-known to engineers, is that perfection is not a prerequisite for usefulness in modeling (in fact, it can be an impediment). If complex, real-world problems are to become tractable, idealizing assumptions are inevitable. But what is thus given up in perfection is more than made up for in usefulness, so long as the assumptions can be validated as reasonable. As a result, assessing the reasonableness of inevitable assumptions is, or ought to be, an important part of the modeling exercise; and this chapter is concerned with presenting some techniques for doing just that: validating distributional assumptions. We focus specifically on probability plots and the chi-squared goodness-of-fit test, two time-tested techniques that also happen to complement each other perfectly in such a way that, with one or the other, we are able to deal with both discrete and continuous probability models.
17.1 Introduction
When confronted with a problem involving a randomly varying phenomenon, the approach we have advocated thus far involves first characterizing the random phenomenon in question with an ideal probability model in the form of the pdf, f(x; θ), and then using the model to solve the problem at hand. As we have seen in the preceding chapters, fully characterizing the random phenomenon itself involves first (i) postulating a candidate probability model (for example, the Poisson model for the glass inclusions data of Chapter 1) based on an understanding of the underlying phenomenon; and then, (ii) using statistical inference techniques to obtain (point and interval) estimates of the unknown parameter vector, θ, based on sample data, X₁, X₂, ..., Xₙ. However, before proceeding to use the postulated model to solve the problem at hand, it is always advisable to check to be sure that the model and the implied underlying assumptions are reasonable. The question of interest is therefore as follows:

Given sample data, X₁, X₂, ..., Xₙ, obtained from a random variable, X, with postulated pdf, f(x) (and a corresponding cumulative distribution function (cdf), F(x)), is the postulated probability model reasonable?
17.2 Probability Plots

17.2.1 Basic Principles
Consider that specific experimental data, x₁, x₂, ..., xₙ, have been obtained from a population whose pdf is postulated as f(x) (so that the corresponding cdf, F(x), is also known). To use this data set to check the validity of such a distributional assumption, we start by ordering the data from the smallest to the largest, as x₍₁₎, x₍₂₎, ..., x₍ₙ₎, such that x₍₁₎ ≤ x₍₂₎ ≤ ... ≤ x₍ₙ₎; i.e., x₍₁₎ is the smallest of the set, followed by x₍₂₎, etc., with x₍ₙ₎ as the largest. For example, one of the data sets on the waiting time (in days) until the occurrence of a recordable safety incident at a certain company's manufacturing site was given in Example 14.3 as S1 = {16, 1, 9, 34, 63, 44, 1, 63, 41, 29}, a sample of size n = 10. This was postulated to be from an exponential distribution. When rank ordered, this data set becomes S1r = {1, 1, 9, 16, 29, 34, 41, 44, 63, 63}.

Now, observe that because of random variability, X₍ᵢ₎ₙ, the iᵗʰ ranked observation of an n-sample set, will be a random variable whose observed value will change from one sample to the next. For example, we recall from the same Example 14.3 that a second sample of size 10, obtained from the same company the following year, was given as S2 = {35, 26, 16, 23, 54, 13, 100, 1, 30, 31}; it is also considered to be from the same exponential distribution. When rank ordered, this data set yields S2r = {1, 13, 16, 23, 26, 30, 31, 35, 54, 100}. Note that the value for x₍₅₎, the fifth ranked from below, is 29 for S1, but 26 for S2.
Now, define the expected value of X₍ᵢ₎ₙ as:
$$E\left[X_{(i)n}\right] = \mu_{(i)n} \qquad (17.1)$$
This quantity can be computed for any given f(x), in much the same way that we are able to compute the expectation, E(X), of the regular, unranked random variable. The fundamental principle behind probability plots is that if the sample data set truly came from a population with the postulated pdf, then a plot of the ordered sample observations, x₍ᵢ₎, against their respective expected values, μ₍ᵢ₎ₙ, will lie on a straight line, with deviations due only to random variability. Any significant departure from this straight line will indicate that the distributional assumption is not true.
However, with the exception of the very simplest of pdfs, obtaining an exact closed form expression for $E[X_{(i)n}] = \mu_{(i)n}$ is not a trivial exercise. Nevertheless, using techniques that lie outside the intended scope of this book, it is possible to show that the expression
$$E\left[X_{(i)n}\right] = \mu_{(i)n} = F^{-1}\left(\frac{i - \gamma}{n - 2\gamma + 1}\right); \quad i = 1, 2, \ldots, n \qquad (17.2)$$
is a very good approximation that is valid for all pdfs (the constant γ is defined
in which case
$$P\left[X \leq E\left(X_{(i)n}\right)\right] = \frac{i - \gamma}{n - 2\gamma + 1} \qquad (17.4)$$
For our purposes here, the most important implication of this result is that if i is the rank of the rank ordered observation, x₍ᵢ₎, from a population with a postulated pdf f(x), then the associated theoretical cumulative probability is (i − γ)/(n − 2γ + 1). In other words, the [(i − γ)/(n − 2γ + 1)] × 100% percentile determined from the theoretical cdf is $E[X_{(i)n}]$.
The constant, γ, depends on the sample size, n, and on f(x). However, for all practical purposes, the value γ = 0.5 has been found to work quite well for a wide variety of distributions, the exception being the uniform distribution, for which a closed form expression is easily obtained as $E[X_{(i)n}] = \mu_{(i)n} = i/(n+1)$, so that in this case, the appropriate value is γ = 0.

Observe in summary, therefore, that the principle behind the probability plot calls for rank ordering the data, then plotting the rank ordered data, x₍ᵢ₎, versus its expected value, μ₍ᵢ₎ (where for convenience, we have dropped the indicator of sample size). From Eq (17.3), obtaining μ₍ᵢ₎ requires computing the value of the (i − γ)/(n − 2γ + 1) quantile from the theoretical cdf, F(x). A plot of μ₍ᵢ₎ on a regular, uniform-scale cartesian y-axis, against x₍ᵢ₎ on the similarly scaled cartesian x-axis, will show a straight line relationship when the underlying assumptions are reasonable.
17.2.2 Transformations and Specialized Graph Papers
resulting probability plots used to test for normality, are routinely referred to as "normal plots" or "normal probability plots." Corresponding graph papers exist for the exponential, gamma, and a few other distributions.

The advent of modern computer software packages has not only made these graphs obsolete; it has also made it possible for the technique to be more objective. But before proceeding to discuss the modern approach, we first illustrate the mechanics of the traditional approach to probability plotting with the following example.
Example 17.1: PROBABILITY PLOTTING FOR SAFETY INCIDENT DATA
Given the data set S1 = {16, 1, 9, 34, 63, 44, 1, 63, 41, 29} for the waiting time (in days) until the occurrence of a recordable safety incident, generate the data table needed for constructing a probability plot for this putative exponentially distributed random variable. Use γ = 0.5.
Solution:
Upon rank-ordering the data, we are able to generate Table 17.1 using qᵢ defined as:
$$q_i = \frac{i - 0.5}{10} \qquad (17.6)$$
A plot of x₍ᵢ₎ directly versus qᵢ on exponential probability paper should yield a straight line if the underlying distribution is truly exponential. Note that without the probability paper, it will be necessary to obtain μ₍ᵢ₎ first from
$$F(x) = 1 - e^{-x/\beta} \qquad (17.7)$$
the cdf for the E(β) random variable, i.e.,
$$\mu_{(i)} = -\beta \ln(1 - q_i) \qquad (17.8)$$
and x₍ᵢ₎ may then be plotted against μ₍ᵢ₎ on regular graph paper using regular uniform scales on each axis.
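The same computations are easy to reproduce programmatically; the sketch below (an illustration in Python, whereas the text works with probability paper and MINITAB) generates the qᵢ and μ₍ᵢ₎ columns, using the sample mean as the estimate of β:

```python
# Reproduce the Example 17.1 table: rank the data, form q_i = (i-0.5)/n,
# and get mu_(i) from the inverse exponential cdf, mu_(i) = -beta*ln(1-q_i).
import numpy as np

S1 = np.array([16, 1, 9, 34, 63, 44, 1, 63, 41, 29], dtype=float)
x = np.sort(S1)                       # rank ordered data x_(i)
n = x.size
i = np.arange(1, n + 1)
q = (i - 0.5) / n                     # Eq (17.6), gamma = 0.5
beta_hat = x.mean()                   # MLE of the exponential parameter
mu = -beta_hat * np.log(1.0 - q)      # Eq (17.8)

for xi, qi, mi in zip(x, q, mu):
    print(f"{xi:5.0f}  {qi:4.2f}  {mi:7.2f}")   # plot x_(i) vs mu_(i)
```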
17.2.3 Modern Probability Plots

17.2.4 Applications
[Figure: Probability Plot of X1; Exponential - 95% CI. Summary statistics: Mean = 30.10, N = 10, AD = 0.638, P-Value = 0.298]
[Figure: Probability Plot of X2; Exponential - 95% CI. Summary statistics: Mean = 32.9, N = 10, AD = 0.536, P-Value = 0.412]
FIGURE 17.1: Probability plots for safety data postulated to be exponentially distributed, each showing (a) the rank ordered data; (b) the theoretical fitted cumulative probability distribution line along with the associated 95% confidence intervals; and (c) a list of summary statistics, including the p-value associated with a formal goodness-of-fit test. The indication from the p-values is that there is no evidence to reject H₀; the model therefore appears to be adequate.
[Figure: Probability Plot of X2; Normal - 95% CI. Summary statistics: Mean = 32.9, StDev = 27.51, N = 10, AD = 0.701, P-Value = 0.045]
FIGURE 17.2: Probability plot for safety data S2 wrongly postulated to be normally distributed. The departure from the linear fit does not appear too severe, but the low/borderline p-value (0.045) objectively compels us to reject H₀ at the 0.05 significance level and conclude that the Gaussian model is inadequate for this data.
Others

In addition to the probability plots illustrated above for the exponential and Gaussian distributions, MINITAB can also generate probability plots for several other distributions, including the lognormal, gamma, and Weibull distributions, all continuous distributions.

Probability plots are not used for discrete probability models, in part because the associated cdfs consist of a series of discontinuous step functions, not smooth curves like continuous random variable cdfs. To check the validity of discrete distributions such as the binomial and Poisson, it is necessary to use the more versatile technique discussed next.
17.3 Chi-Squared Goodness-of-fit Test

17.3.1 Basic Principles
[Figure: Probability Plot of YA; Normal - 95% CI. Summary statistics: Mean = 75.52, StDev = 1.432, N = 50, AD = 0.566, P-Value = 0.135]
[Figure: Probability Plot of YB; Normal - 95% CI. Summary statistics: Mean = 72.47, StDev = 2.764, N = 50, AD = 0.392, P-Value = 0.366]
FIGURE 17.3: Probability plots for yield data sets YA and YB postulated to be normally distributed. The 95% confidence intervals around the fitted lines, along with the indicated p-values, strongly suggest that the distributional assumptions appear to be valid.
[Figure: Normal probability plot (95% CI) of the standardized residuals, plotted on a percent scale against standardized residual values from -3 to 1]
FIGURE 17.4: Normal probability plot for the residuals of the regression analysis of the dependence of thermal conductivity, k, on temperature in Example 16.5. The postulated model, a two-parameter regression model with Gaussian distributed zero-mean errors, appears valid.
how histograms are generated (see Chapter 12). From the postulated probability model, and its p parameters estimated from the sample data, the theoretical (i.e., expected) frequency associated with each of the m groups, φᵢ; i = 1, 2, ..., m, is then computed. If the postulated model is correct, the observed and expected frequencies should be close. Because the observed frequencies are subject to random variability, their closeness to the corresponding theoretical expectations is quantified by
$$C^2 = \sum_{i=1}^{m} \frac{(f_i^o - \phi_i)^2}{\phi_i} \qquad (17.9)$$
which, when the postulated model is correct, has an approximate chi-squared distribution with m − p − 1 degrees of freedom:
$$C^2 \sim \chi^2(m - p - 1) \qquad (17.10)$$
17.3.2 Properties and Application
x (No. of     Observed    Poisson     Expected
inclusions)   Frequency   f(x|λ̂)      Frequency
0             22          0.3618      21.708
1             23          0.3678      22.070
2             11          0.1870      11.219
≥3             4          0.0834       5.004
$$C^2 = \frac{(22 - 21.708)^2}{21.708} + \frac{(23 - 22.070)^2}{22.070} + \frac{(11 - 11.219)^2}{11.219} + \frac{(4 - 5.004)^2}{5.004} \qquad (17.11)$$
$$= 0.249 \qquad (17.12)$$
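For reference, the same test is available in software; a sketch of mine using scipy (with ddof = 1 accounting for the single estimated Poisson parameter, so that the degrees of freedom are 4 − 1 − 1 = 2):

```python
# Chi-squared goodness-of-fit test for the inclusions data.
import numpy as np
from scipy import stats

observed = np.array([22, 23, 11, 4])
probs = np.array([0.3618, 0.3678, 0.1870, 0.0834])     # Poisson f(x|lambda_hat)
res = stats.chisquare(observed, f_exp=observed.sum() * probs, ddof=1)
print(res.statistic, res.pvalue)    # statistic ~ 0.249; large p: no evidence
```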
[Figure: two bar charts; Expected vs Observed frequency by number of inclusions (0, 1, 2, ≥3), and each group's contribution to the chi-squared statistic]
FIGURE 17.5: Chi-Squared test results for inclusions data and a postulated Poisson
model. Top panel: Bar chart of Expected and Observed frequencies, which shows
how well the model prediction matches observed data; Bottom Panel: Bar chart of
contributions to the Chi-squared statistic, showing that the group of 3 or more inclusions
is responsible for the largest model-observation discrepancy, by a wide margin.
$$C^2 = \frac{(x - np)^2}{np} + \frac{\left[(n - x) - nq\right]^2}{nq} \qquad (17.13)$$
upon introducing q = 1 − p for the first term in the numerator and taking advantage of the "difference of two squares" result in algebra, the right hand side of the equation rearranges easily to give the result:
$$C^2 = \frac{(x - np)^2}{npq} \qquad (17.14)$$
(17.15)
17.4 Summary and Conclusions
This chapter has been primarily concerned with examining two methods for validating probability models: modern probability plots and the chi-squared goodness-of-fit test. While we presented the principles behind these methods, we concentrated more on applying them, particularly with the aid of computer programs. With some perspective, we may now observe the following as the main points of the chapter:

- Probability plots augmented with theoretical model fits and p-values are most appropriate for continuous models;
- Chi-squared tests, on the other hand, are more naturally suited to discrete models (although they can also be applied to continuous models after appropriate discretization).

As a practical matter, it is important to keep in mind that, just as with other hypothesis tests, a postulated probability model can never be completely proven adequate by these tests (on the basis of finite sample data), but inadequate models can be successfully identified as such. Still, it can be difficult to identify inadequate models with these tests when sample sizes are small; our chances of identifying inadequate models correctly as inadequate improve significantly as n → ∞. Therefore, as much sample data as possible should be used to validate probability models; and wherever possible, the data set used to validate a model should be collected independently of that used to estimate the parameters. Some of the end-of-chapter exercises and application problems are used to reinforce these points.

Finally, it must be kept in mind always that no model is (or can ever be) perfect. The final decision about the validity of the model assumptions rests with the practitioner, the person who will ultimately use these models for problem solving, and these tests should be considered properly only as objective guides, not as final and absolute arbiters.
REVIEW QUESTIONS
1. What is the primary question of interest in probability model validation?
2. What are the two approaches discussed in this chapter for validating probability
models?
3. Which approach is better suited to continuous probability models and which one
is applicable most directly to discrete probability models?
4. What is the fundamental principle behind probability plots?
5. What is the fundamental concept behind old-fashioned probability plots?
6. What hypothesis test accompanies modern probability plots?
7. What does a modern probability plot consist of?
8. Why are probability plots not used for discrete probability models?
9. What is a chi-squared goodness-of-fit test?
EXERCISES

17.1 In Example 16.5, a two-parameter regression model of how a metal's thermal conductivity varies with temperature was developed from the data shown here again for ease of reference.

k (W/m-°C)   Temperature (°C)
 93.228      100
 92.563      150
 99.409      200
101.590      250
111.535      300
115.874      350
119.390      400
126.615      450

The two-parameter model was postulated for the relationship between k and T with the implicit assumption that the errors are normally distributed. Obtain the residuals from the least-squares fit and generate a normal probability plot (and ancillary analysis) of these residuals. Comment on the validity of the normality assumption for the regression model errors.
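A sketch of one way to carry out this exercise programmatically (the text uses MINITAB; scipy's probplot returns the plot coordinates and the correlation of the fitted line):

```python
# Fit the line, then get normal probability plot coordinates for the residuals.
import numpy as np
from scipy import stats

T = np.array([100.0, 150, 200, 250, 300, 350, 400, 450])
k = np.array([93.228, 92.563, 99.409, 101.590, 111.535,
              115.874, 119.390, 126.615])

b1, b0 = np.polyfit(T, k, 1)
resid = k - (b0 + b1 * T)

(osm, osr), (slope, intercept, r) = stats.probplot(resid, dist="norm")
print(r)   # correlation of the plot points; values near 1 support normality
```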
17.2 In Problem 16.28, the data in the table below¹ was presented as the basis for calibrating a near-infrared instrument to be used to determine protein content in wheat from reflectance measurements.

For this data set to produce a useful calibration curve, the regression model must be adequate; and an important aspect of regression model adequacy is the nature of its residuals. In this particular case, the residuals are required to be random and approximately normally distributed. By analyzing the residuals from the regression exercise appropriately, comment on whether or not the resulting regression model should be considered adequate.
Protein       Observed        Protein       Observed
Content % y   Reflectance x   Content % y   Reflectance x
 9.23         386             10.57         443
 8.01         383             10.23         450
10.95         353             11.87         467
11.67         340              8.09         451
10.41         371             12.55         524
 9.51         433              8.38         407
 8.67         377              9.64         374
 7.75         353             11.35         391
 8.05         377              9.70         353
11.39         398             10.75         445
 9.95         378             10.75         383
 8.25         365             11.47         404
17.3 The following data is postulated to have been sampled from an exponential E(4) population. Validate the postulated model appropriately. Repeat the validation exercise as if the population parameter was unknown and hence must be estimated from the sample data. Does knowing the population parameter independently make any difference in this particular case?
6.99   0.52   10.36   5.75   2.84
0.67   1.66    0.12   0.41   2.72
3.26   6.51    3.75   5.22   1.78
4.05   2.16   16.65   1.31   1.52
17.4 The data below are random samples from two independent lognormal distributions; specifically, $X_{L1} \sim L(0, 0.25)$ and $X_{L2} \sim L(0.25, 0.25)$.

XL1       XL2
0.81693   1.61889
0.96201   1.15897
1.03327   1.17163
0.84046   1.09065
1.06731   1.27686
1.34118   0.91838
0.77619   1.45123
1.14027   1.47800
1.27021   2.16068
1.69466   1.46116
(i) Test the validity of these statements directly from the data as presented.
(ii) Test the validity of these statements indirectly by taking a logarithmic transformation of the data, and carrying out an appropriate analysis of the resulting
log-transformed data. Compare the results with those obtained in (i).
17.5 If X₁ is a lognormal random variable with parameters (α, β₁), and X₂ is a lognormal random variable with parameters (α, β₂), it has been postulated that the product
$$Y = X_1 X_2$$
X1        X2
0.81693   1.61889
0.96201   1.15897
1.03327   1.17163
0.84046   1.09065
1.06731   1.27686
1.34118   0.91838
0.77619   1.45123
1.14027   1.47800
1.27021   2.16068
1.69466   1.46116
From this data set, obtain the corresponding values for Y defined as the product Y = X₁X₂. According to the result stated and proved in (i), what is the theoretical distribution of Y? Confirm that the computed sample data set for Y agrees with this postulate.
17.6 The data in the following table (Exercise 12.12) shows samples of size n = 20 drawn from four different populations postulated to be normal, N, lognormal, L, gamma, G, and inverse gamma, I, respectively.
XN        XL        XG        XI
 9.3745    7.9128   10.0896   0.084029
 8.8632    5.9166   15.7336   0.174586
11.4943    4.5327   15.0422   0.130492
 9.5733   33.2631    5.5482   0.115567
 9.1542   24.1327   18.0393   0.187260
 9.0992    5.4151   17.9543   0.100054
10.2631   16.9556   12.5549   0.101405
 9.8737    3.9345    9.6640   0.100835
 7.8192   35.0376   14.2975   0.097173
10.4691   25.1182    4.2599   0.141233
 9.6981    1.1804   19.1084   0.060470
10.5911    2.3503    7.0735   0.127663
11.6526   15.6894    7.6392   0.074183
10.4502    5.8929   14.1899   0.086606
10.0772    8.0254   13.8996   0.084915
10.2932   16.1482    9.7680   0.242657
11.7755    0.6848    8.5779   0.052291
 9.3790    6.6974    7.5486   0.116172
 9.9202    3.6909   10.4043   0.084339
10.9067   34.2152   14.8254   0.205748
(i) Validate these postulates using the full data sets. Note that the population parameters have not been specified.
(ii) Using only the top half of each data set, repeat (i). For this particular example, what effect, if any, does sample size have on the probability plots approach to probability model validation?
17.7 The data in the table below was presented in Exercise 15.18 as a random sample of 15 observations each from two normal populations with unknown means and
variances. Test the validity of the normality assumption for each data set. Interpret
your results.
Sample     X       Y
 1        12.03   13.74
 2        13.01   13.59
 3         9.75   10.75
 4        11.03   12.95
 5         5.81    7.12
 6         9.28   11.38
 7         7.63    8.69
 8         5.70    6.39
 9        11.75   12.01
10         6.28    7.15
11        12.53   13.47
12        10.22   11.57
13         7.17    8.81
14        11.36   13.10
15         9.16   11.32
APPLICATION PROBLEMS
17.8 The following data set, from a study by Lucas (1985)², shows the number of accidents occurring per quarter (three months) at a DuPont company facility, over a 10-year period. The data set has been partitioned into two periods: Period I is the first five-year period of the study; Period II, the second five-year period.
Period I:  5, 4, 2, 5, 6, 5, 10, 8, 5, 7, 3, 8, 6, 9, 6, 5, 10, 3, 3, 10
Period II: 3, 1, 7, 1, 4, 4, 2, 0, 3, 2, 2, 7, 1, 4, 2, 2, 1, 4, 4, 4
When the data set was presented in Problem 13.22, it was simply stated as a matter of fact that a Poisson pdf was a reasonable model for representing this data. Check the validity of this assumption for Period I data and for Period II data separately.

²Lucas, J.M., (1985). "Counted data CUSUM's," Technometrics, 27(2), 129-144.
17.9 Refer to Problem 17.8. One means by which one can test if two samples are truly from the same population is to compare the empirical distributions obtained from each data set against each other directly; if the two samples are truly from the same distribution, within the limits of experimental error, there should be no significant difference between the two empirical distributions. State an appropriate null hypothesis and the alternative hypothesis and carry out a chi-squared goodness-of-fit test of the empirical distribution of the data for Period I versus that of Period II. State your conclusions clearly.
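One concrete way to set up such a test, sketched below under the assumption (mine) that the counts are pooled into a few categories so the expected frequencies are not too small, is a chi-squared test of homogeneity on the resulting contingency table:

```python
# Tabulate the quarterly accident counts into categories for each period,
# then test homogeneity of the two empirical distributions.
import numpy as np
from scipy import stats

period1 = [5, 4, 2, 5, 6, 5, 10, 8, 5, 7, 3, 8, 6, 9, 6, 5, 10, 3, 3, 10]
period2 = [3, 1, 7, 1, 4, 4, 2, 0, 3, 2, 2, 7, 1, 4, 2, 2, 1, 4, 4, 4]

bins = [0, 3, 6, 9, 12]       # pooled categories (an arbitrary but sane choice)
f1, _ = np.histogram(period1, bins=bins)
f2, _ = np.histogram(period2, bins=bins)

chi2, p, dof, _ = stats.chi2_contingency([f1, f2])
print(chi2, p, dof)
```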
17.10 The table below (see Problem 9.40) shows frequency data on distances between DNA replication origins (inter-origin distances), measured in vivo in Chinese Hamster Ovary (CHO) cells by Li et al. (2003)³, as reported in Chapter 7 of Birtwistle (2008)⁴. Phenomenologically, the inter-origin distance should be a gamma distributed random variable, and this data set has been analyzed in Birtwistle (2008) on this basis. Carry out a formal test to validate the gamma model assumption. Interpret your results.
Inter-Origin        Relative
Distance (kb), x    Frequency fr(x)
  0                 0.00
 15                 0.02
 30                 0.20
 45                 0.32
 60                 0.16
 75                 0.08
 90                 0.11
105                 0.03
120                 0.02
135                 0.01
150                 0.00
165                 0.01
17.11 The time in months between occurrences of safety violations in a toll manufacturing facility is shown in the table below for three operators, A, B, C.
  A       B       C
 1.31    1.94    0.79
 0.15    3.21    1.22
 3.02    2.91    0.65
 3.17    1.66    3.90
 4.84    1.51    0.18
 0.71    0.30    0.57
 0.70    0.05    7.26
 1.41    1.62    0.43
 2.68    6.75    0.96
 0.68    1.29    3.76
It is customary to postulate an exponential probability model for this phenomenon. Is this a reasonable postulate for each data set in this collection? Support
your answer adequately.
³Li, F., J. Chen, E. Solessio, and D.M. Gilbert, (2003). "Spatial distribution and specification of mammalian replication origins during G1 phase," J. Cell Biol., 161, 257-266.
⁴M.R. Birtwistle, (2008). Modeling and Analysis of the ErbB Signaling Network: From Single Cells to Tumorigenesis, PhD Dissertation, University of Delaware.
17.12 The data table below (also presented in Problem 8.26) shows x, the number of individuals per species, x = 1, 2, ..., 24, and the associated number of Malayan butterfly species observed to have x individuals. When the data was first published and analyzed in Fisher et al. (1943)⁵, the logarithmic series distribution (see Exercise 8.13), with the pdf
$$f(x) = \frac{\alpha p^x}{x}; \quad 0 < p < 1; \; x = 1, 2, \ldots,$$
where
$$\alpha = \frac{-1}{\ln(1 - p)}$$
was proposed as the appropriate model for the phenomenon in question. This pdf has since become the model of choice for data involving this phenomenon.
x                     1    2    3    4    5    6    7    8    9   10   11   12
Observed Frequency  118   74   44   24   29   22   20   19   20   15   12   14

x                    13   14   15   16   17   18   19   20   21   22   23   24
Observed Frequency    6   12    6    9    9    6   10   10   11    5    3    3

Formally validate this model for this specific data set. What is the p-value associated with the test? What does it indicate about the validity of this model?
17.13 The data in the table below is the time-to-publication of 85 papers published in the January 2004 issue of a leading chemical engineering research journal. (See Problem 1.13).

⁵Fisher, R.A., A.S. Corbet, and C.B. Williams, (1943). "The relation between the number of species and the number of individuals in a random sample of an animal population," Journal of Animal Ecology, 12, 42-58.
15.1   5.3  12.0   3.0  18.5   5.8   6.8  14.5   3.3  11.1
16.4   7.3   7.4   7.3   5.2  10.2   3.1   9.6  12.9  17.3
 6.0  24.3  21.3  19.3   2.5   9.1   8.1   9.8  15.4  15.7
 8.2   8.8   7.2  12.8   4.2   4.2   7.8   9.5   3.9   8.7
 5.9   5.3   1.8  10.1  10.0  18.7   5.6   3.3   7.3  11.3
 2.9   5.4  15.2   8.0  11.7  17.2   4.0   3.8   7.4   5.3
10.6  15.2  11.5   5.9  20.1  12.2  12.0   8.8
Percent of population with income level x: 4, 13, 17, 20, 16, 12, 7, 4, 3, 2, 1, 1
It has been postulated that the lognormal distribution is a reasonable model for this phenomenon. Carry out an appropriate test to confirm or refute this postulate. Keep in mind that the data is not in raw form, but has already been processed into a frequency table. Interpret your results.
17.16 The appropriate analysis of over-dispersed Poisson phenomena with the negative binomial distribution was pioneered with the classic data and analysis of Greenwood and Yule (1920)⁶. The data in question, shown in the table below (see Problem 8.28), is the frequency of accidents occurring, over a five-week period, to 647 women making high explosives during World War I.
Number of Accidents   Observed Frequency
0                     447
1                     132
2                      42
3                      21
4                       3
5+                      2
First, determine from the data parameter estimates for a Poisson model, and then determine k and p for a negative binomial model (see Problem 14.40). Next, conduct formal chi-squared goodness-of-fit tests for the Poisson model and then for the negative binomial. Interpret your test results. From your analysis, which model is more appropriate for this data set?
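A sketch of the parameter estimation step (method-of-moments formulas under one common negative binomial parameterization; treating the "5+" cell as exactly 5 is a simplifying assumption of mine):

```python
# Method-of-moments estimates: Poisson lambda = xbar; negative binomial
# p = xbar/s2 and k = xbar^2/(s2 - xbar), valid when s2 > xbar.
import numpy as np

counts = np.array([0, 1, 2, 3, 4, 5])
freq = np.array([447, 132, 42, 21, 3, 2])   # '5+' treated as 5 here

n = freq.sum()
xbar = np.sum(counts * freq) / n
s2 = np.sum(freq * (counts - xbar) ** 2) / (n - 1)

lam = xbar
p = xbar / s2
k = xbar**2 / (s2 - xbar)
print(lam, p, k)   # s2 > xbar signals over-dispersion relative to the Poisson
```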
17.17 Mee (1990)⁷ presented the following data on the wall thickness (in inches) of cast aluminum cylinder heads used in aircraft engine cooling jackets. When presented in Problem 14.44 (and following the maximum entropy arguments of Problem 10.15), the data was assumed to be a random sample from a normal population. Validate this normality assumption and comment on whether or not this is a reasonable assumption.
0.223   0.201   0.228   0.223   0.214   0.224
0.193   0.231   0.223   0.237   0.213   0.217
0.218   0.204   0.215   0.226   0.233   0.219
17.18 A sample of 20 silicon wafers selected and examined for flaws produced the result (the number of flaws found on each wafer) shown in the following table.
3   4   0   1   0   2   2   3   3   2
0   1   3   2   2   4   1   0   2   1
When this data set was first presented in Problem 12.20, it was suggested that the Poisson model is reasonable for problems of this type. For this particular problem, however, is this a reasonable model? Interpret your results.
17.19 According to census records, the age distribution of the inhabitants of the United States in 1960 and in 1980 is as shown in the table below.

⁶Greenwood, M., and G.U. Yule, (1920). "An enquiry into the nature of frequency distributions representative of multiple happenings with particular reference to multiple attacks of disease or of repeated accidents," Journal of the Royal Statistical Society, 83, 255-279.
⁷Mee, R.W., (1990). "An improved procedure for screening based on a correlated, normally distributed variable," Technometrics, 32, 331-337.
1960      1980
20,321    16,348
18,692    16,700
16,773    18,242
13,219    21,168
10,801    21,319
10,869    19,521
11,949    17,561
12,481    13,965
11,600    11,669
10,879    11,090
 9,606    11,710
 8,430    11,615
 7,142    10,088
16,560    25,550
(i) It is typical to assume that such data are normally distributed. Is this a reasonable assumption in each case?
(ii) Visually, the two distributions appear different. But are they significantly so? Carry out an appropriate test to check the validity of any assumption of equality of these two age distributions.
17.20 In Problem 13.34 and in Example 15.1, it was assumed that the following
data, two sets of random samples of trainee scores from large groups of trainees
instructed by Method A and Method B, are both normally distributed.
Method A:  71   75   65   69   73   66   68   71   74   68
Method B:  72   77   84   78   69   70   77   73   65   75
Carry out an appropriate test and confirm whether or not such an assumption is justified.
17.21 The data below is the computed fractional intensity, x = I_test/(I_test + I_ref), for a collection of special genes (known as "housekeeping genes"), where I_test is the measured fluorescence intensity under test conditions, and I_ref the intensity under reference conditions. If these 10 genes are true housekeeping genes, then, within the limits of measurement noise, the computed values of x should come from a symmetric Beta distribution with mean value 0.5. Use the method of moments to estimate parameter values for the postulated Beta B(α, β) distribution. Carry out an appropriate test to validate the Beta model hypothesis.
xᵢ: 0.585978, 0.504057, 0.182831, 0.426575, 0.455191, 0.804720, 0.741598, 0.332909, 0.532131, 0.610620
17.22 Padgett and Spurrier (1990)⁸ obtained the following data set for the breaking strengths (in GPa) of carbon fibers used in making composite materials.
1.4   3.2   2.2   1.8   1.6   3.7   1.6   1.2   0.4   1.1
3.0   0.8   5.1   3.7   2.0   1.4   5.6   2.5   2.5   1.6
1.0   1.7   1.2   0.9   2.1   2.8   1.6   3.5   1.6   1.9
4.9   2.0   2.2   2.8   2.9   3.7   1.2   1.7   4.7   2.8
1.8   1.1   1.3   2.0   2.1   1.6   1.7   4.4   1.8   3.7
(i) In their analysis, Padgett and Spurrier postulated a Weibull W(ζ, β) distribution model with parameters ζ = 2.0 and β = 2.5 for the phenomenon in question. Validate this model assumption by carrying out an appropriate test.
(ii) Had the model parameters not been given, so that their values must be determined from the data, repeat the test in (i) and compare the results. What does this imply about the importance of obtaining independent parameter estimates before carrying out probability model validation tests?
17.23 The data set below, from Holmes and Mergen (1992)⁹, is a sample of viscosity measurements taken from ten consecutive, but independent, batches of a product made in a batch chemical process:

S10 = {13.3, 14.5, 15.3, 15.3, 14.3, 14.8, 15.2, 14.9, 14.6, 14.1}

Part of the assumption in the application noted in the reference is that this data constitutes a random sample from a normal population with unknown mean and unknown variance. Confirm whether or not this is a reasonable assumption.
Chapter 18
Nonparametric Methods
18.1 Introduction ......................................................... 758
18.2 Single Population .................................................... 760
     18.2.1 One-Sample Sign Test .......................................... 760
            Basic Test Characteristics .................................... 760
            Comparison with Parametric Alternatives ....................... 763
     18.2.2 One-Sample Wilcoxon Signed Rank Test .......................... 763
            Basic Test Characteristics .................................... 763
            Comparison with Parametric Alternatives ....................... 765
18.3 Two Populations ...................................................... 765
     18.3.1 Two-Sample Paired Test ........................................ 766
     18.3.2 Mann-Whitney-Wilcoxon Test .................................... 766
            Basic Test Characteristics .................................... 766
            Comparison with Parametric Alternatives ....................... 769
18.4 Probability Model Validation ......................................... 769
     18.4.1 The Kolmogorov-Smirnov Test ................................... 770
            Basic Test Characteristics .................................... 770
            Key Features ................................................... 771
     18.4.2 The Anderson-Darling Test ..................................... 771
            Key Features ................................................... 772
18.5 A Comprehensive Illustrative Example ................................. 772
     18.5.1 Probability Model Postulate and Validation .................... 772
     18.5.2 Mann-Whitney-Wilcoxon Test .................................... 775
18.6 Summary and Conclusions .............................................. 775
REVIEW QUESTIONS .......................................................... 778
EXERCISES ................................................................. 781
APPLICATION PROBLEMS ...................................................... 784
posed to make our data analysis lives easier are invalid? In particular, what happens when real life does not cooperate with the Gaussian distributional assumptions required for carrying out the t-tests and the F-tests, and other similar tests on which important statistical decisions rest?

Many tiptoe nervously around such issues, in the hope that the repercussions of invalid assumptions will be minimal; some stubbornly refuse to believe that violating any of these assumptions can really have any meaningful impact on their analyses; still others naively ignore such issues, primarily out of a genuine lack of awareness of the distributional requirements at the foundation of these analysis techniques. But none of this is an acceptable option for the well-trained engineer or scientist.

The objective in this chapter is to present some viable alternatives to consider when distributional assumptions are invalid. These techniques, which make little or no demand on the specific distributional structure of the population from whence the data came, are sometimes known as distribution-free methods. And precisely because they do not involve population parameters, in contrast to the distribution-based techniques, they are also known as nonparametric methods. Inasmuch as entire textbooks have been written on the subject of nonparametric statistics (complete treatises on statistical analysis without the support, and, some would say, encumbrance, of hard probability distribution models), the discussion here will necessarily be limited to only the few most commonly used techniques. And to put the techniques in proper context, we will compare and contrast these nonparametric alternatives with the corresponding parametric methods, where possible.
18.1 Introduction
There are at least two broad classes of problems for which the classical hypothesis tests discussed in Chapter 15 are unsuitable:

1. When the underlying distributional assumptions (especially the Gaussian assumptions) are seriously violated;
2. When the data in question is ordinal only, not measured on a quantitative scale in which the distance between succeeding entities is uniform or even meaningful (see Chapter 12).

In each of these cases, even in the absence of any knowledge of the mathematical characteristics of the underlying distributions, the sample data can always be rank ordered by magnitude. The data ranks can then be used to analyze such data with little or no assumptions about the probability distributions of the populations.
TABLE 18.1: A professor's teaching evaluation scores organized by student type

Graduate    Undergraduate
Students    Students
3           4
4           3
4           4
2           2
3           3
4           5
4           3
5           3
2           4
4           2
4           2
4           3
            4
            3
            4
Such nonparametric (or distribution-free) techniques have the obvious advantage that they are versatile: they can be used in a wide variety of cases, even when distributional assumptions are valid. By the same token, they are also quite robust (there are fewer or no assumptions to violate). However, for the very same reasons, they are not always the most efficient. For the same sample size, the power (the probability of correctly rejecting a false H0) is always higher for the parametric tests discussed in Chapter 15 when the assumptions are valid, compared to the power of a corresponding nonparametric test. Thus, if distributional assumptions are reasonably valid, parametric methods are preferred; when the assumptions are seriously in doubt, nonparametric methods provide a viable (perhaps even the only) alternative.
Finally, consider the case where a professor who taught an intermediate-level statistics course to a class that included both undergraduate and graduate students is evaluated on a scale from 1 to 5, where 5 is the highest rating. The evaluation scores from a class of 12 graduate and 15 undergraduate students are shown in Table 18.1. If the desire is to test whether or not the professor received more favorable ratings from graduate students, observe that this data set is ordinal only; it is not the usual quantitative data that is amenable to the usual parametric methods. But this ordinal data can be ranked since, regardless of the distance between each assigned number, we know that 5 is "better" than 4, which is in turn "better" than 3, etc. The question of whether graduate students rated the professor higher than undergraduate students can therefore be addressed with the rank-based techniques presented in this chapter.
Nonparametric tests for comparing more than two populations will be discussed at the appropriate places in the next chapter. Our presentation here will focus on the underlying principles, with some simple illustrations of the mechanics involved in the computations; much of the computational detail will be left to, and illustrated with, computer programs that have facilitated the modern application of these techniques.
18.2 Single Population

18.2.1 One-Sample Sign Test

Let X1, X2, ..., Xn be a random sample from a population with unknown median η. The hypothesis to be tested is

H0: η = η0    (18.1)

against one of the usual alternatives,

HaL: η < η0;  HaU: η > η0;  or  Ha: η ≠ η0    (18.2)

For each observation Xi, consider the deviation from the postulated median,

Xi − η0    (18.3)

Furthermore, suppose that we are concerned for the moment only with the sign of this quantity, the magnitude being of no interest for now: i.e., all we care about is whether Xi shows a positive deviation from the postulated median (when Xi − η0 > 0) or it shows a negative deviation (when Xi − η0 < 0). Then,

Dmi = +  if Xi − η0 > 0;    Dmi = −  if Xi − η0 < 0    (18.4)

(Note that the requirement that X must be a continuous random variable rules out, in principle though not necessarily in practice, the potential for a "tie" where Xi exactly matches the value for the postulated median, since the probability of this event occurring is theoretically zero. However, if by chance Xi − η0 = 0, this data point is simply left out of the analysis.)
Observe that as defined in Eq (18.4), there are only two possible outcomes for the quantity Dmi, making it a classic Bernoulli random variable. Now, if H0 is true, the sequence of Dmi observations should contain about an equal number of + and − entries. If T+ is the total number of + signs (representing the total number of positive deviations from the median, arising from observations greater than the median), then this random variable has a binomial distribution with binomial probability of success pB = 0.5 if H0 is true. Observe therefore that T+ has all the properties of a useful test statistic. The following are therefore the primary characteristics of this test:

1. The test statistic is T+, the total number of plus signs;
2. Its sampling distribution is binomial; specifically, if H0 is true, T+ ~ Bi(n, 0.5).

For any specific experimental data, the observed total number of plus signs, t+, can then be used to compute the rejection region or, alternatively, to test directly for significance by computing p-values as follows. For the one-sided lower-tailed alternative, i.e., HaL,

p = P(T+ ≤ t+ | pB = 0.5)    (18.5)

for the upper-tailed alternative, HaU,

p = P(T+ ≥ t+ | pB = 0.5)    (18.6)

and for the two-sided alternative, Ha,

p = 2P(T+ ≤ t+ | pB = 0.5) if t+ < n/2;  p = 2P(T+ ≥ t+ | pB = 0.5) if t+ > n/2    (18.7)
Example 18.1: MEDIAN OF EXPONENTIAL DISTRIBUTION
The data set S1 = {16, 1, 9, 34, 63, 44, 1, 63, 41, 29}, first presented in Example 14.3 (and later in Example 17.1), is the waiting time (in days) until the occurrence of a recordable safety incident in a certain company's manufacturing site. This was postulated to be from an exponential distribution. Use the one-sample sign test to test the null hypothesis that the median η = 21 against the two-sided alternative that it is not.
Solution:
From the given data and the postulated median, we easily generate the following table:
Time Data (S1)   Deviation from η0 = 21   Sign, Dmi
16               −5                        −
1                −20                       −
9                −12                       −
34               13                        +
63               42                        +
44               23                        +
1                −20                       −
63               42                        +
41               20                        +
29               8                         +
This shows six total plus signs, so that t+ = 6. Because of the two-sided alternative hypothesis, and since t+ > n/2 = 5, we need to compute P(T+ ≥ 6) for a Bi(10, 0.5) random variable. This is obtained as:

P(T+ ≥ 6) = 1 − P(T+ ≤ 5) = 1 − 0.623 = 0.377    (18.8)

The p-value associated with this sign test is therefore 2 × 0.377 = 0.754. Therefore, there is no evidence to reject the null hypothesis that the median is 21. (The sample median is 31.50, obtained as the average of the 5th and 6th ranked sample data, 29 and 34.)
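For readers working outside MINITAB, the binomial computation above is easy to reproduce. The following is a minimal sketch in Python, assuming SciPy (version 1.7 or later, for binomtest) is available; it is merely an illustration of Eqs (18.5)-(18.7), not part of the original example.

from scipy.stats import binomtest

s1 = [16, 1, 9, 34, 63, 44, 1, 63, 41, 29]
eta0 = 21  # postulated median

t_plus = sum(1 for x in s1 if x > eta0)   # T+: number of positive deviations
n = sum(1 for x in s1 if x != eta0)       # ties (x == eta0) are discarded

# Under H0, T+ ~ Bi(n, 0.5); two-sided p-value, as in the example
result = binomtest(t_plus, n, p=0.5, alternative='two-sided')
print(t_plus, round(result.pvalue, 3))    # 6, 0.754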
For large n, the Bi(n, 0.5) sampling distribution of T+ tends to the Gaussian N(μ, σ²), with mean μ = npB = 0.5n and standard deviation σ = √(npB(1 − pB)) = 0.5√n. Thus, the test statistic

Z = (T+ − 0.5n) / (0.5√n)    (18.9)

can be used to carry out the sign test for large samples (typically n > 10–15), exactly like the standard z-test.
Comparison with Parametric Alternatives
The sign test is the nonparametric alternative to the one-sample z-test and the one-sample t-test. These parametric tests are for hypotheses concerning the means of normal populations; the sign test, on the other hand, is for the median of any general population with a continuous probability distribution. If the data are susceptible to outliers, or the distribution is long-tailed (i.e., skewed), the sign test is more appropriate; if the distribution is close to normal, the parametric t- and z-tests will perform better.
18.2.2 One-Sample Wilcoxon Signed Rank Test

The Wilcoxon signed rank test also concerns the median of a continuous population but, unlike the sign test, it takes into account the magnitudes of the deviations Xi − η0 and not just their signs: the absolute deviations |Xi − η0| are ranked, and the test statistic, W+, is the sum of the ranks carrying positive signs. (The test does, however, require the population distribution to be symmetric.) W+ attains its maximum value of n(n + 1)/2 when the entire sample exceeds the postulated value for the median, η0. It can attain a minimum value of 0. Between these extremes, W+ takes other values that are easily determined via combinatorial computations. And now, if H0 is true, the sampling distribution for W+ can be computed numerically and used to determine significance levels. The computations involved in determining the sampling distribution of W+ under H0 are cumbersome analytically, but relatively easy to execute with a computer program; the test is thus best carried out with statistical software packages.

As usual, large sample approximations exist. In particular, it can be shown that the sampling distribution for W+ tends to N(μ, σ²) for large n, with μ = n(n + 1)/4 and σ² = n(n + 1)(2n + 1)/24. It is not recommended to use the normal approximation, however, because it is not sufficiently precise; besides, the computations involved in the exact test are trivial for computer programs.

We use the next example to illustrate the mechanics involved in this test before completing the test itself with MINITAB.
Example 18.2: CHARACTERIZING SOME HOUSEKEEPING GENES
In genomic studies, genes that are involved in basic functions needed to keep the cell alive are always turned on (i.e., they are "constitutively expressed"). Such genes are known colloquially as "housekeeping" genes. Because their gene expression status hardly changes, they are sometimes used to calibrate experimental systems for measuring changes in gene expression. Data on 10 such putative housekeeping genes has been selected from a larger set of microarray data and presented as φ, the fractional intensity, Itest/(Itest + Iref), where Itest is the measured fluorescence intensity under test conditions, and Iref, the intensity under reference conditions. Within the limits of random variability, the values of φ determined for housekeeping genes should come from a symmetric distribution scaled between 0 and 1. If these 10 genes are true housekeeping genes, the median of the data population for φ should be 0.5. To illustrate the mechanics involved in using the one-sample Wilcoxon signed rank test to test this hypothesis against the alternative that the median is not 0.5, the following table is a summary of the raw data and the subsequent analysis required for carrying out this test.
φi          Dmi = φi − 0.5   |Dmi|       Rank   Signed Rank
0.585978    0.085978         0.085978    5      +5
0.504057    0.004057         0.004057    1      +1
0.182831    −0.317169        0.317169    10     −10
0.426575    −0.073425        0.073425    4      −4
0.455191    −0.044809        0.044809    3      −3
0.804720    0.304720         0.304720    9      +9
0.741598    0.241598         0.241598    8      +8
0.332909    −0.167091        0.167091    7      −7
0.532131    0.032131         0.032131    2      +2
0.610620    0.110620         0.110620    6      +6
The first column is the raw fractional intensity data; the second column is the deviation from the median, whose absolute value is shown in column 3 and ranked in column 4. The last column shows the signed rank. Note that in this case, w+ = 31, the sum of all the ranks carrying the plus sign, i.e., (5 + 1 + 9 + 8 + 2 + 6).
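Before turning to MINITAB, note that the same computation can be sketched in Python, assuming SciPy is available (its wilcoxon routine computes the exact null distribution for small samples without ties):

import numpy as np
from scipy.stats import wilcoxon

phi = np.array([0.585978, 0.504057, 0.182831, 0.426575, 0.455191,
                0.804720, 0.741598, 0.332909, 0.532131, 0.610620])

# One-sample signed rank test of H0: median = 0.5, applied to the deviations
res = wilcoxon(phi - 0.5, alternative='two-sided')
# res.statistic is min(W+, W-) = 24 here (so W+ = 55 - 24 = 31); the large
# p-value gives no evidence against the postulated median of 0.5
print(res.statistic, res.pvalue)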
When MINITAB is used to carry out this test, the required sequence is Stat > Nonparametrics > 1-Sample Wilcoxon; the result is shown below:

Wilcoxon Signed Rank Test: Phi
Test of median = 0.5000 versus median not = 0.5000

       N   N for Test
Phi   10           10
18.3 Two Populations

The general problem involves two separate and mutually independent populations, with respective unknown medians η1 and η2. As with the parametric tests, we are typically concerned with testing hypotheses about the difference between these two medians, i.e.,

δ = η1 − η2    (18.10)
Depending on the nature of the data sets, we can identify two categories: (i)
the special case of paired data; and (ii) the more general case of unpaired
samples.
18.3.1 Two-Sample Paired Test

In this case, once the differences between the pairs, Di = X1i − X2i, have been obtained, they can be treated exactly like the one-sample case presented above. The hypothesis will be tested on a postulated value for the median of Di, say δ0, which need not be zero. When δ0 = 0, the test reduces to a test of the equality of the two population medians. In any event, the one-sample tests discussed above, either the sign test or the Wilcoxon signed rank test (if the distribution of the differences Di can be considered as reasonably symmetric), can now be applied. No additional considerations are needed.
Thus, the two-sample paired test is exactly the same as the one-sample test when it is applied to the paired differences.
18.3.2 Mann-Whitney-Wilcoxon Test

For the general, unpaired case, the null hypothesis is

H0: η1 − η2 = δ0    (18.11)

where ηi is the median for distribution i. This is tested against the usual triplet of alternatives. For tests of equality, δ0 = 0.
Basic Test Characteristics
The test uses a random sample of size n1 from population 1, X11, X12, ..., X1n1, and another of size n2 from population 2, X21, X22, ..., X2n2, where n1 need not equal n2. First, the entire nT = n1 + n2 observations are combined and ranked in ascending order as though they were from the same population. (Identical observations are assigned equal ranks, determined as the average of the ranks that would have been assigned individually had the observations been different.)
Two related statistics can now be identified: W1, the sum of the ranks in sample 1, and W2, the equivalent sum for sample 2. Again, we note that

W1 + W2 = nT(nT + 1)/2    (18.12)
the sum of the first nT integers. And now, because this sum is fixed, observe that a small value for one (adjusted for the possibility that n1 ≠ n2) automatically implies a large value for the other, and hence a large difference between the two. Also note that the minimum attainable value for W1 is n1(n1 + 1)/2, and for W2, n2(n2 + 1)/2. The Mann-Whitney U test statistic used to determine significance levels is defined as follows:

U1 = W1 − n1(n1 + 1)/2
U2 = W2 − n2(n2 + 1)/2
U = min(U1, U2)    (18.13)

The original Wilcoxon test statistic is slightly different but leads to equivalent results; it derives from Eq (18.12), and makes use of the larger of the two rank sums, i.e.,

W1 = nT(nT + 1)/2 − W2    (18.14)

if W1 > W2, and reversed if W1 < W2.
The sampling distributions for U and for W1 (or W2) are all somewhat like that for W+ above; they can be determined computationally for any given n1 and n2 and used to compute significance levels. As such, the entire test procedure is best carried out with the computer.
We now use the next example to illustrate first the mechanics involved in carrying out this test, and then complete the test with MINITAB.
Example 18.3: TESTING FOR SAFETY PERFORMANCE IMPROVEMENT
Revisit the problem first presented in Example 14.3 regarding a company's attempt to improve its safety performance. Specifically, we recall that to improve its safety record, the company began tracking the time in between recordable safety incidents. During the first year of the program, the waiting time (in days) until the occurrence of a recordable safety incident was obtained as
S1 = {16, 1, 9, 34, 63, 44, 1, 63, 41, 29}
The data record for the second year of the program is
S2 = {35, 26, 16, 23, 54, 13, 100, 1, 30, 31}
On the basis of this data, which shows a mean time to incident of 30.1 days for the first year and 32.9 days for the second year, the company's safety coordinator is preparing to make the argument to upper management that there has been a noticeable improvement in the company's safety record from Year 1 to Year 2. Is there evidence to support this claim?
Solution:
First, from phenomenological reasoning, we expect the data to follow the exponential distribution. This was confirmed as reasonable in Chapter 17. When this fact is combined with a sample size that is relatively small, this becomes a clear case where the t-test is inapplicable. We could use distribution-appropriate interval estimation procedures to answer this question (as we did in Example 14.14 of Chapter 14). But such procedures are too specialized and involve custom computations.
Problems of this type are ideal for the Mann-Whitney-Wilcoxon test. A table of the data and a summary of the subsequent analysis required for this test is shown here.
Year 1, S1   Rank    Year 2, S2   Rank
16            6.5    35           14.0
1             2.0    26            9.0
9             4.0    16            6.5
34           13.0    23            8.0
63           18.5    54           17.0
44           16.0    13            5.0
1             2.0    100          20.0
63           18.5    1             2.0
41           15.0    30           11.0
29           10.0    31           12.0
First, note how the complete set of 20 total observations has been combined and ranked. The lowest entry, 1, occurs in 3 places: twice in sample 1, and once in sample 2. Each is assigned the rank of 2, which is the average of the ranks 1, 2 and 3. The other ties, two entries of 16 and two of 63, are given respective ranks of 6.5 and 18.5.
Next, the sum of ranks for sample 1 is obtained as w1 = 105.5; the corresponding sum for sample 2 is w2 = 104.5. Note that these sum up to 210, which is exactly nT(nT + 1)/2 with nT = 20; the larger w1 is used to determine the significance level. But even before the formal test is carried out, note the closeness of these two rank sums (the sample sizes are the same for each data set). This already gives us a clue that we are unlikely to find evidence of significant differences between the two medians.
To carry out the MWW test for this last example using MINITAB, the required sequence is Stat > Nonparametrics > Mann-Whitney; the ensuing results are shown below.

Mann-Whitney Test and CI: S1, S2

      N   Median
S1   10     31.5
S2   10     28.0

Point estimate for ETA1-ETA2 is -0.00
18.4 Probability Model Validation

When the hypothesis test to be performed involves not just the mean, median or variance of a distribution but the entire distribution itself, the truly distribution-free approach is the Kolmogorov-Smirnov (K-S) test. The critical values of the test statistic (that determine the rejection region) are entirely independent of the specific distribution being tested. This makes the K-S test truly nonparametric; it also makes it less sensitive, especially to discrepancies between the observed data and the tail area characteristics of many distributions. The Anderson-Darling (A-D) test, a modified version of the K-S test designed to improve the K-S test's sensitivity, achieves its improved sensitivity by making explicit use of the specific distribution to be tested in computing the critical values. This introduces the primary disadvantage that critical values must be custom-computed for each distribution. But what is lost in generality is more than made up for in overall improved performance. We now review, briefly, the key characteristics of these two tests.
18.4.1 The Kolmogorov-Smirnov Test

Given the sample data ordered from smallest to largest, x(1) ≤ x(2) ≤ ··· ≤ x(n), define the empirical cumulative distribution at each ordered observation as

Fn(x(i)) = n(i)/n    (18.15)

where n(i) is the number of observations less than or equal to x(i). Let F(x(i)) represent the theoretical cumulative probability P(X ≤ x(i)) computed for a specific observation, x(i), using the theoretical cdf postulated for the population from which the data is purported to have come. The K-S test is based on the difference between this theoretical cdf and the empirical cdf as in Eq (18.15). The test statistic is:

D = max over 1 ≤ i ≤ n of { i/n − F(x(i)),  F(x(i)) − (i − 1)/n }    (18.16)

As with other hypothesis tests, the null hypothesis is rejected at the significance level of α if D > Dα, where the critical value Dα is typically obtained from computations easily carried out in software packages.
Key Features
The primary distinguishing feature of the K-S test is that the sampling distribution of its test statistic, D, is entirely independent of the postulated distribution that is being tested. This makes the test truly "distribution-free." It is a nonparametric alternative to the chi-squared goodness-of-fit test discussed in Chapter 17. Unlike that test, which requires large sample sizes for the χ² distribution approximation for the test statistic to be valid, the K-S test is an exact test.
Its primary limitation is its restriction to continuous distributions that must be completely specified, i.e., the distribution parameters cannot be estimated from sample data. Also, because it is based on the single point with the "worst" distance between the theoretical and empirical distributions, the test is prone to ignoring less prominent but still important mismatches at the tails, in favor of more influential mismatches at the center of the distribution. The K-S test is therefore more likely to have less power than a test that employs a more broadly based test statistic that is evenly distributed over the entire variable space. This is the raison d'être for the Anderson-Darling test, which has all but supplanted the K-S test in many applications.
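A minimal Python sketch of the K-S test follows; note that, consistent with the limitation just described, the postulated distribution is passed with fixed parameter values rather than values estimated from the same data (the synthetic data set here is an assumption, chosen only for illustration):

import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)
x = rng.exponential(scale=30.0, size=50)   # synthetic data for illustration

# K-S test against a *completely specified* exponential cdf (loc=0, scale=30)
res = kstest(x, 'expon', args=(0, 30.0))
print(res.statistic, res.pvalue)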
18.4.2 The Anderson-Darling Test

The Anderson-Darling test¹ is a modification of the K-S test that (a) uses the entire cumulative distribution (not just the single worst point of departure from the empirical cdf), and hence, (b) gives more weight to the distribution tails than is possible with the K-S test.
The test statistic is:

A² = −n − (1/n) Σ (i = 1 to n) (2i − 1)[ln F(x(i)) + ln(1 − F(x(n+1−i)))]    (18.17)
However, the sampling distribution of A² depends on the postulated distribution function; critical values therefore must be computed individually for each distribution under consideration. Nevertheless, these critical values are

¹Anderson, T. W. and Darling, D. A. (1952). "Asymptotic theory of certain 'goodness-of-fit' criteria based on stochastic processes." Annals of Mathematical Statistics, 23: 193–212.
available for many important distributions, including (but not limited to) the normal, lognormal, exponential, and Weibull distributions. The test is usually carried out using statistical software packages.

Key Features
The primary distinguishing feature of the A-D test is that it is more sensitive than the K-S test; but for this advantage, the critical values for A² must be custom-calculated for each distribution. The test is applicable even with small sample sizes, n < 20. It is therefore generally considered a better alternative to the chi-squared and K-S goodness-of-fit tests.
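SciPy's anderson routine illustrates these points: it returns the A² statistic together with distribution-specific critical values (it supports only a handful of distributions, e.g. the normal and exponential). A minimal sketch, with synthetic data assumed purely for illustration:

import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=2.0, size=25)   # a small sample is fine for A-D

res = anderson(x, dist='norm')
print(res.statistic)            # the A^2 statistic
print(res.critical_values)      # custom critical values for the normal case
print(res.significance_level)   # corresponding significance levels (%)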
18.5 A Comprehensive Illustrative Example

18.5.1 Probability Model Postulate and Validation

First, from phenomenological considerations, these interspike intervals represent times to the occurrence of several Poisson events, suggesting that they might follow the gamma(α, β) distribution, where α will represent the effective number of Poisson events leading to each observed action potential, and β, the mean rate of occurrence of the Poisson events. A histogram of both data sets, with superimposed gamma distribution fits, is shown in Fig 18.1. Visually, the fits appear quite good, but we can make more quantitative statements by carrying out formal probability model validation.
Because the postulated gamma model is continuous, we choose the probability plot approach and the Anderson-Darling test. The null hypothesis is that both sets of ISI data follow the gamma distribution; the alternative is

²Braitenberg (1965). "What can be learned from spike interval histograms about synaptic mechanism?" J. Theor. Biol. 8, 419–425.
TABLE 18.2: Interspike intervals data for the pyramidal tract cell of a monkey: PT-W (awake) and PT-S (asleep), 100 observations each

PT-W:
25  28  53  49  52  94  56  59 107  52
35  23  48  41 100 113  81  52  50  27
70  76  52  33  68  69  40 108  65  47
45  34  97  73  97  16 105  66  74  60
93  61  70  57  74  64  31  85  39 103
71  92  39  59  49  51  75  94  73 125
71  47  69  95  60  77  54  43 140  54
42 100  56  33  41 108  65  65  79  46
74  35  55  51  50  75  47  79  63  95
37  61  60  48  44  71  49  51  64  71

PT-S:
74 221 228  71 145  73  94  80  78  79
132  73 129 119  79  89  66 209  84  86
110 122 119 119  91 135 137  80 116 119
55  88 143 121 130  68  97  66 111  85
136 105 175 157 133 103  89  94 141  95
132 125 173  44  99 150 236 157  49 162
139 152 202 151  56 105  60  96  81 178
73 145  98 102  75  94 146 179 195 156
192 143  78 199  98 138 187  63  86 104
116 133 137  89 116  85  90 222  81  63
[Figure: Histogram of PT-W with fitted gamma density (Shape 7.002, Scale 9.094, N 100); Histogram of PT-S with fitted gamma density (Shape 7.548, Scale 15.61, N 100)]

FIGURE 18.1: Histograms of interspike intervals data with Gamma model fit for the pyramidal tract cell of a monkey. Top panel: when awake (PT-W); Bottom panel: when asleep (PT-S). Note the similarities in the estimated values for the shape parameter α for both sets of data, and the difference between the estimates for β, the scale parameters.
that they do not. The results of this test (carried out in MINITAB) are shown in Fig 18.2. The p-value for the A-D test in each case is higher than the typical significance level of 0.05, leading us to conclude that it is reasonable to consider the data as coming from gamma-distributed populations. The probability plots also show, for both cases, the entire data sets falling within the 95% confidence intervals of the theoretical model line fits. The implication therefore is that we cannot use the two-sample t-test for this problem, since the Gaussian assumption is invalid for confirmed gamma-distributed data. An examination of the histograms in fact shows two skewed distributions that have essentially the same shape, but the one for PT-S appears shifted to the right. The recommendation therefore is to use the Mann-Whitney-Wilcoxon test.
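This validation step can also be reproduced outside MINITAB. The sketch below assumes SciPy version 1.10 or later (whose goodness_of_fit routine recomputes A-D critical values by Monte Carlo when parameters are estimated from the data) and that the PT-W data of Table 18.2 has been saved in a file; the file name is hypothetical.

import numpy as np
from scipy import stats

pt_w = np.loadtxt('pt_w.txt')   # hypothetical file holding the 100 PT-W values

# Fit the gamma model with the location fixed at zero
alpha, loc, beta = stats.gamma.fit(pt_w, floc=0)
print(alpha, beta)              # compare with MINITAB's Shape and Scale

# A-D goodness-of-fit with estimated shape and scale (Monte Carlo calibration)
res = stats.goodness_of_fit(stats.gamma, pt_w, statistic='ad',
                            known_params={'loc': 0})
print(res.statistic, res.pvalue)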
18.5.2 Mann-Whitney-Wilcoxon Test

Putting the data into two columns PT-W and PT-S in a MINITAB worksheet, and carrying out the test as illustrated earlier, yields the following results:

Mann-Whitney Test and CI: PT-W, PT-S

         N   Median
PT-W   100    60.01
PT-S   100   110.56
[Figure: Gamma probability plot of PT-W (Shape 7.002, Scale 9.094, N 100, AD 0.255, P-Value > 0.250); Gamma probability plot of PT-S (Shape 7.548, Scale 15.61, N 100, AD 0.477, P-Value 0.245)]

FIGURE 18.2: Probability plot of interspike intervals data with postulated Gamma model and Anderson-Darling test for the pyramidal tract cell of a monkey. Top panel: when awake (PT-W); Bottom panel: when asleep (PT-S). The p-values for the A-D tests indicate no evidence to reject the null hypothesis.
18.6 Summary and Conclusions
We have presented in this chapter some of the important nonparametric alternatives to consider when the assumptions underlying standard (parametric) hypothesis tests are not valid. These nonparametric techniques are applicable generally because they impose few, if any, restrictions on the data. They are known as distribution-free methods because they do not rely on any specific distributional characterization for the underlying population. As a result, the primary variable of nonparametric statistical analysis is the rank sum, and for good reason: regardless of the underlying population from which they arose, all data, even qualitative data (so long as it is ordinal), can be ranked, and the appropriate sums of such ranks provide valuable information about how the data set is distributed, without having to assume a specific functional form for the population's distribution.
As summarized in Table 18.3, we have focussed specifically on:
• The sign test: for comparing a single continuous population median, η, to a postulated value, η0. A nonparametric alternative to the one-sample z- or t-test, it is based on the sign of the deviation of sample data from the postulated median. The test statistic, T+, the total number of plus signs, is a binomially distributed Bi(n, 0.5) random variable if the null hypothesis is true.
• The Wilcoxon signed rank test: a more powerful version of the sign test because it takes both magnitude and sign of the deviation from the postulated median into consideration, but it is restricted to symmetric distributions. The test statistic, W+, is the positive rank sum; its sampling distribution is best obtained numerically.
REVIEW QUESTIONS
1. What is the objective of this chapter?
2. Why are the techniques discussed in this chapter known as distribution-free methods?
3. What are the two broad classes of problems for which classical hypothesis tests
of Chapter 15 are not applicable?
4. What are some of the advantages and disadvantages of non-parametric techniques?
5. What is the one-sample sign test?
6. What does it mean that a tie has occurred in a sign test, and what is done
under such circumstances?
7. What is the test statistic used for the one-sample sign test, and what is its sampling distribution?
8. What is the large sample limiting test statistic for the one-sample sign test?
³Gibbons, J. D. and S. Chakraborti (2003). Nonparametric Statistical Inference, 4th Ed., CRC Press.
TABLE 18.3: Summary of the nonparametric tests and their parametric alternatives

Test                | Population Parameter (Null Hypothesis, H0) | Restrictions                        | Test Statistic                                              | Parametric Alternative
Sign test           | Median η; (H0: η = η0)                     | None                                | T+, total # of positive deviations from η0                  | One-sample z-test; One-sample t-test
Wilcoxon signed     | Median η; (H0: η = η0)                     | Continuous, symmetric distributions | W+, positive rank sum                                       | Same as above
rank (WSR) test     |                                            |                                     |                                                             |
Sign test or        | δ = η1 − η2; (H0: δ = δ0) (Paired)         | Same as above                       | Same as above                                               | Same as above
WSR test            |                                            |                                     |                                                             |
Mann-Whitney-       | δ = η1 − η2; (H0: δ = δ0) (General)        | Independent distributions           | W1, rank sum 1; W2, rank sum 2; U = f(W1, W2); see Eq (18.13) | Two-sample t-test
Wilcoxon (MWW) test |                                            |                                     |                                                             |
27. What are the null and alternative hypotheses in the Kolmogorov-Smirnov test?
28. What is the primary distinguishing feature of the sampling distribution of the
Kolmogorov-Smirnov test statistic?
29. The Kolmogorov-Smirnov test is a non-parametric alternative to which parametric test?
30. What are two primary limitations of the Kolmogorov-Smirnov test?
31. How is the Anderson-Darling test related to the Kolmogorov-Smirnov test?
32. The Anderson-Darling test is generally considered to be a better alternative to
which tests?
EXERCISES
18.1 In the table below, X1 is a random sample of 20 observations from an exponential population with parameter β = 1.44, so that the median is η = 1. X2 is the same data set plus a constant, 0.6, and random Gaussian noise with mean μ = 0 and standard deviation σ = 0.15.
X1        X2        X1        X2
1.26968   1.91282   1.52232   2.17989
0.28875   1.13591   1.45313   2.11117
0.07812   0.72515   0.65984   1.45181
0.45664   1.19141   1.60555   2.45986
0.68026   1.34322   0.08525   0.43390
2.64165   3.18219   0.03254   0.76736
0.21319   0.88740   0.75033   1.16390
2.11448   2.68491   1.34203   2.01198
1.43462   2.16498   1.25397   1.80569
2.29095   2.84725   3.16319   3.77947
(i) Consider the postulate that the median for both data sets is η0 = 1 (which, of course, is not true). Generate a table of signs, Dmi, of deviations from this postulated median.
(ii) For each data set, determine the test statistic, T+. What percentage of the observations in each data set has plus signs? Informally, what does this indicate about the possibility that the true median in each case is as postulated?
18.2 Refer to Exercise 18.1 and the supplied data.
(i) For each data set, X1 and X2, carry out a formal sign test of the hypothesis H0: η = 1 against Ha: η ≠ 1. Interpret your results.
(ii) Now use only the first 10 observations (on the left) in each data set. Repeat (i) above. Interpret your results. Since we know that the two data sets are different, and that the median for X2 is higher by 0.6 (on average), comment on the effect of
XL2: 1.61889, 1.15897, 1.17163, 1.09065, 1.27686, 0.91838, 1.45123, 1.47800, 2.16068, 1.46116
18.5 Refer to Exercise 18.4.
(i) Carry out a MWW test on the equality of the medians of XL1 and XL2, versus the alternative that the medians are different. Interpret your result.
(ii) Carry out a two-sample z-test concerning the equality of the means of the log-transformed variables Y1 and Y2, against the alternative that the means are different. Interpret your results.
(iii) Comment on any differences observed between these two sets of results.
18.6 The data in the table below are from two populations that may or may not be the same.

XU: 0.65, 2.01, 1.80, 1.13, 1.74, 1.36, 1.55, 1.05, 1.55, 1.63
YU: 1.01, 1.75, 1.27, 2.48, 2.91, 2.38, 2.79, 1.94

(i) Carry out a two-sample t-test to compare the population means. What assumptions are required for this to be a valid test? At the α = 0.05 significance level, what does this result imply about the null hypothesis?
(ii) Carry out a MWW test to determine if the two populations have the same medians or not. At the α = 0.05 significance level, what does this result imply about the null hypothesis?
(iii) It turns out that the data were generated from two distinct uniform distributions, where the Y distribution is slightly different. Which test, the parametric or the nonparametric, is more effective in this case? Offer some reasons for the observed performance of one test versus the other.
(iv) In light of what was specified about the two populations in (iii), and the p-values associated with each test, comment on the use of α = 0.05 as an absolute arbiter of significance.
18.7 In an opinion survey on a particular political referendum, fifteen randomly sampled individuals were asked to give their individual opinions using the following options:
1 = Strongly agree; 2 = Agree; 3 = Indifferent; 4 = Disagree; 5 = Strongly disagree.
The result is the data set, S15.
S15 = {1, 4, 5, 3, 2, 3, 3, 2, 2, 2, 1, 3, 3, 3, 3}
Carry out a one-sample sign test to confirm or refute the allegation that the population from which the 15 respondents were sampled is, at the median, indifferent to the referendum in question.
18.8 The following is a set of residuals from a two-parameter regression model:

0.97  0.68  0.12  0.04  0.27
0.33  0.43  0.04  0.17  0.34
0.10  0.85  0.58  0.12  0.96
0.14  0.04  0.49  0.54  0.09
(i) Carry out a K-S test of normality and compare it with an A-D test of normality. What are the associated p-values, and what do these imply about the normality of these residuals?
(ii) It turns out that the residuals shown above had been "filtered" by removing four observations that appeared to be outliers: {1.30, 1.75, 2.10, 1.55}. Reinstate these residuals and repeat (i). Does this change any of the conclusions about the normality of the residuals?
18.9 The following data is from a population that may or may not be normally distributed.

0.88  0.74  2.06  0.30  6.86
1.06  1.42  3.08  2.42  1.12
0.29  0.63  0.74  0.58  0.91
0.48  2.30  3.15  0.32  6.70

Carry out a K-S test of normality and compare the results with an A-D test of normality. At the α = 0.05 significance level, what do each of these tests imply about the normality of the data set? Examine the data set carefully and comment on which of these tests is more likely to lead to the correct decision.
APPLICATION PROBLEMS
18.10 In Problem 15.59, hypothesis tests were devised to ascertain whether or not, out of three operators, A, B, and C, working in a toll manufacturing facility, operator A was deemed to be more safety-conscious. The data below, showing the time in months between occurrences of safety violations for each operator, was to be used to test these hypotheses.

A      B      C
1.31   1.94   0.79
0.15   3.21   1.22
3.02   2.91   0.65
3.17   1.66   3.90
4.84   1.51   0.18
0.71   0.30   0.57
0.70   0.05   7.26
1.41   1.62   0.43
2.68   6.75   0.96
0.68   1.29   3.76
(i) Carry out a two-sample t-test comparing the means for operator A to that for operator B, and a second two-sample t-test comparing the means for operator A to that for operator C. Are these valid tests?
(ii) Carry out a MWW test to compare the safety performance of operator A to that for operator B, and a second MWW test to compare operator A to operator C. Are these valid tests? Interpret your results and compare them to the results in (i). What conclusions can you reach about these operators, and how safety-conscious are B and C in comparison to A?
18.12 The Philadelphia Eagles, like every other team in the National Football League (NFL) in the US, play 8 games at home and 8 games away in each 16-game season. The table below shows the total number of points scored by this NFL team at home and away during the 2008/2009 season.

Total Points Scored, 2008/2009 Season
Home  38 15 17 27 31 48 30 44
Away  37 20 40 26 13  7 20  3

(i) Generate side-by-side box plots of the two data sets and comment on what these plots show about the potential difference between the number of points scored at home and away.
(ii) Use the two-sample t-test to compare the mean offensive productivity at home versus away. What assumptions are necessary for this to be a valid test? Are these assumptions reasonable? Interpret your result.
(iii) Next, carry out a MWW test. Interpret your result.
(iv) Allowing for the fact that there are only 8 observations for each category, discuss your personal opinion of what the data set implies about the difference in offensive output at home versus away, vis-à-vis the results of the formal tests.
18.13 Refer to Problem 18.12. This time, the table below shows the point differential (points scored by the Philadelphia Eagles minus points scored by the opponent) at home and away, during the 2008/2009 season. Some consider this metric a better measure of ultimate team performance (obviously, a negative point differential corresponds to a loss, a positive differential a win, and a zero, a tie).

Home  38
Away   7

(i) Generate side-by-side box plots of the two data sets and comment on what this plot shows about the potential difference between the team's performance at home and away.
(ii) Carry out a two-sample t-test to compare the team's performance at home versus away. What assumptions are necessary for this to be a valid test? Are these assumptions reasonable? Interpret your results.
(iii) Next, carry out a MWW test. Interpret your result.
(iv) Again, allowing for the fact that there are only 8 observations in each case, discuss your personal opinion about the difference in team performance at home versus away vis-à-vis the results of the formal tests.
18.14 The table below shows the result of a market survey involving 15 women and 15 men who were asked to taste a name-brand diet cola drink and compare the taste to that of a generic brand that is cheaper but whose taste, as claimed by the manufacturer, is preferred by a majority of tasters. The options given to the participants are as follows: 1 = Strongly prefer generic cola; 2 = Prefer generic cola; 3 = Indifferent; 4 = Prefer name-brand cola; 5 = Strongly prefer name-brand cola.
Perform appropriate tests to validate the claims that (i) men are mostly indifferent, showing no preference one way or another (which is a positive result for the generic cola manufacturer); and (ii) there is no difference between women and men in their preferences for the diet cola brands. Is there evidence in the data to support one, both, or none of these claims?

Women  4 3 5 5 2 3 1 4 4 5 4 3 4 3 4
Men    1 3 3 2 4 5 1 2 4 3 3 2 1 3 3
18.15 Random samples of size 10 each are taken from large groups of trainees instructed by Method A and Method B, and each trainee's score on an appropriate achievement test is shown below.

Method A  71 75 65 69 73 66 68 71 74 68
Method B  72 77 84 78 69 70 77 73 65 75

In Example 15.6, the data sets were assumed to come from normal populations, and a two-sample t-test was conducted to test the claim that Method B is more efficient. As an alternative to that parametric test, carry out a corresponding MWW test. Interpret your result and compare it to that in Example 15.6. Is there a difference in these results? Comment on which test you would consider more reliable and why.
18.16 Tanaka-Yamawaki (2003)⁴ presented models of high-resolution financial time series which showed, among other things, that price fluctuations tend to follow the Cauchy distribution, not the Gaussian distribution as usually presumed. The following table shows a particular sequence of similar price fluctuations.

⁴Mieko Tanaka-Yamawaki (2003). "Two-phase oscillatory patterns in a positive feedback agent model." Physica A 324, 380–387.
0.003322  0.000856  0.010382  0.011494  0.012165
0.000637  0.002606  0.061254  0.004949  0.008889
0.003569  0.003797  0.261127  0.005694  0.023339
0.032565  0.001522  0.032485  0.034964  0.009220
(i) Obtain a histogram and a box plot of this data. Interpret these plots vis-à-vis the usual normality assumption for data of this sort.
(ii) Carry out a K-S test and also an A-D test of normality on this data. What can you conclude about the normality of this data set?
(iii) The data entry 0.261127, although real, is what most might refer to as an "outlier," which will then be removed. Remove this data point and repeat (i) and (ii). Comment on the influence, if any, of this point on your analysis.
18.17 The number of accidents occurring per quarter (three months) at a DuPont company facility, over a 10-year period, is shown in the table below⁵, partitioned into two periods: Period I for the first five-year period of the study; Period II, the second five-year period.

Period I:
5   5  10   8
4   5   7   3
2   8   6   9
5   6   5  10
6   3   3  10

Period II:
3   4   2   0
1   3   2   2
7   7   1   4
1   2   2   1
4   4   4   4
Total # of Breakdowns
Brand A  11  9 12 14 13 12 11
Brand B   9 12 10  8  4 10  8
18.19 In Problem 15.52 (see also Problems 1.13 and 14.42), the data in the following table was analyzed to test a hypothesis about the mean time-to-publication for papers sent to a particular leading chemical engineering research journal. The data shows the time (in months) from receipt to publication of 85 papers published in the January 2004 issue of this journal.
(i) On phenomenological grounds, and from past experience, a gamma distribution has been postulated for the population from which the data was sampled. The population parameters are unavailable. Carry out an A-D test to validate this probability model postulate.
(ii) In recognition of the fact that the underlying distribution is obviously skewed, the Editor-in-Chief has modified his hypothesis about the characteristic time-to-publication, and now proposes that the median time-to-publication is 8 months. Carry out an appropriate test to assess the validity of this statement against the alternative that the median time-to-publication is not 8 months. Interpret your results.
(iii) Repeat the test in (ii), this time against the alternative that the median time-to-publication is longer than 8 months. Reconcile this result with the one in (ii).
19.2   9.0  17.2   8.2   4.5  13.5  20.7   7.9  19.5   8.8
18.7   7.4   9.7  13.7   8.1   8.4  10.8  15.1   5.3  12.0
 3.0  18.5   5.8   6.8  14.5   3.3  11.1  16.4   7.3   7.4
 7.3   5.2  10.2   3.1   9.6  12.9  17.3   6.0  24.3  21.3
19.3   2.5   9.1   8.1   9.8  15.4  15.7   8.2   8.8   7.2
12.8   4.2   4.2   7.8   9.5   3.9   8.7   5.9   5.3   1.8
10.1  10.0  18.7   5.6   3.3   7.3  11.3   2.9   5.4  15.2
 8.0  11.7  17.2   4.0   3.8   7.4   5.3  10.6  15.2  11.5
 5.9  20.1  12.2  12.0   8.8
18.20 Many intrinsic characteristics of single crystals affect the intensities of X-rays diffracted from them in different directions. The statistical distribution of these intensity measurements therefore provides a means by which to characterize these crystals. Unfortunately, because of their distinctive "heavy tails," these distributions are not adequately described by traditional default Gaussian distributions, which is why Cauchy distributions have been finding increasing application in crystallographic statistics. (See for example Mitra and Sabita (1989)⁶ and (1992)⁷.)
The table below is extracted from a larger sample of X-ray intensity measurements (arbitrary units) for two crystals that are very similar in structure: one, labeled A, is natural; the other, B, synthetic. The synthetic crystal is being touted as a replacement for the natural one.
X-ray intensities (AU)
XRDA      XRDB
104.653   132.973
106.115   114.505
104.716   115.735
104.040   114.209
105.631   114.440
104.825   100.344
117.075   116.067
105.143   114.786
105.417   115.015
98.574    99.537
105.327   115.025
104.877   115.120
105.637   107.612
105.305   116.595
104.291   114.828
100.873   114.643
106.760   113.945
105.594   113.974
105.600   113.898
105.211   125.926
105.559   114.952
105.583   117.101
105.357   113.825
104.530   117.748
101.097   116.669
105.381   114.547
104.528   113.829
105.699   115.264
105.291   116.897
105.460   113.026
(i) Carry out A-D normality tests on each data set and confirm that these distributions, even though apparently symmetric, are not Gaussian. Their heavy tails imply that the Cauchy distribution may be more appropriate.
(ii) The signature characteristic of the Cauchy distribution, that none of its moments exists, has the serious implication for statistical inference that, unlike other non-pathological distributions for which the sampling distribution of the mean narrows with increasing sample size, the distribution of the mean of a Cauchy sample is

⁶Mitra, G. B. and D. Sabita (1989). "Cauchy Distribution, Intensity Statistics and Phases of Reflections from Crystal Planes." Acta Crystallographica A 45, 314–319.
⁷Mitra, G. B. and D. Sabita (1992). "Cauchy distribution of X-ray intensities: a note on hypercentric probability distributions of normalized structure factors." Indian Journal of Physics 66 A (3), 375–378.
precisely the same as the original "mother" distribution. As a result, none of the standard parametric tests can be used for samples from Cauchy (and Cauchy-like) distributions.
From the sample data provided here, carry out an appropriate nonparametric test to ascertain the validity of the proposition that the synthetic crystal B is "the same" as the natural crystal A, strictly on the basis of the X-ray intensity measurements.
Chapter 19
Design of Experiments
Every experimental investigation presupposes that the sought-after information is contained in the acquired data sets. Objectively, however, such presupposition is often unjustifiable. Without giving purposeful, deliberate and careful consideration to data collection, the information content of the acquired data set cannot be guaranteed. Because experimental data will always be finite samples drawn from the much larger population of interest, conclusions drawn about the population will be valid only if the sample is representative. And the sample will be representative only if it encapsulates relevant population information appropriately. These issues have serious consequences. If the sought-after information is not contained in the data, no analysis technique, no matter how sophisticated, will be able to extract it. Therefore, how data sets are obtained will affect not only the information they contain, but also our ability to extract and use this information.
This chapter focusses on how experiments can be designed to ensure that acquired data sets are as informative as possible, and on the analysis procedures that are jointly calibrated with the experimental designs to facilitate the extraction of such information. In recognition of the fact that several excellent book-length treatises exist on the subject matter of design of experiments, this chapter is designed to be only an introduction to some of the most commonly applied techniques, emphasizing principles and, where possible, illustrating practice with relevant examples. Because of the role now played by computer software, our discussion deemphasizes the old-fashioned mechanics of manual computations. This allows us to focus on the essentials of experimental designs and on interpreting the results produced by computer programs.
19.1 Introductory Concepts

19.1.1 Experimental Studies and Design
because, associated with every observation, yi, is an unavoidable and unpredictable fluctuation, εi, masking the true value, η; i.e.,

yi = η + εi    (19.1)

As far as the first task is concerned, the key consequence of Eq (19.1) is that for acquired data to be informative, one must find a way to maximize the effect of the independent variables on y and minimize the influence of the random component. With the second task, the repercussion of Eq (19.1) is that efficient data analysis requires using appropriate concepts of probability and statistics presented in earlier chapters. Most revealing, however, is how much Task 1 influences the success of experimental studies: information not contained in the data cannot be extracted even by the most sophisticated analysis.
Statistical Design of Experiments enables efficient conduct of experimental studies by providing a formal strategy of experimentation and the corresponding analysis technique, jointly calibrated for optimal extraction of information in the face of unavoidable random variability. This is especially important when resources, financial and otherwise, are limited. The techniques to be discussed in this chapter allow the acquisition of richly informative data with the fewest possible experimental runs. Depending on the goals of the investigation (whether it is screening for which independent variables matter the most, developing a quantitative model, finding optimum operating conditions, confirming model predictions, etc.), there are appropriate experimental designs specifically calibrated to the task at hand.
19.1.2 Phases of Efficient Experimental Studies
Experimental studies that deliver the most benefit for the expended effort are carried out in distinct and identifiable phases:
1. Planning: where the scope and goals of the experimental study are clearly defined;
2. Design: where the strategy of experimentation that best fits the goals is determined, and the explicit procedure for data gathering is specified;
3. Implementation: which involves mostly disciplined execution of the design strategy and the careful collection of the data;
4. Analysis: where the appropriate statistics are computed and the results are interpreted.
This chapter is concerned primarily with design and analysis, with the planning and implementation phases receiving only brief attention.

19.1.3 Problem Definition and Terminology
The general problem at hand involves testing for the existence of real effects of k independent variables, x1, x2, ..., xk, on possibly several dependent variables. Here are some examples: investigations into the effect of pH (x1), salt concentration (x2), salt type (x3), and temperature (x4), on Y, the solubility of a particular globular protein; or whether or not different kinds of fertilizers (x1: Type I or II), farm location (x2: A, B or C), and grades of seed (x3: Premium, P; Regular, R; or Hybrid, H) show any significant effects on Y, the yield (in bushels per acre) of soybeans.
The following terminology is standard.
1. Response, Y: the dependent variable; the objective of the experiment is to measure this response and determine what contributes to its observed value;
2. Factors: the independent variables whose effects on the response are to be determined;
3. Level: the (possibly qualitative) value of the factor employed in the experimental determination of a response; e.g., fertilizer type above has two levels, I and II; farm location has 3 levels, A, B or C; seed grade has 3 levels, P, R, or H.
4. Treatment: the various factor-level combinations employed in the experiment; e.g., in the example given above, "Fertilizer I, Farm Location A, Seed Grade P" constitutes one treatment. This example has 2 × 3 × 3 = 18 total possible treatments. For a single-factor experiment, the levels of this factor are the same as the treatments, by default.
19.2 Analysis of Variance

The principles behind ANOVA lie at the heart of all the computations discussed in the remainder of this chapter.
19.3 Single-Factor Experiments

19.3.1 One-Way Classification
The model for the data from such a single-factor experiment is

Yij = μj + εij    (19.2)

and the hypotheses to be tested are

H0: μ1 = μ2 = ... = μk
Ha: μℓ ≠ μm for some ℓ and m    (19.3)

If each treatment mean is written in terms of a grand mean μ and a treatment effect τj,

μj = μ + τj    (19.4)

with the constraint

Σ (j = 1 to k) τj = 0    (19.5)

the model becomes

Yij = μ + τj + εij    (19.6)

and the hypotheses can be restated as

H0: τ1 = τ2 = ... = τk = 0
Ha: τℓ ≠ 0 for at least one ℓ    (19.7)
TABLE 19.1: Data layout for a typical single-factor experiment

Treatment (Factor level):   1      2      3     ...    j     ...    k
                           y11    y12    y13    ...   y1j    ...   y1k
                           y21    y22    y23    ...   y2j    ...   y2k
                           ...    ...    ...    ...   ...    ...   ...
                           yn1,1  yn2,2  yn3,3  ...   ynj,j  ...   ynk,k
Total:                      T1     T2     T3    ...    Tj    ...    Tk
Means:                     ȳ.1    ȳ.2    ȳ.3    ...   ȳ.j    ...   ȳ.k
and distributed evenly among the observations, and not allowed to propagate systematically.
Each treatment is repeated nj times, with the repeated observations known as replicates. The design is said to be balanced if n1 = n2 = ··· = nj = ··· = nk; otherwise, it is unbalanced. It can be shown that for a fixed total number of experimental observations, N = n1 + n2 + ··· + nj + ··· + nk, the power of the hypothesis test is maximized for the balanced design.
Analysis
We begin with the following definitions of some averages: the treatment average,

Ȳ.j = (1/nj) Σ (i = 1 to nj) Yij    (19.8)

and the grand average,

Ȳ.. = (1/N) Σ (j = 1 to k) Σ (i = 1 to nj) Yij;  N = Σ (j = 1 to k) nj    (19.9)
Analyzing the data set in Table 19.1 is predicated on the following data decomposition:

Yij = Ȳ.. + (Ȳ.j − Ȳ..) + (Yij − Ȳ.j)    (19.10)

The first term, Ȳ.., is the grand average; the next term, (Ȳ.j − Ȳ..), is the deviation from the grand average due to any treatment effect; and the final term is the purely random, within-treatment deviation. (Compare with Eq (19.6).) This expression in Eq (19.10) is easily rearranged to yield:

(Yij − Ȳ..) = (Ȳ.j − Ȳ..) + (Yij − Ȳ.j)
EY = ET + EE    (19.11)
Squaring both sides and summing over all observations (the cross term vanishes because the component vectors ET and EE are orthogonal) gives the ANOVA identity:

Σ (j = 1 to k) Σ (i = 1 to nj) (Yij − Ȳ..)² = Σ (j = 1 to k) nj(Ȳ.j − Ȳ..)² + Σ (j = 1 to k) Σ (i = 1 to nj) (Yij − Ȳ.j)²

i.e.,

SSY = SST + SSE    (19.12)

where

SST = Σ (j = 1 to k) nj(Ȳ.j − Ȳ..)²;  SSE = Σ (j = 1 to k) Σ (i = 1 to nj) (Yij − Ȳ.j)²    (19.13)

Whether or not H0 is true, it can be shown that

E(SSE) = (N − k)σ²    (19.14)

and

E(SST) = (k − 1)σ² + Σ (j = 1 to k) nj τj²    (19.15)

with the following very important consequence: if H0 is true, then from these two equations, the following mean error sums of squares provide two independent estimates of σ²:

MSE = SSE / (N − k)    (19.16)
MST = SST / (k − 1)    (19.17)
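The orthogonal decomposition behind Eq (19.12) is easy to verify numerically; the following is a minimal Python sketch with synthetic balanced data (the data itself is an assumption, used only to demonstrate the identity):

import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 4                          # n replicates for each of k treatments
y = rng.normal(10.0, 2.0, size=(n, k))
y[:, 1] += 3.0                       # introduce a real treatment effect

grand = y.mean()                     # grand average
treat = y.mean(axis=0)               # treatment averages

ss_y = ((y - grand) ** 2).sum()           # SSY: total variability
ss_t = n * ((treat - grand) ** 2).sum()   # SST: between-treatment
ss_e = ((y - treat) ** 2).sum()           # SSE: within-treatment (pure error)
print(np.isclose(ss_y, ss_t + ss_e))      # True: SSY = SST + SSE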
TABLE 19.2: One-way classification ANOVA table

Source of Variation         Degrees of Freedom   Sum of Squares   Mean Square   F-Ratio
Between Treatments          k − 1                SST              MST           MST/MSE
Within Treatments (Error)   N − k                SSE              MSE
Total                       N − 1                SSY

The test statistic is therefore the ratio

F = MST / MSE    (19.18)

which, if H0 is true, follows an F(k − 1, N − k) distribution.
Example 19.1: The data in the following table show the number of raisins dispensed by each of five machines, with six observations per machine. Determine whether or not there is a systematic difference in how the machines dispense raisins.

Machine 1: 27, 21, 24, 15, 33, 23
Machine 2: 17, 12, 14, 7, 14, 16
Machine 3: 13, 7, 11, 9, 12, 18
Machine 4: 7, 4, 7, 7, 12, 18
Machine 5: 15, 19, 19, 24, 10, 20
Solution:
To use MINITAB, we begin by entering the provided data into a MINITAB worksheet, RAISINS.MTW; the sequence for carrying out the required analysis is Stat > ANOVA > One-Way (Unstacked). The reason for selecting the "Unstacked" option is that, as presented in this
[Figure: side-by-side box plots of the data for Machine 1 through Machine 5; vertical axis: Data]

FIGURE 19.2: Boxplot of raisins data showing what the ANOVA analysis has confirmed: that there is a significant difference in how the machines dispense raisins.
table, the data for each machine is in a different column. This is in contrast to "stacking" the data for all machines in a single column to which is attached, in another column, the numbers 1, 2, etc., as identifiers for the machine associated with the indicated data.
MINITAB provides several graphical options, including box plots for the data, and probability plots for assessing the normality assumption for the residuals. The MINITAB results are summarized below, beginning with the ANOVA table.
Results for: RAISINS.MTW
One-way ANOVA: Machine 1, Machine 2, Machine 3, Machine 4, Machine 5

Source  DF      SS     MS     F      P
Factor   4   803.0  200.7  9.01  0.000
Error   25   557.2   22.3
Total   29  1360.2

S = 4.721   R-Sq = 59.04%   R-Sq(adj) = 52.48%
The specific value of the F-statistic, 9.01, indicates that the larger of the two independent estimates of the variance, MST, the treatment mean square, is nine times as large as the pure error estimate. It is therefore not surprising that the associated p-value is 0 to three decimal places. Therefore we must reject the null hypothesis (at the significance level of α = 0.05) and conclude that there is a systematic difference in how each machine dispenses raisins. A box plot of the data is shown in Fig 19.2 for each machine where, visually, we see that Machines 1 and 5 appear to dispense more raisins than the others.
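The same one-way analysis can be reproduced with SciPy's f_oneway routine, using the machine data from the table above; a minimal sketch:

from scipy.stats import f_oneway

m1 = [27, 21, 24, 15, 33, 23]
m2 = [17, 12, 14, 7, 14, 16]
m3 = [13, 7, 11, 9, 12, 18]
m4 = [7, 4, 7, 7, 12, 18]
m5 = [15, 19, 19, 24, 10, 20]

f, p = f_oneway(m1, m2, m3, m4, m5)
print(round(f, 2), p)   # F = 9.01, p ~ 0.0001, as in the ANOVA table above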
Let us now recall the postulated model for the single-factor experiment in Eq (19.2); treated like a regression model in which 5 parameters, μj; j = 1, 2, ..., 5, are estimated from the supplied data (representing the mean number of raisins dispensed by Machine j), MINITAB provides values for the estimated pure error standard deviation, S = 4.72, as well as R² and R²adj values. These have precisely the same meaning as in the regular regression problems discussed in Chapter 16. In addition, the validity of the normality assumptions can be assessed by examining the residuals, εij. The normal probability plots for the estimated residuals are shown in Fig 19.3. The top plot is the typical plot obtained directly from the ANOVA dialog in MINITAB; it shows only the residuals and the best normal distribution fit. A visual assessment indicates that the normality assumption seems to be valid. However, for a more rigorous assessment, MINITAB also provides the option of saving the residuals for further analysis. If this is done, and the rigorous probability model goodness-of-fit test is carried out in conjunction with the probability plot, the result is shown in the bottom panel. The p-value associated with the A-D test is quite high (0.81), leading us to conclude that the normality assumption indeed appears to be adequate.
[Figure: two normal probability plots of the residuals; the goodness-of-fit panel reports Mean ≈ 0 (−1.18424E-16), StDev 4.383, N 30, AD 0.223, P-Value 0.810]

FIGURE 19.3: Normal probability plots of the residuals from the one-way classification ANOVA model in Example 19.1. Top panel: plot obtained directly from the ANOVA analysis, which does not provide any test statistic or significance level; Bottom panel: subsequent goodness-of-fit test carried out on saved residuals; note the high p-value associated with the A-D test.
19.3.2 Kruskal-Wallis Test

19.3.3 Two-Way Classification
TABLE 19.3: Data layout for a single-factor, randomized complete block experiment

           Treatment
Block      1     2     3    ...   j    ...   k    | Means
1         y11   y12   y13   ...  y1j   ...  y1k   | ȳ1.
2         y21   y22   y23   ...  y2j   ...  y2k   | ȳ2.
...       ...   ...   ...   ...  ...   ...  ...   | ...
r         yr1   yr2   yr3   ...  yrj   ...  yrk   | ȳr.
Total      T1    T2    T3   ...   Tj   ...   Tk   |
Means     ȳ.1   ȳ.2   ȳ.3   ...  ȳ.j   ...  ȳ.k   | ȳ..
The model for this experiment is

Yij = μ + τj + βi + εij    (19.19)

where βi is the block effect, subject to the constraints

Σ (j = 1 to k) τj = 0;  Σ (i = 1 to r) βi = 0    (19.20)

Because we are primarily concerned with identifying the presence of a treatment effect, the associated hypotheses are as before in Eq (19.7):

H0: τ1 = τ2 = ... = τk = 0
Ha: τℓ ≠ 0 for at least one ℓ
The corresponding data decomposition is now

(Yij − Ȳ..) = (Ȳ.j − Ȳ..) + (Ȳi. − Ȳ..) + (Yij − Ȳ.j − Ȳi. + Ȳ..)

i.e.,

EY = ET + EB + EE    (19.21)

where the vector EY of the data deviation from the grand average is decomposed orthogonally into three components: the vectors ET, EB and EE, the components due, respectively, to the treatment effect, the block effect, and random error, as illustrated in Fig 19.4. The sum of squares, SSY, the total variability in the data, is thus decomposed into SST, the component due to treatment-to-treatment variability; SSB, the component due to block-to-block variability; and SSE, the component due to pure error. This is a direct extension of the ANOVA identity shown earlier.
It is important to note that with this strategy, the SSB component has been separated out from the desired SST component. And now, whether or not H0 is true, it can be shown that:

E(SSE) = (k − 1)(r − 1)σ²    (19.22)
E(SST) = (k − 1)σ² + r Σ (j = 1 to k) τj²    (19.23)
E(SSB) = (r − 1)σ² + k Σ (i = 1 to r) βi²    (19.24)

so that if H0 is true, then from these equations, the following mean error sums
of squares can be computed:

MSE = SSE / [(k − 1)(r − 1)]    (19.25)
MST = SST / (k − 1)    (19.26)
MSB = SSB / (r − 1)    (19.27)

and the significance of the treatment effect tested with the ratio

F = MST / MSE    (19.28)

The results are summarized in the two-way classification ANOVA table:

TABLE 19.4: Two-way classification ANOVA table

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F-Ratio
Between Treatments    k − 1                SST              MST           MST/MSE
Blocks                r − 1                SSB              MSB           MSB/MSE
Error                 (k − 1)(r − 1)       SSE              MSE
Total                 N − 1                SSY
is Car Wheel and (v) the number of blocks (or levels) is 4, specically Wheel 1: Front Left; Wheel 2: Front Right; Wheel 3: Rear Left;
and Wheel 4: Rear Right. The data layout is shown in the table below.
Determine whether or not the dierent tire brands wear dierently.
              Factor (Tire Brand)
Blocks        A        B        C       Means
Wheel 1      47       46       52       48.33
Wheel 2      45       43       51       46.33
Wheel 3      42       37       49       42.67
Wheel 4      48       50       55       51.00
Means      45.50    44.00    51.75      47.08
Solution:
To use MINITAB, the provided data must be entered into a MINITAB worksheet in stacked format, as shown in the table below:
Tire Wear    Wheel Number    Tire Brand
   47             1              A
   46             1              B
   52             1              C
   45             2              A
   43             2              B
   51             2              C
   42             3              A
   37             3              B
   49             3              C
   48             4              A
   50             4              B
   55             4              C
The sequence for carrying out the required analysis is Stat > ANOVA
> Two-Way, opening a self-explanatory dialog box. The MINITAB results are summarized below.
Two-way ANOVA: Tire Wear versus Wheel Number, Tire Brand

Source          DF        SS        MS       F       P
Wheel Number     3  110.9167   36.9722   11.78   0.006
Tire Brand       2  135.1667   67.5833   21.53   0.002
Error            6   18.8333    3.1389
Total           11  264.9167
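For readers without MINITAB, the same two-way analysis can be reproduced with a general-purpose statistics library. The following sketch uses Python's statsmodels as an assumed alternative; the column names are illustrative.

# Cross-checking the two-way ANOVA of Example 19.2 with statsmodels;
# a sketch under assumed tooling, not the text's own procedure.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "wear":  [47, 46, 52, 45, 43, 51, 42, 37, 49, 48, 50, 55],
    "wheel": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "brand": list("ABC") * 4,
})
model = ols("wear ~ C(wheel) + C(brand)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
# Expect F ~ 11.78 (p ~ 0.006) for wheels and F ~ 21.53 (p ~ 0.002) for brands

Both factors should again emerge significant at the α = 0.05 level.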
FIGURE 19.5: Normal probability plots of the residuals from the two-way classification ANOVA model for investigating tire wear, obtained directly from the ANOVA analysis.

The low p-values indicate a significant difference in how each tire wears; and we have been able to identify this effect without contamination from the wheel number effect, which is itself also significant.
Again, viewed as a regression model, seven parameters, the 3 treatment effects, τ_j; j = 1, 2, 3, and the 4 block effects, β_i; i = 1, 2, 3, 4, can be estimated from the two-way classification model in Eq (19.19). MINITAB provides values for the estimated pure error standard deviation, S = 1.772; it also provides values for R^2 = 92.89% and R^2_adj = 86.97%. These latter values show that a significant amount of the variation in the data has been explained by the two-way classification model. The reduction in the R^2_adj value arises from estimating 7 parameters from a total of 12 data points, with not too many degrees of freedom left. Still, these values are quite decent.
The normal probability plots for the estimated residuals, obtained directly from the ANOVA dialog in MINITAB, are shown in Fig 19.5. A visual assessment indicates that the normality assumption appears valid. It is left as an exercise to the reader to carry out the more rigorous assessment by saving the residuals and then carrying out the rigorous probability model goodness-of-fit test separately (Exercise 19.7).
It is important to note that in the last example, both the brand effect and the wheel effect were found to be significant. Using wheels as a blocking variable allowed us to separate out the significant wheel effect from the real object of the investigation; without blocking, the wheel effect would have been compounded with the brand effect. This could have serious repercussions, particularly when the primary factor has no effect and the blocking factor has a significant effect.
19.3.4 Other Extensions
In the two-way classification case above, the key issue is that in addition to the single factor of interest, there was another variable, a so-called nuisance variable, that could potentially contaminate our analysis. In general, with single factor experiments, there is the possibility of more nuisance variables, but the approach remains the same: block on nuisance variables. When there is only one nuisance variable, the appropriate design, as we have seen, is the randomized complete block design, with analysis provided by the two-way classification ANOVA (one primary factor; one nuisance variable). With two blocking variables, the appropriate design is known as the Latin Square design, leading to a three-way classification ANOVA (one primary factor; two nuisance variables). With three blocking variables, we use the Graeco-Latin Square design, and a four-way classification ANOVA (one primary factor; three nuisance variables), etc. A discussion of these and other such designs lies outside the intended scope of this introductory chapter. (See Box, Hunter and Hunter, 2005.)
19.4 Two-Factor Experiments
with the parameters subject to the constraints:

    Σ_{i=1}^{a} α_i = 0;    Σ_{j=1}^{b} β_j = 0;    Σ_{i=1}^{a} Σ_{j=1}^{b} (αβ)_ij = 0    (19.30)

The hypotheses to be tested are:

    H0: α_1 = α_2 = ... = α_a = 0
    H0: β_1 = β_2 = ... = β_b = 0
    H0: (αβ)_ij = 0 for all i, j                      (19.31)

against the alternatives:

    Ha: α_i ≠ 0 for at least one i
    Ha: β_j ≠ 0 for at least one j
    Ha: (αβ)_ij ≠ 0 for at least one i, j             (19.32)
TABLE 19.5: Data table for typical two-factor experiment

                       Factor B
Factor A        1        2      ...      b
    1        Y_111    Y_121    ...    Y_1b1
             Y_112    Y_122    ...    Y_1b2
              ...      ...     ...     ...
             Y_11r    Y_12r    ...    Y_1br
    2        Y_211    Y_221    ...    Y_2b1
             Y_212    Y_222    ...    Y_2b2
              ...      ...     ...     ...
             Y_21r    Y_22r    ...    Y_2br
   ...        ...      ...     ...     ...
    a        Y_a11    Y_a21    ...    Y_ab1
             Y_a12    Y_a22    ...    Y_ab2
              ...      ...     ...     ...
             Y_a1r    Y_a2r    ...    Y_abr

TABLE 19.6: Two-factor ANOVA table

Source of          Sum of     Degrees of        Mean
Variation          Squares    Freedom           Square    F
Main Effect A      SS_A       a - 1             MS_A      MS_A/MS_E
Main Effect B      SS_B       b - 1             MS_B      MS_B/MS_E
2-factor
Interaction AB     SS_AB      (a - 1)(b - 1)    MS_AB     MS_AB/MS_E
Error              SS_E       ab(r - 1)         MS_E
Total              SS_Y       abr - 1
19.5

19.6 2^k Factorial Experiments

19.6.1
2^k factorial designs are used to study the effect of k factors (and their interactions) simultaneously rather than one-at-a-time. The signature characteristic is that they involve only two levels of each factor (Low, High). This restriction endows these designs with their key advantage: they are very economical, allowing the extraction of a lot of information with relatively few experiments. There is also the peripheral advantage that the fundamental
FIGURE 19.6: 2^2 factorial design for factors A and B showing the four experimental points; - represents low values, + represents high values for each factor.
nature of the design lends itself to computational shortcuts. However, this latter point is no longer of consequence, having been rendered irrelevant by modern computer software. Finally, as we show a bit later, 2^k factorial designs are easily adapted to accommodate experiments involving only a fraction of the total 2^k experiments, especially when k is so large that running the full set of 2^k experiments becomes untenable.
For all their advantages, 2^k factorial experiments also have some important disadvantages. The most obvious is that by restricting attention only to two levels, we limit our ability to confirm the presence of non-linear responses to changes in factor levels. The underlying assumption is that the relationship between the response, Y, and the factors, x_i, is approximately linear (plus some possible interaction terms) over the range of the chosen factor levels. When this is a reasonable assumption, nothing is more efficient than 2^k factorial designs. In many practical applications, the recommendation is to use the 2^k factorial designs (or fractions thereof, see later) to begin experimental investigations and then to augment with additional experiments if necessary.
Notation and Terminology
Because they involve investigations at precisely two levels of each factor, it is customary to use - or -1 to represent the Low level and + or +1 to represent the High level of each factor. In some publications, including journal articles and textbooks, lower case letters a, b, c, . . . are used to represent the factors, and treatment combinations are represented as: (1), a, b, ab, c, ac, bc, abc, . . ., representing, respectively, the all low, A only high (every other factor low), B only high, A and B high, etc.
For example, a 2^2 factorial design involving two factors A and B is shown in Fig 19.6. The design calls for a base collection of four experiments, the first, (-, -), representing the (Low, Low) combination; the second, (+, -), the (High, Low) combination; the third, (-, +), the (Low, High) combination, and finally, the fourth, (+, +), the (High, High) combination. A concrete illustration and application of this design is presented in the upcoming Example 19.4.
Characteristics
The 2^k factorial design enjoys some desirable characteristics that make it particularly computationally attractive:
1. It is balanced: in the sense that there is an equal number of highs and lows for each factor. If the factor terms in the design are represented by ±1, then, for each factor,

    Σ x_i = 0    (19.33)

For example, summing down the column for each factor A and B in the table on the right hand side of Fig 19.6 shows this clearly.

2. It is orthogonal: in the sense that the sum of the products of the coded factors (coded as ±1) is zero, i.e.,

    Σ x_i x_j = 0;  i ≠ j    (19.34)

Multiplying column A by column B in Fig 19.6 and summing confirms this for this 2^2 example.
These two characteristics simplify analysis, allowing the separation of effects, and making it possible to estimate each effect independently. For example, with the 2^2 design of Fig 19.6, from Run #1 (-, -) to Run #2 (+, -), only factor A has changed. The difference between the observations, y_1 for Run #1 and y_2 for Run #2, is therefore a reflection of the effect of changing A (while keeping B at its low value). The same is true for Runs #3 and 4, but at the high level of B; in which case (y_4 - y_3) provides another estimate of the main effect of factor A. This main effect is therefore estimated from the results of the 2^2 design as:

    Main Effect of A = (1/2)[(y_2 - y_1) + (y_4 - y_3)]    (19.35)
The other main effect and the two-way interaction can be computed from similar considerations made possible by the transparency of the design. When experimental data were analyzed by hand, these characteristics led to the development of useful computational shortcuts (for example, the popular Yates algorithm); this is no longer necessary, because of computer software packages.
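Because the design is balanced and orthogonal, each effect can be computed with nothing more than column arithmetic. The sketch below, a hypothetical illustration with made-up response values, estimates the main effects and interaction of a 2^2 design directly from the ±1 coded columns, mirroring Eq (19.35).

# Estimating 2^2 effects from the coded columns; response values are
# hypothetical, used only to illustrate the column arithmetic.
import numpy as np

A = np.array([-1, 1, -1, 1])             # coded factor A, standard order
B = np.array([-1, -1, 1, 1])             # coded factor B
y = np.array([10.2, 14.1, 9.8, 17.9])    # hypothetical run responses

def effect(x, y):
    # difference between average response at x = +1 and at x = -1
    return y[x == 1].mean() - y[x == -1].mean()

print("Main effect A :", effect(A, y))       # cf. Eq (19.35)
print("Main effect B :", effect(B, y))
print("Interaction AB:", effect(A * B, y))   # uses the orthogonal AB column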
19.6.2

The model postulated for 2^k factorial designs is of the general form:

    y = β_0 + Σ_{i=1}^{k} β_i x_i + Σ_{i=1}^{k} Σ_{j=1}^{k} β_ij x_i x_j + Σ_{i=1}^{k} Σ_{j=1}^{k} Σ_{ℓ=1}^{k} β_ijℓ x_i x_j x_ℓ + ... + ε    (19.36)
a somewhat unwieldy looking equation that is actually quite simple for specific cases. The parameter β_0 represents the grand average; β_i is the coefficient related to the main effect of factor i; the double subscripted parameter β_ij is related to the two-way interaction effect of the ith and jth factors; the triple subscripted parameter β_ijℓ is related to the three-way interaction effect of the ith, jth and ℓth factors, etc. Simpler, specific cases are illustrated with the next example.
Example 19.3: TWO-FACTOR AND THREE-FACTOR 2^k FACTORIAL MODELS
Write the postulated 2^k factorial models for the following two practical experimental cases:
(1) Studying the growth of epitaxial layer on silicon wafer by chemical vapor deposition (CVD), where the response of interest, the epitaxial layer thickness, Y, is believed to depend on two primary factors: deposition time, x_1, and arsenic flow rate, x_2.
(2) Characterizing the solubility, Y, of a globular protein as a function of three primary factors: pH, x_1; salt concentration, x_2; and temperature, x_3.
Solution:
The models are as follows: for the CVD experiment,

    y = β_0 + β_1 x_1 + β_2 x_2 + β_12 x_1 x_2 + ε    (19.37)

and for the protein solubility experiment,

    y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + β_12 x_1 x_2 + β_13 x_1 x_3 + β_23 x_2 x_3 + β_123 x_1 x_2 x_3 + ε    (19.38)
19.6.3 Procedure
The general procedure for carrying out 2^k factorial experiments is summarized here (a small design-generation sketch follows the list):
1. Create Design:
Given the number of factors, and the low and high levels for each factor, use computer software (e.g., MINITAB, SAS, ...) to create the design. Using these software packages to create the design is straightforward and intuitive; the result is a design table showing each treatment combination and the recommended run sequence. It is highly recommended to run the experiments in random order, just as with the completely randomized single-factor designs.
Creating the design requires determining how many replicates to run. We present some recommendations shortly.

2. Perform Experiment:
This involves filling in the results of each experiment in the data sheet created above.

3. Analyze Data:
Once the data is entered into the created design worksheet, all statistical packages will carry out the factorial analyses, generating values for the main effects and interactions, and providing associated significance levels. It is also recommended to carry out diagnostic checks to validate the underlying Gaussian distributional assumption.
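Step 1 need not require specialized software. The following minimal sketch (an illustration, not the text's procedure) generates a randomized 2^k full factorial design table in Python.

# Generating a randomized 2^k full factorial design table; the function
# name and argument choices are illustrative assumptions.
import itertools
import random

def full_factorial(k, replicates=1, seed=0):
    base = list(itertools.product([-1, 1], repeat=k))  # 2^k treatment combos
    runs = base * replicates
    random.Random(seed).shuffle(runs)  # run the experiments in random order
    return runs

for run in full_factorial(k=2, replicates=3)[:5]:
    print(run)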
Sample Size Considerations
The discussion in Section 15.5 of Chapter 15 included practical considerations for determining sample sizes for carrying out hypothesis tests regarding means of normal populations. The same considerations can be extended for balanced 2-level factorial designs. In the same spirit, let δ represent the smallest shift from zero we wish to detect in these effects (i.e., the smallest magnitude of the effect worth detecting). With σ, the standard deviation of the random error component, ε, usually unknown ahead of time, we can invoke the definition of the signal-to-noise ratio given in Chapter 15, i.e.,

    SN = δ/σ    (19.39)

The recommended total number of experimental runs for various values of SN is as follows (a small sketch interpreting the table follows):

    SN:           0.5        1.0      1.5      2.0
    Total runs:   196-256    49-64    22-28    12-16
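These tabulated ranges appear consistent with a rule of thumb of roughly 49/SN² to 64/SN² total runs; that reading of the table is an inference, not a formula stated in the text.

# Reproducing the sample-size table above under the assumed rule of thumb
# n_total in [49/SN^2, 64/SN^2]; an inference from the tabulated values.
def recommended_runs(sn):
    return (49 / sn**2, 64 / sn**2)

for sn in (0.5, 1.0, 1.5, 2.0):
    lo, hi = recommended_runs(sn)
    print(f"SN = {sn}: {lo:.0f} to {hi:.0f} total runs")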
Alternatively, the sequence, Stat > Power and Sample Size > 2-level
Factorial Design in MINITAB will also provide recommended values.
The following example illustrates these principles.
Example 19.4: 2^2 FACTORIAL INVESTIGATION OF CVD PROCESS
In an experimental study of the growth of epitaxial layer on silicon wafer by chemical vapor deposition (CVD), the effect of deposition time and arsenic flow rate on the epitaxial layer thickness was studied in a 2^2 factorial experiment using the following settings:
Factor A: Deposition time (Short, Long)
Factor B: Arsenic Flow Rate (55%; 59%)
Designing for a signal-to-noise ratio of 2 leads to a recommended sample size 12 < n < 16, which calls for a minimum of three full replicates of the four basic 2^2 experiments (and a maximum of four replicates). The 2^2 factorial design, in 3 replicates, and the experimental results are shown in the table below. Analyze the data and comment on the results.
RunOrder   Depo. Time   Arsenic Flow Rate   Epitaxial Layer Thick.
    1          -1              -1                  14.04
    2          +1              -1                  14.82
    3          -1              +1                  13.88
    4          +1              +1                  14.88
    5          -1              -1                  14.17
    6          +1              -1                  14.76
    7          -1              +1                  13.86
    8          +1              +1                  14.92
    9          -1              -1                  13.97
   10          +1              -1                  14.84
   11          -1              +1                  14.03
   12          +1              +1                  14.42
Solution:
First, to create the design using MINITAB, the required sequence is Stat > DOE > Factorial > Create Factorial Design, which opens a dialog box where all the characteristics of the problem are specified. The result, which should be stored in a worksheet, is the set of 12 treatment combinations arising from 3 replicates of the base 2^2 design. It is highly recommended to run the actual experiments in random order. (MINITAB can provide a randomized order for the experiments upon selecting the Randomize runs option.)
Once the experiments have been completed and the epitaxial layer thickness measurements entered into the worksheet (as shown in the table above), the sequence for analyzing the data is Stat > DOE > Factorial > Analyze Factorial Design. The dialog box contains
many options and the reader is encouraged to explore them all. The
basic MINITAB output is shown here.
Factorial Fit: Epitaxial Layer versus Depo. Time, Arsenic Flow rate

Estimated Effects and Coefficients for Epitaxial Layer Thick. (coded units)

Term                            Effect     Coef  SE Coef       T      P
Constant                               14.3825  0.04515  318.52  0.000
Depo. Time                      0.7817   0.3908  0.04515    8.66  0.000
Arsenic Flow rate              -0.1017  -0.0508  0.04515   -1.13  0.293
Depo. Time*Arsenic Flow rate    0.0350   0.0175  0.04515    0.39  0.708

S = 0.156418   R-Sq = 90.51%   R-Sq(adj) = 86.96%
This MINITAB output lists both the coefficient and effect associated with each factor, and it should be obvious that one is twice the other. The coefficient represents the parameters in the factorial model equation, e.g., as in Eqs (19.37) and (19.38); from these equations, we see that each coefficient represents the change in y for a unit change in each factor x_i. On the other hand, what is known as the effect is the change in y when each factor, x_i, changes from its low value to its high value, i.e., from -1 to +1. Since this is a change of two units in x, the effects are therefore twice the magnitude of the coefficients. The constant term is unaffected since it is not multiplied by any factor; it is the grand average of the data.
Next, we observe that for each estimated parameter, there is an associated t-statistic and corresponding p-value. These are determined in the same manner as discussed in earlier chapters, using the Gaussian assumption, with the mean square error, MS_E, providing S, the estimate of the standard deviation. The null hypothesis is that all coefficients are identically zero. In this specific case, only the constant term and the Deposition time effect appear to be significantly different from zero; the Arsenic flow rate term and the two-factor interaction term appear not to be significant at the α = 0.05 level. MINITAB also lists the usual regression characteristics, R^2 and R^2_adj, none of which appear to be particularly indicative of anything unusual.
Finally, MINITAB also produces the ANOVA table shown here. This is a consolidation of the detailed information already shown above. It indicates that the main effects, as a composite (not separating one main effect from the other), are significant (p-value is 0.000); the two-way interaction is not (p-value is 0.708).
Analysis of Variance for Epitaxial Layer Thick. (coded units)

Source               DF   Seq SS    Adj MS       F      P
Main Effects          2  1.86402   0.93201   38.09  0.000
2-Way Interactions    1  0.00368   0.00368    0.15  0.708
Residual Error        8  0.19573   0.02447
Total                11  2.06343
19.6.4 Closing Remarks
The best applications of 2^k factorial designs are for problems with a relatively small number of factors, say k < 5, and when the relationship between y and the factors is reasonably linear in the Low-High range explored experimentally, with non-linearities limited to cross-terms in the factors (no quadratic or higher order effects). The direct corollary is that practical problems not suited to 2^k factorial designs include those with a large number of potentially important factors, and those for which nonlinear relationships are important to the investigation, for example, for applications in process optimization.
For the latter group of problems, specific extensions of the 2^k factorial designs are available to deal with each problem:
1. Screening Designs (for a large number of factors);
2. Response Surface Methods (for more complex modeling and optimization).
We deal next with these designs in turn, beginning with screening designs.
Screening designs are experimental designs specifically calibrated for selecting, from among a large group of potential factors, only the few that are truly important, prior to carrying out more detailed experiments. Such designs therefore involve significantly fewer experimental runs than required by the 2^k factorial designs. They are created to avoid the expenditure of a lot of experimental effort, since the objective is quicker decisions on factor importance, rather than detailed characterization of effects on responses. The two most popular categories are Fractional Factorial Designs and Plackett-Burman designs.
While screening designs involve running fewer experiments than called for by the 2^k factorial designs, response surface designs involve judiciously adding more experimental runs, to capture more complex relationships better.
19.7 Fractional Factorial Designs

19.7.1
19.7.2

Consider the full 2^3 factorial design for three factors A, B, and C:

Run #    A    B    C
  1     -1   -1   -1
  2     +1   -1   -1
  3     -1   +1   -1
  4     +1   +1   -1
  5     -1   -1   +1
  6     +1   -1   +1
  7     -1   +1   +1
  8     +1   +1   +1
Now suppose that we are willing to give up the ability to determine the three-way interaction ABC, in return for being able to investigate the effect of D also. In the language of fractional factorial design, this is represented as:

    D = ABC    (19.42)

an expression to which we shall return shortly. It can be shown that the code corresponding to ABC is obtained from a term-by-term multiplication of the signs in the columns A, B, and C in the design table. Thus, for example, the first entry for run #1 will be -1 (= -1 × -1 × -1), run #2 will be +1 (= +1 × -1 × -1), etc. The result is the updated table shown here.
Run #    A    B    C    D
  1     -1   -1   -1   -1
  2     +1   -1   -1   +1
  3     -1   +1   -1   +1
  4     +1   +1   -1   -1
  5     -1   -1   +1   +1
  6     +1   -1   +1   -1
  7     -1   +1   +1   -1
  8     +1   +1   +1   +1
Thus, by giving up the ability to estimate the three-way interaction ABC, we have obtained the 8-run design shown in this table for 4 factors, a 50% reduction in the number of runs (i.e., a half-fraction of what should have been a full 2^4 factorial design). This seems like a reasonable price to pay; however, this is not the whole cost. Observe that the code for the two-way interaction AD (obtained by multiplying each term in the A and D columns), written horizontally, is (+1, +1, -1, -1, -1, -1, +1, +1), which is precisely the same as the code for the two-way interaction BC! But even that is still not all. It is left as an exercise to the reader to show that the code for AB is also the same as that for CD; similarly, the codes for AC and for BD are also identical.
Observe therefore that for this problem,
1. The primary trade-off, D = ABC, allowed an eight-run experimental design for a 4-factor system, precisely half of the 16 runs ordinarily required for a full 2^4 design for 4 factors;
2. But D = ABC is not the only resulting trade-off; other trade-offs include AD = BC, AB = CD, and AC = BD; plus some others;
3. The implication of these secondary (collateral) trade-offs is that these two-way interactions, for example, AD and BC, are now confounded, being indistinguishable from each other; they cannot be estimated independently.
Thus, when we give up some high-order interactions to estimate some other factors, we also lose the ability to estimate other additional effects independently.
19.7.3 General characteristics
A 2^(k-p) design is a (1/2^p) fraction of the full 2^k design. For example, a 2^(5-2) design is a quarter fraction of the full 2^5 design, which consists of 8 total runs (1/4 of the full 32).
As illustrated above, the reduction in the total number of runs in 2^(k-p) designs is achieved at a cost; this cost of fractionation, the confounding of two effects so that they cannot be independently estimated, is known as Aliasing. And for every fractional factorial design, there is an accompanying alias structure, a complete listing of what is confounded with what. Such alias structures can be determined from what is known as the defining relation, an expression, such as the one in Eq (19.42) above, indicating the primary trade-off.
There are simple algebraic rules for determining alias structures. For instance, upon multiplying both sides of Eq (19.42) by D and using the simple rule that DD = I, the identity column, we obtain,

    I = ABCD    (19.43)

This is the defining relation for this particular fractional factorial design. The additional aliases can be obtained using the same algebraic rule: upon multiplying both sides of Eq (19.43) by A, and then by B, and then C, we obtain:

    A = BCD;  B = ACD;  C = ABD    (19.44)

showing that, like the main effect D, the other main effects A, B, and C are also confounded with the indicated three-way interactions. From here, upon multiplying the expressions in Eq (19.44) by the appropriate letters, we obtain the following additional aliases:

    AB = CD;  AC = BD;  AD = BC    (19.45)

Observe that for this design, main effects are confounded with 3-way interactions only; and 2-way interactions are confounded with other 2-way interactions.
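The alias algebra can also be verified numerically: two "words" are aliased exactly when their ±1 columns are identical. The following sketch (with illustrative helper names, not from the text) checks the aliases of Eqs (19.44) and (19.45) for the D = ABC design.

# Verifying the alias structure of the 2^(4-1) design with D = ABC by
# direct column multiplication; helper names are illustrative.
import itertools
import numpy as np

runs = np.array(list(itertools.product([-1, 1], repeat=3)))  # A, B, C columns
cols = {"A": runs[:, 0], "B": runs[:, 1], "C": runs[:, 2]}
cols["D"] = cols["A"] * cols["B"] * cols["C"]                # generator D = ABC

def word(w):
    # elementwise product of the named columns, e.g. "AD" -> A*D
    out = np.ones(len(runs), dtype=int)
    for f in w:
        out = out * cols[f]
    return out

for pair in [("AD", "BC"), ("AB", "CD"), ("AC", "BD"), ("A", "BCD")]:
    print(pair, "aliased:", np.array_equal(word(pair[0]), word(pair[1])))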
Design Resolution
The order of the effects that are aliased is captured succinctly in the design resolution. For example, the illustration used above is a Resolution IV design, because 2-way interactions are confounded with other 2-way interactions, and main effects (1-way interactions) are confounded with 3-way interactions.
The resolution of a design is typically represented by roman numerals, III, IV, V, etc.; they define the cost of fractionation. The higher the design resolution, the better we are able to determine main effects, two-way interactions (and even 3-way interactions) independently. The following notation is typical: 2^(4-1)_IV represents a Resolution IV, half fraction of a 2^4 factorial design, such as the illustrative example used above. On the other hand, a 2^(5-2)_III design is a Resolution III quarter fraction of a 2^5 design.
19.7.4 Basic Principles
The design and analysis of fractional factorial designs are very similar to those for the basic factorial designs. Here are the key points of interest.
1. Planning: Determine k (total number of factors), and p (extent of fractionation), giving the total number of unreplicated runs; determine how many replicates are required using the rule-of-thumb specified above or the MINITAB power and sample size feature.
2. Design: Given the above information, typical computer software packages will display available designs and their respective resolutions; select an appropriate resolution (recommend IV or higher where possible). The computer program will then generate the design and the accompanying alias structure. It is important to save both into a worksheet.
3. Analysis: This follows the same principles as the full 2^k designs, but in interpreting the results, keep aliases in mind.
Before dealing with a comprehensive example, there are two more concepts of
importance in engineering practice we wish to discuss.
Projection and Folding
Consider, for example, that one or more factors in a half fraction of a 2^k design (i.e., a 2^(k-1) design) is not significant; we can then project the data
FIGURE 19.7: Graphic illustration of folding (panels: Original Half-fraction; Alternate Half-fraction; Combined Fold-Over Design), where two half-fractions of a 2^3 factorial design are combined to recover the full factorial design; each fold costs an additional degree of freedom for analysis.
down to a full factorial in the significant k - 1 variables. For instance, imagine that after completing the experiments in the 4-factor illustrative example presented above, we discover that D was not significant after all; observe that the data can then be considered as arising from a full 2^3 factorial design in A, B and C. If, for example, both C and D are not significant, then the data can be projected down two full dimensions, so that the 8 experiments will be considered as 2 full replicates of a full 2^2 factorial design in the significant factors A and B. Everywhere the factor C was investigated therefore becomes a replicate. Where the opportunity presents itself, projection therefore always strengthens the experimental analysis in the remaining factors. This is illustrated shortly with a concrete example.
The reverse is the case with folding, combining lower fraction designs into higher fraction ones: for example, combining two 1/2 fractions into a full factorial design, or two 1/4 fractions into a 1/2 fractional factorial design. This is illustrated in Fig 19.7. This strategy is employed if, after the analysis of the fractional factorial design results, we discover that some of the confounded interactions and effects are important enough to be determined independently. Folding increases the design resolution by providing additional information required to resolve aliases. However, each fold costs an additional degree of freedom for analysis.
19.7.5
Problem Statement
The problem involves a single-wafer plasma etcher process which uses the reactant gas C2F6. The response of interest is the etch rate for Silicon Nitride (in Å/min), which is believed to be dependent on the 4 factors listed below.
A: Gap; cm. (The spacing between the Anode and Cathode)
B: Reactor Chamber Pressure; mTorr.
C: Reactant Gas (C2F6) Flow rate; SCCM
D: Power (applied to Cathode); Watts.
The objective is to determine which factors affect etch rate by investigating the process response at the values indicated below for the factors, using a 2^(4-1) design with no replicates.

Variable                     Levels
                           -1      +1
A. Gap (cm)                0.8     1.2
B. Pressure (mTorr)        450     550
C. Gas Flow Rate (sccm)    125     200
D. Power (Watts)           275     325
Design and Data Collection
The required design is created in MINITAB using the sequence Stat > DOE > Factorial > Create Factorial Design > and entering the problem characteristics. MINITAB returns both the design (which is saved into a worksheet) and the following characteristics, including the alias structure.

Fractional Factorial Design

Factors:  4   Base Design: 4, 8    Resolution: IV
Runs:     8   Replicates:  1       Fraction:   1/2
Blocks: none  Center pts (total):  0

Design Generators: D = ABC

Alias Structure
I + ABCD
A + BCD
B + ACD
C + ABD
D + ABC
AB + CD
AC + BD
AD + BC
Note that this is precisely the same design generator and alias structure as in the Resolution IV example used above to illustrate the mechanics of fractional factorial design generation.
The design table, along with the data acquired using the design, is shown below:
Std     Run            Pressure   Gas     Power   Etch Rate
Order   Order   Gap    (mTorr)    Flow    (W)     (Å/min)
  1       5     0.8      450      125     275        550
  2       7     1.2      450      125     325        749
  3       1     0.8      550      125     325       1052
  4       8     1.2      550      125     275        650
  5       6     0.8      450      200     325       1075
  6       4     1.2      450      200     275        642
  7       2     0.8      550      200     275        601
  8       3     1.2      550      200     325        729
Note the difference between the randomized order in which the experiments were performed and the standard order.
Analysis Part 1
To analyze this data set, the sequence Stat > DOE > Factorial > Analyze Factorial Design > opens a dialog box with several self-explanatory options. Of these, we draw particular attention to the button labeled "Terms." Upon selecting this button, a further dialog box is opened in which the terms to be included in the analysis are shown. It is interesting to note that the default already selected by MINITAB shows only the four main effects, A, B, C, D, and three two-way interactions, AB, AC and AD. The reason, of course, is that everything else is aliased with these terms. Next, the button labeled "Plots" allows one to select which plots to display. For reasons that will become clearer later, we select for the Effects Plots the Normal Plot option. The button labeled "Results" shows what MINITAB will include in the output: estimated coefficients and ANOVA table, Alias table with default interactions, etc. The results for this particular analysis are shown here.
Results for: Plasma.MTW
Factorial Fit: Etch Rate versus Gap, Pressure, Gas Flow, Power
FIGURE 19.8: Normal probability plot for the effects, using Lenth's method to identify A, D and AD as significant.
Such a probability plot for this data set (using Lenth's method to determine the significant effects) is shown in Fig 19.8, where the effects A and D, respectively Gap and Power, are identified as important, along with the two-way interaction AD (i.e., Gap*Power).
At this point, it is important to pause and consider the repercussions of aliasing. According to the alias structure for this Resolution IV design, AD = BC, i.e., it is impossible, from this data, to distinguish between the Gap*Power interaction effect and the Pressure*Gas Flow interaction effect. Thus, is the identified significant interaction the former (as indicated by default), or the latter? This is where domain knowledge becomes crucial in interpreting the results of fractional factorial data analysis. First, from pure common sense, if Gap and Power are identified as significant factors, it is far more natural and more likely that the two-way interaction that is also of significance will be Gap*Power and not Pressure*Gas Flow. This common sense conjecture is in fact corroborated by the physics of the process: the spacing between the anode and cathode (Gap) and the power applied to the cathode are far more likely to influence etch rate than the interaction between Pressure and Gas Flow rate, especially when none of these individual factors appear important by themselves. We therefore conclude, on the basis of the factorial analysis, aided especially by the normal probability plot of the effects, that Gap and Power are the important factors that affect etch rate. This finding presents us with a fortuitous turn of events: we started with 4 potential factors, performed a set of 8 experiments based on a 2^(4-1) fractional factorial design (with no replicates), and discovered that only 2 factors are significant. This immediately suggests projection. By projecting down onto the two relevant dimensions represented by Gap (x_1) and Power (x_2), the 8 runs become 2 full replicates of a 2^2 factorial design in these two factors, from which the following model is obtained:

    ŷ = 756.0 - 63.5 x_1 + 145.25 x_2 - 98.75 x_1 x_2    (19.46)
The R^2 value and other affiliated measures of the variability in the data explained by this simple model are also contained in the MINITAB results shown above. These values indicate, among other things, that the amount of
the variability in the data explained by this simple model is quite substantial. A normal probability plot of the estimated model residuals is shown in Fig 19.9, where, visually, we see no reason to question the normality of the residuals.

FIGURE 19.9: Normal probability plot for the residuals of the Etch rate model in Eq (19.46), obtained upon projection of the experimental data to retain only the significant terms A, Gap (x_1), D, Power (x_2), and the interaction AD, Gap*Power (x_1 x_2).
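As a cross-check of the terms retained in Eq (19.46), the effects can be recomputed directly from the eight tabulated runs; the following sketch is an illustration, not the text's own analysis.

# Recomputing the etch-rate effects from the 2^(4-1) data in standard order,
# confirming that A, D and AD dominate; a cross-check sketch only.
import numpy as np

A = np.array([-1, 1, -1, 1, -1, 1, -1, 1])   # Gap, coded
B = np.array([-1, -1, 1, 1, -1, -1, 1, 1])   # Pressure
C = np.array([-1, -1, -1, -1, 1, 1, 1, 1])   # Gas flow
D = A * B * C                                 # Power (generator D = ABC)
y = np.array([550, 749, 1052, 650, 1075, 642, 601, 729], dtype=float)

for name, x in [("A", A), ("B", B), ("C", C), ("D", D), ("AD", A * D)]:
    print(name, y[x == 1].mean() - y[x == -1].mean())
# A ~ -127, D ~ +290.5, AD ~ -197.5 dominate; B and C are comparatively small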
19.8 Plackett-Burman Designs
The gaps in run sizes left by fractional factorial designs can now be filled by what has become known, appropriately, as Plackett-Burman (PB) designs: where fractional factorial designs involve 2^r experimental runs, PB designs involve runs in multiples of 4. Thus, with PB designs, experimental runs of sizes N = 12, 20, 24, 28, etc., are now possible.
19.8.1 Primary Characteristics
PB designs are also 2-level designs (like fractional factorials), but where fractional factorial designs involve runs that are powers of 2, PB designs have experimental runs that are multiples of 4. All PB designs are of Resolution III, however, so that all main effects are aliased with two-way interactions. They should not be used, therefore, when 2-way interactions might be as important as main effects.
PB designs are best used to screen for critical main effects when the number of potential factors is large. The primary advantage is that they involve remarkably few runs for a large number of factors; they are therefore extremely efficient and cost-effective. For example, the following is the design table for a PB design of 12 runs for k = 7 factors!
Run #    A   B   C   D   E   F   G
  1      +   +   -   +   +   +   -
  2      -   +   +   -   +   +   +
  3      +   -   +   +   -   +   +
  4      -   +   -   +   +   -   +
  5      -   -   +   -   +   +   -
  6      -   -   -   +   -   +   +
  7      +   -   -   -   +   -   +
  8      +   +   -   -   -   +   -
  9      +   +   +   -   -   -   +
 10      -   +   +   +   -   -   -
 11      +   -   +   +   +   -   -
 12      -   -   -   -   -   -   -
The main disadvantages have to do with their resolution: with PB designs, it is difficult, if not entirely impossible, to determine interaction effects independently. Furthermore, the alias structures are quite complicated (but important). The designs have also been known to be prone to poor precision, but this can be mitigated with the use of replicates. It is recommended that computer software be used for both design and analysis of experiments using the PB strategy.
19.8.2

Programs such as MINITAB will generate PB designs and the accompanying alias structure. Because they are orthogonal two-level designs, the analysis is similar to that for factorials. But we emphasize again that these are best carried out using computer programs.
Additional discussions are available in Chapter 7 of the Box, Hunter and Hunter (2005) reference provided earlier. An application of PB designs in biotechnology discussed in Balusu, et al. (2004) is highly recommended to the interested reader.
19.9 Response Surface Designs

Frequently, the objective of the experimental study is to capture the relationship between the response y and the factors x_i mathematically, so that the resulting model can be used to optimize the response with respect to the factors. Under these circumstances, for such models to be useful, they will have to include more than the approximate linear ones possible with two-level designs. Response surface methodology is the approach for obtaining models of the sort required for optimization studies. They provide the designs for efficiently fitting more complex models to represent the relationship between the response, Y, and the factors, with the resulting model known as the response surface. A detailed discussion is impossible in this lone section of an introductory chapter devoted to the topic. The classic reference is the book by Box and Draper that the interested reader is encouraged to consult. What follows is a summary of the salient features of this experimental design strategy that finds important applications in engineering practice.
19.9.1 Characteristics
Run    X1    X2    X3
 1     -1    -1    -1
 2     +1    -1    -1
 3     -1    +1    -1
 4     +1    +1    -1
 5     -1    -1    +1
 6     +1    -1    +1
 7     -1    +1    +1
 8     +1    +1    +1
 9     -1     0     0
10     +1     0     0
11      0    -1     0
12      0    +1     0
13      0     0    -1
14      0     0    +1
15      0     0     0
16      0     0     0
17      0     0     0
FIGURE 19.10: The 3-factor face-centered cube (FCC) response surface design and its constituent parts: 2^3 factorial base, open circles; face center points, lighter shaded circles; center point, darker solid circle.
surface curvature. For example, the typical postulated model for two factors is:

    y = β_0 + β_1 x_1 + β_2 x_2 + β_12 x_1 x_2 + β_11 x_1² + β_22 x_2² + ε    (19.47)
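Fitting a quadratic model like Eq (19.47) is again an ordinary least-squares problem once the design matrix includes the squared terms. The sketch below fits the two-factor model on a small face-centered design; the response values are synthetic placeholders, not data from the text.

# Fitting the quadratic response-surface model of Eq (19.47) by least
# squares on a 2-factor face-centered design; responses are synthetic.
import numpy as np

# 2^2 corners, 4 face centers, 1 center point
x1 = np.array([-1, 1, -1, 1, -1, 1, 0, 0, 0], dtype=float)
x2 = np.array([-1, -1, 1, 1, 0, 0, -1, 1, 0], dtype=float)
y = np.array([5.1, 6.9, 6.2, 9.8, 5.6, 8.1, 6.0, 7.7, 7.2])  # hypothetical

X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["b0", "b1", "b2", "b12", "b11", "b22"], beta.round(3))))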
19.9.2
Run    X1    X2    X3
 1     -1    -1     0
 2     +1    -1     0
 3     -1    +1     0
 4     +1    +1     0
 5      0    -1    -1
 6      0    +1    -1
 7      0    -1    +1
 8      0    +1    +1
 9     -1     0    -1
10     +1     0    -1
11     -1     0    +1
12     +1     0    +1
13      0     0     0
14      0     0     0
15      0     0     0
FIGURE 19.11: The 3-factor Box-Behnken response surface design and its constituent parts: X1, X2: 2^2 factorial points moved to the center of X3 to give the darker shaded circles at the edge-centers of the X3 axes; X2, X3: 2^2 factorial points moved to the center of X1 to give the lighter shaded circles at the edge-centers of the X1 axes; X1, X3: 2^2 factorial points moved to the center of X2 to give the solid circles at the edge-centers of the X2 axes; the center point, open circle.
factor and a point at the dead center of the cube. Thus, the first four runs are based on the four X1, X2 factorial points moved to the center of X3, to give the darker shaded circles at the edge-centers of the X3 axes. The next four are based on the four X2, X3 factorial points moved to the center of X1, to give the lighter shaded circles at the edge-centers of the X1 axes. The next four, likewise, are based on the X1, X3 factorial points moved to the center of X2, to give the solid circles at the edge-centers of the X2 axes. The center points are the open circles. The design also calls for three replicates of the center points. Note that while the three-factor FCC design involves 17 points, the Box-Behnken design involves 15.
19.9.3

19.10 Optimal Designs

19.10.1
Whether stated explicitly or not, all statistical experimental design strategies are based on assumed mathematical models. To buttress this point, at every stage of the discussion in this chapter, we have endeavored to state explicitly the postulated model underlying each design. Most of the models we have encountered have been relatively simple, involving at most polynomials of modest order. What happens if an experimenter has available a model for the system under investigation, and the objective is to estimate the unknown model parameters using experimental data? Under such circumstances, the appropriate experimental design should be one that provides, for the given model, the best set of experimental conditions so that the acquired data is most informative for the task at hand: estimating the unknown model parameters. This is the motivation behind optimal designs.
If the postulated model is of the form:

    y = Xβ + ε    (19.48)

i.e., the matrix form of the general linear regression model, we may recall that, given the data vector, y, and the design matrix, X, the least-squares estimate of the parameter vector is:

    β̂ = (XᵀX)⁻¹ Xᵀ y    (19.49)

Optimal experimental designs for problems of this kind are concerned with determining values for the elements of the matrix X that will provide estimates in Eq (19.49) that are optimal in some specific sense. Not surprisingly, the optimality criteria are usually related to the matrix

    F_I = XᵀX    (19.50)

known as the Fisher information matrix.
19.10.2

The D-Optimal design selects values of the factors x to maximize |XᵀX|, the determinant of the information matrix. This optimization criterion maximizes the information content of the data, and hence of the estimate. It is possible to show that for the factorial models in Eq (19.36), with k factors, the D-Optimal design is precisely the 2^k factorial design, where -1 and +1 represent, respectively, the left and right extremes of the feasible region for each factor, giving a k-dimensional hypercube in the experimental space.
Other optimality criteria give rise to variations in the optimal design arsenal. Some of these are listed below (in alphabetical order!), with the optimality criteria themselves and what they mean for the parameter estimates; a small numerical check of D-optimality follows the list:

1. A-Optimal designs: minimize the trace of (XᵀX)⁻¹. And because (as we may recall from Chapter 16),

    Var(β̂) = (XᵀX)⁻¹ σ²    (19.51)

with σ² as the error variance, the A-optimality criterion minimizes the average variance of the estimated coefficients, a desirable property.
Since the inverse of a matrix is its adjoint divided by its determinant, it should be clear therefore that the D-Optimality criterion minimizes the general variance of the estimated parameters.

2. E-Optimal designs: maximize the smallest eigenvalue of XᵀX. This is essentially the same as improving the condition number of the design matrix X, preventing the sort of ill-conditioning that leads to poor estimation. This is a less-known, and less popular, design criterion.

3. G-Optimal designs: minimize the maximum diagonal element of the hat matrix, defined in Chapter 16 as H = X(XᵀX)⁻¹Xᵀ, which, as we may recall, is associated with the model's predicted values. The G-optimality criterion minimizes the maximum variance associated with the predicted values, ŷ.

4. V-Optimal designs: minimize the trace (sum of the diagonal elements) of the hat matrix. As such, the V-optimality criterion minimizes the average prediction variance associated with ŷ.
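The D-optimality of the 2^2 factorial design for the two-factor model of Eq (19.36) can be checked numerically by comparing det(XᵀX) for the full factorial against a shrunken alternative with points at ±α; this sketch is an illustration under those assumptions.

# Comparing det(X'X) for the 2^2 factorial (levels at +/-1) against a
# shrunken design at +/-alpha; illustrates why shrinking loses information.
import numpy as np

def information_det(a):
    x1 = np.array([-a, a, -a, a])
    x2 = np.array([-a, -a, a, a])
    X = np.column_stack([np.ones(4), x1, x2, x1 * x2])
    return np.linalg.det(X.T @ X)

print(information_det(1.0))   # 256 = 4^4 for the full factorial
print(information_det(0.5))   # 1.0: far less information at alpha = 0.5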
Even with such a simple linear model, the computations involved in obtaining these optimal designs are not trivial; and things become even more complicated when the models are non-linear in the parameters. Also, these optimal designs have come under some criticism for various reasons (see Chapter 13 of Ryan (2007)). Nevertheless, they have been put to good use in some chemical engineering applications.
19.11
of statistics (graphical analysis; estimation; hypothesis testing; regression; experimental design) in real-life applications.
REVIEW QUESTIONS
1. In what sense is the term experimental studies used in this chapter?
2. What is an observational study?
3. What is the key distinguishing characteristic of the experimental studies of concern in this chapter?
4. What are the two basic tasks involved in the experimental studies discussed in this chapter? What complicates the effective execution of these tasks?
5. How does statistical design of experiments enable efficient conduct of experimental studies?
6. List the phases of efficient experimental studies and what each phase entails.
7. In the terminology of statistical design of experiments, what is a response, a factor, a level, and a treatment?
8. When more than two population means are to be compared simultaneously, why
are multiple pairwise t-tests not recommended?
9. What is ANOVA, and on what is it predicated?
10. What are the two central assumptions in ANOVA?
11. What is a one-way classification of a single factor experiment, and how is it different from a two-way classification?
12. What is the postulated model and what are the hypotheses for a one-way classification experiment?
13. What is the completely randomized design, and why is it appropriate for the one-way classification experiment?
14. What makes a one-way classification experimental design balanced as opposed to unbalanced?
15. What is the ANOVA identity, and what is its primary implication in data analysis?
16. What is the difference between a fixed effect and a random effect ANOVA?
17. What is the Kruskal-Wallis test?
18. What is the randomized complete block design?
19. What is the postulated model and what are the hypotheses for the randomized
complete block design?
20. What is a nuisance variable?
21. What is the difference between a 2-way classification of a single factor experiment and a two-factor experiment?
22. What is the postulated model and what are the hypotheses for a two-factor experiment?
23. What is the potential problem with general multi-level multi-factor experiments?
24. What is a 2^k factorial experiment?
25. What are some of the advantages and disadvantages of 2^k factorial designs?
26. What does it mean that 2^k factorial designs are balanced and orthogonal?
27. What is the general procedure for carrying out 2^k factorial experiments?
28. 2^k factorial designs are best applied to what kinds of problems?
29. What are screening designs and what is the rationale behind them?
30. In fractional factorial designs, what is aliasing, and what is an alias structure?
31. What is a defining relation?
32. What is the resolution of a fractional factorial design?
33. What are the main characteristics of a Resolution III, a Resolution IV, and a
Resolution V design?
34. If one is interested in estimating both main effects and 2-way interactions, why should one not use a Resolution III design?
35. What is projection, and under what condition is it possible?
36. What is folding?
37. What problem with fractional factorial design is ameliorated with Plackett-Burman designs?
38. What is the resolution of all Plackett-Burman designs?
39. What are Plackett-Burman designs best used for?
40. What are response surface designs and what distinguishes them from basic 2^k factorial designs?
41. What is a typical response surface design model for a two-factor experiment?
42. What is the difference between a face centered cube design and a Box-Behnken design?
43. When is a Box-Behnken design to be preferred over a face centered cube design?
44. What is an optimal experimental design?
45. What is the Fisher information matrix for a linear regression model?
46. What optimization criteria lead respectively to D-Optimal, A-Optimal, E-Optimal, G-Optimal and V-Optimal designs?
EXERCISES
19.1 In each of the following, identify the response, the factors, the levels, and the total number of treatments. Also identify which variables are categorical and which are quantitative.
(i) A chemical engineering catalysis researcher is interested in the effect of NO concentration, (3500 ppm, 8650 ppm); O2 concentration (4%, 8%); CO concentration, (3.5%, 5.5%); Space velocity, (30,000, 42,500) mL/hr/g_cat; Temperature, (548 K, 648 K); SO2 concentration, (0 ppm, 300 ppm); and Catalyst metal type, (Pt, Ba, Fe), on saturation NOx storage (μmol).
(ii) A material scientist studying a reactive extrusion process is interested in the effect of screw speed, (135 rpm, 150 rpm), feed-rate, (15 lb/hr, 25 lb/hr), and feed-composition, %A, (25, 30, 45), on the residence time distribution, f(t).
(iii) A management consultant is interested in the risk-taking propensity of three types of managers: entrepreneurs, newly-hired managers and newly promoted managers.
(iv) A child psychologist is interested in the effect of socio-economic status of parents (Lower class, Middle class, Upper class), Family size (Small, Large) and Mother's marital status (Single-never married, Married, Divorced), on the IQ of 5-year-olds.
19.2 Consider the postulated model for the single-factor, completely randomized experiment given in Eq (19.2):

    Y_ij = μ_j + ε_ij;  i = 1, 2, ..., n_j;  j = 1, 2, ..., k

where μ_j is the mean associated with the jth treatment; furthermore, let the grand mean of the complete data set be μ.
(i) If the jth treatment mean is expressed as:

    μ_j = μ + τ_j;  j = 1, 2, ..., k

so that τ_j represents the jth treatment effect, show that

    Σ_{j=1}^{k} τ_j = 0    (19.52)

    Level 1:  1  2  3
    Level 2:  4  5  6
    E_Y = E_T + E_E

Take sums of squares in this equation and show that the result is the following sum-of-squares identity:

    Σ_{j=1}^{k} Σ_{i=1}^{n_j} (Y_ij - Ȳ..)² = Σ_{j=1}^{k} n_j (Ȳ.j - Ȳ..)² + Σ_{j=1}^{k} Σ_{i=1}^{n_j} (Y_ij - Ȳ.j)²

i.e., SS_Y = SS_T + SS_E
19.5 Consider the following data set, arranged in five blocks, B1-B5, for four treatments, F1-F4:

        F1     F2     F3     F4
B1     14.5   11.5   12.7   10.9
B2     10.6   10.2   13.6   10.8
B3     12.0    9.1   14.6    9.7
B4      9.0    8.8   12.2    8.7
B5     11.6   10.6   15.3   10.0
(i) First analyze the data as a one-way classification with five replicates and comment on the results, especially the p-value associated with the ANOVA F-test. Note the R^2 and R^2_adj values. What do these values indicate about how much of the variation in the data has been explained by the one-way classification model?
(ii) Now analyze the data as a two-way classification, where the blocks are now explicitly recognized as such. Compare the results with those obtained in (i) above. Comment on what this exercise indicates about what can happen when a nuisance effect is not explicitly separated out from an analysis of a one-factor experiment.
19.6 Refer to Exercise 19.5 and the supplied data. Repeat the analysis, this time saving the residuals in each case (one-way first, and then two-way next). Carry out a normality test on both sets of residuals, plot both residuals on the same graph, and compare their standard deviations. Comment on what these residuals imply about which ANOVA model more appropriately fits the data, especially in light of what is known about how this particular data set was generated.
19.7 Refer to Example 19.2 in the text. Repeat the data analysis in the example and save the residuals for analysis. Assess the normality of the residuals. Interpret the results and discuss what they imply about the ANOVA model for the Tire wear data.
19.8 Write out the model for a 2^4 factorial design where the factors are x_1, x_2, x_3 and x_4 and the response is y.
(i) How many parameters are to be estimated in order to specify this model completely?
(ii) Obtain a base design for this experiment.

19.9 For each of the following base factorial designs, specify the number of replicates required.
(i) 2^3 with signal-to-noise ratio specified as SN = 1.5
(ii) 2^2 with signal-to-noise ratio specified as SN = 1.5
(iii) 2^2 with signal-to-noise ratio specified as SN = 2.0
(iv) 2^4 with signal-to-noise ratio specified as SN = 1.5
19.10 A factorial experiment is to be designed to study the effect on reaction yield of temperature, x_1, at 180°C and 240°C, in conjunction with pressure at 1 atm and 2 atm. Obtain a design in terms of the original variables with three full replicates.
19.11 The design matrix for a 2^2 factorial design for factors X1 and X2 is shown below, along with the measured responses y_i, i = 1, ..., 4.

Run    X1    X2    Response
 1     -1    -1       y1
 2     +1    -1       y2
 3     -1    +1       y3
 4     +1    +1       y4
(i) The model, y = β_0 + β_1 x_1 + β_2 x_2 + β_12 x_1 x_2 + ε, can be written for the four runs in the matrix form:

    [y1]       [β_0 ]   [ε1]
    [y2] = X   [β_1 ] + [ε2]
    [y3]       [β_2 ]   [ε3]
    [y4]       [β_12]   [ε4]

i.e., as

    y = Xβ + ε    (19.57)

From the design table given above, determine the matrix X in terms of -1 and +1.
(ii) From Chapter 16, we know that the least squares estimate of the unknown parameters in Eq (19.57) is:

    β̂ = (XᵀX)⁻¹ Xᵀ y
Show that for this particular 2^2 factorial design model, the least squares solution, β̂_0, β̂_1, β̂_2, β̂_12, is given as follows:

    β̂_0  = (1/4)(y1 + y2 + y3 + y4)
    β̂_1  = (1/4)(-y1 + y2 - y3 + y4)
    β̂_2  = (1/4)(-y1 - y2 + y3 + y4)
    β̂_12 = (1/4)(y1 - y2 - y3 + y4)
(iii) Examine the design table shown above and explicitly associate the elements of
this design table directly with the least squares solution in (ii); identify how such a
solution can be obtained directly from the table without necessarily going through
the least squares computation.
19.12 The following data was obtained from a 2^2 factorial experiment for factors A and B with two replicate runs. Analyze the data and estimate the main effects and the interaction term.

Run    A    B    Replicate 1   Replicate 2
 1    -1   -1       0.68          0.65
 2    +1   -1       3.81          4.03
 3    -1   +1       1.67          1.71
 4    +1   +1       8.90          9.66
19.13 Generate a design table for a 2^4 factorial experiment for factors A, B, C and D, to be used as a base design for a 2^(5-1) half factorial design which now includes a fifth factor E.
(i) Use ABC = E to generate the 2^(5-1) design and show the resulting table. Obtain the alias structure for the remaining 3-factor interactions. What is the resolution of this design? Which main effect can be estimated with no confounding?
(ii) This time, use BCD = E to generate the 2^(5-1) design and repeat (i).
19.14 The following table shows the result of a full 32-run, 2^5 factorial experiment involving 5 factors A, B, C, D and E.

[Data table: runs 1-32 with the coded settings of A, B, C, D, E and the corresponding responses.]
(i) Estimate all the main effects and interactions. (Of course, you should use a computer program.)
(ii) Since no replicates are provided, use the normal probability plot and Lenth's method (which should be available in your computer program) to confirm that only the main effects, A, B, and C, and the two-way interactions, AB and AC, are significant.
(iii) In light of (ii), project the data down appropriately and reanalyze the data. This time note the uncertainty estimates associated with the estimates of the main effects and interactions; note also the various associated R^2 values and comment on the fit of the reduced model to the data.
19.15 Refer to Exercise 19.14 and the accompanying data table. This time create a 2^(5-1) fractional factorial design and pretend that the experimenter had used this fractional factorial design instead of the full one. Extract from the full 32-run data table the results corresponding to the 16 runs indicated in the created 2^(5-1) design.
(i) Determine the design resolution and the alias structure.
(ii) Repeat the entire Exercise 19.14 for only the indicated 16 experimental
results. Compare the results of the analysis with that in Exercise 19.14. Does the experimenter lose anything significant by running only half of the full 2^5 factorial experiments? Discuss what this particular example illustrates regarding the use of fractional factorial designs when the experimental study involves many factors.
19.16 Refer to Exercise 19.14 and the accompanying data table. Create an 8-run, 2^(5-2) fractional factorial design and, again as in Exercise 19.15, pretend that the experimenter has used this design instead of the full one. Extract from the full 32-run data table the results corresponding to the 8 runs indicated in the 2^(5-2) design.
(i) Determine the design resolution and alias structure. Can any two-way interactions be determined independently without confounding? What does this imply in terms of what can be truly estimated from this much reduced set of results?
(ii) Repeat the entire Exercise 19.14. How do the estimates of the main effects compare to the ones obtained from the full data set?
(iii) If the experimenter had only been interested in determining which of the 5 factors has a significant effect on the response, comment on the advantages/disadvantages of the quarter fraction, 8-run design versus the full 32-run experimental design.
19.17 Consider an experiment to determine the effect of two factors, x_1 and x_2, on the response Y. Further, consider that the postulated model is:

    y = β_0 + β_1 x_1 + β_2 x_2 + β_12 x_1 x_2 + ε

Let the minimum allowable values for each factor be represented in coded form as -1 and the maximum allowable values as +1. In this case, the 2^2 factorial design recommends the following settings for the factors:

    X1    X2
    -1    -1
    +1    -1
    -1    +1
    +1    +1
(i) Show that for this 2^2 design, if the model is written in the matrix form:

    y = Xβ + ε

the Fisher information matrix (FIM) is given by

    F_I = XᵀX = [ 4  0  0  0
                  0  4  0  0
                  0  0  4  0
                  0  0  0  4 ]
(ii) If, instead, the four experimental points are chosen at the settings:

    X1    X2
    -α    -α
    +α    -α
    -α    +α
    +α    +α

where 0 < α < 1, show that the determinant of the FIM will be

    |XᵀX| = (4α²)⁴

Because 0 < α < 1, this determinant will be significantly less than the determinant of the FIM for the factorial design. (Since it can be shown that any other selection of four non-orthogonal points in the region bounded by the rectangle -1 < x_1 < 1; -1 < x_2 < 1 in the x_1-x_2 space will lead to even smaller determinants for the resulting FIM, this implies that for the given two-factor experiment, and the postulated model shown above, the 2^2 factorial design maximizes the FIM and hence is the D-Optimal design for this problem.)
APPLICATION PROBLEMS
19.18 The following table of data from Nelson (1989) shows the cold cranking power of five different battery types, quantified as the number of seconds that a particular battery generated its rated amperage without falling below 7.2 volts, at 0°F. The experiment was replicated four times for each battery type.
                      Battery Type
Experiment No      1     2     3     4     5
      1           41    42    27    48    28
      2           43    43    26    45    32
      3           42    46    28    51    37
      4           46    38    27    46    25
When presented in Problem 12.22, the objective then was to identify any suggestion of descriptive (as opposed to inductive) evidence in the data set to support the postulate that some battery types are better than others.
(i) Now, use a one-way classification ANOVA to determine inductively if there is a difference in the cold cranking power of these battery types. What assumptions are required for this to be a valid test? Are these assumptions reasonable?
(ii) From a box plot of the data set, which battery types appear to be different from the others?
19.19 Refer to Problem 19.18 and the data table. Use a computer program to carry out a nonparametric Kruskal-Wallis test. Interpret your result. What conclusion does this result lead to regarding the equality of the cold cranking power of these battery types? Is this conclusion different from the one reached in Problem 19.18?
19.20 The table below, with data from Moore et al. (1972)15, shows the weight gains
recorded for pigs drawn from five different litters.

              Litter Number
    L1    L2    L3    L4    L5
    23    29    38    30    31
    27    25    31    27    33
    26    33    28    28    31
    19    36    35    22    28
    30    32    33    33    30
          28    36    34    24
          30    34    29
          31    32    30
(i) At the α = 0.05 significance level, test the hypothesis that the litter from which
a pig is drawn has no effect on the weight gain. What is the conclusion of this test?
(ii) Had the test been conducted at the α = 0.01 significance level, what conclusions
would you have reached? Examine a box plot of the data and comment on what it
suggests about the hypothesis.
19.21 The table below shows the time in months between occurrences of safety violations for three operators, A, B, and C, working in a toll manufacturing facility.
    A       B       C
    1.31    1.94    0.79
    0.15    3.21    1.22
    3.02    2.91    0.65
    3.17    1.66    3.90
    4.84    1.51    0.18
    0.71    0.30    0.57
    0.70    0.05    7.26
    1.41    1.62    0.43
    2.68    6.75    0.96
    0.68    1.29    3.76
The data are clearly not normally distributed, since the phenomenon in question is
typical of an exponential random variable. Carry out an appropriate test of the hypothesis that there is a difference between the safety performance of
these operators. Justify your test choice and interpret your results adequately.
19.22 The following table, adapted from Gilbert (1973)16, shows the sprinting speeds
(in feet/second) for three types of fast animals, classified by sex (male or female).
Analyze the data appropriately. From your analysis, what can you conclude about
the effect of animal type and sex on speed?

15 P.G. Moore, A.C. Shirley, and D.E. Edwards, (1972). Standard Statistical Calculations,
Pitman, Bath.
16 Gilbert, (1973). Biometrical Interpretation, Clarendon Press, Oxford.
19.23 Consider a 2^3 factorial experiment involving three factors whose low and high
settings (in percent) are as follows:

    Factor    Low    High
    x1        0%     6%
    x2        8%     16%
    x3        0%     3%
Generate a standard (un-replicated) design and analyze the following results for
the response y = grid line width (in coded units) obtained in the series of 8 experiments run in randomized order, but which, for the purpose of this problem, have
been rearranged in the standard order as follows: 6.6, 7.0, 9.7, 9.4, 7.2, 7.6, 10.0, and
9.8. Which effects are significant?
19.24 Box and Bisgaard, (1987)17, present an experimental study of a process for
manufacturing carbon-steel springs. The study was designed to identify process operating conditions that will minimize or possibly eliminate the cracks that have plagued
the manufactured springs. From physics, material science, and process knowledge,
the following factors and levels were chosen for the investigation, along with a 2^3
factorial design with no replication, for a grand total of 8 experiments. The response
of interest, y, is the proportion (in %) of the manufactured batch of springs that do
not crack.
    Factor                         Low     High
    x1: Steel Temperature (°F)     1450    1600
    x2: Carbon Content (%)         0.5     0.7
    x3: Quench Oil Temp. (°F)      70      120
The result of the study is shown in the data table below, presented in standard order
(the actual experiments were run in random order).
(i) Analyze the experimental results to determine which factors and interactions are
significant. State clearly how you were able to determine significance, complete with
standard errors of the estimated significant main effects and interactions.
(ii) Assess the model validity.
(iii) Interpret the results of this analysis in terms of what should be done if the
objective is to increase the percentage of manufactured springs that do not crack.

17 Box, G.E.P. and S. Bisgaard, (1987). The scientific context of quality improvement,
Quality Progress, 22 (6), 54-61.
    Run    x1    x2    x3    y
    1      -1    -1    -1    67
    2      +1    -1    -1    79
    3      -1    +1    -1    61
    4      +1    +1    -1    75
    5      -1    -1    +1    59
    6      +1    -1    +1    90
    7      -1    +1    +1    52
    8      +1    +1    +1    87
19.25 Palamakula, et al. (2004)18, carried out a study to determine a model to use
in optimizing capsule dosage for a highly lipophilic compound, Coenzyme Q10. The
study involved 3 factors, Limonene, Cremophor, and Capmul, in a Box-Behnken
design; the primary response was the cumulative percentage of drug released after
5 minutes (even though the paper reported 4 other responses which served merely
as constraints). The low and high settings for each variable are shown below (the
middle setting is exactly halfway in between).
    Factor               Low    High
    Limonene (mg)        18     81
    Cremophor EL (mg)    7.2    57.6
    Capmul GMO50 (mg)    1.8    12.6
The results of interest (presented in standard order) are shown in the table below.
Generate a design, analyze the results, and decide on a model that contains only
statistically significant parameters. Justify your decision. After you have completed
your analysis, compare your findings with those reported in the paper.
    Run Std        Response
    Order        y        SD
    1            44.4     29.8
    2            6        9.82
    3            3.75     6.5
    4            1.82     1.07
    5            18.2     8.73
    6            57.8     9.69
    7            68.4     1.65
    8            3.95     3.63
    9            58.4     2.56
    10           24.8     5.80
    11           1.60     1.49
    12           12.1     0.84
    13           81.2     9.90
    14           72.1     7.32
    15           82.06    10.2
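One possible computational route for Problem 19.25 is sketched below (assuming
numpy); the coded 3-factor Box-Behnken matrix written out here in its conventional
standard order is an assumption of this sketch, and a full quadratic model is fitted
by least squares:

    import numpy as np

    # Standard-order Box-Behnken design for 3 factors: 12 edge runs + 3 centers
    D = np.array([
        [-1, -1, 0], [1, -1, 0], [-1, 1, 0], [1, 1, 0],
        [-1, 0, -1], [1, 0, -1], [-1, 0, 1], [1, 0, 1],
        [0, -1, -1], [0, 1, -1], [0, -1, 1], [0, 1, 1],
        [0, 0, 0], [0, 0, 0], [0, 0, 0],
    ], dtype=float)
    y = np.array([44.4, 6, 3.75, 1.82, 18.2, 57.8, 68.4, 3.95,
                  58.4, 24.8, 1.60, 12.1, 81.2, 72.1, 82.06])

    x1, x2, x3 = D.T
    X = np.column_stack([np.ones(15), x1, x2, x3,
                         x1*x2, x1*x3, x2*x3, x1**2, x2**2, x3**2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    for name, coef in zip(["const", "x1", "x2", "x3", "x1x2", "x1x3",
                           "x2x3", "x1^2", "x2^2", "x3^2"], b):
        print(f"{name:6s} {coef:8.2f}")

Standard errors and p-values (and hence the reduced model the problem asks for)
would then follow from the usual regression formulas of Chapter 16.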
19.26 The following data set is from an experimental study reported in Garge
(2007)19 on a reactive extrusion process; the design involved the six factors shown in
the table below.

18 Palamakula, A., M.T.H. Nutan and M.A. Khan (2004). Response Surface Methodology for optimization and characterization of limonene-based Coenzyme Q-10 self-nanoemulsified capsule dosage form. AAPS PharmSciTech, 5 (4), Article 66. (Available at
http://www.aapspharmscitech.org/articles/pt0504/pt050466/pt050466.pdf.)
           Melting    Mixing     Inlet          Screw    Feed     Pulse
           Zone Temp  Zone Temp  Composition    Speed    Rate     Composition
    Run    (°C)       (°C)       (% E)          (RPM)    (lb/h)   (% E)
    1      135        135        5              150      25       100
    2      135        135        5              250      25       10
    3      135        135        0              250      15       100
    4      150        150        5              250      25       100
    5      135        150        0              150      25       10
    6      150        150        0              250      15       10
    7      135        150        0              250      25       100
    8      135        135        0              150      15       10
    9      150        150        5              150      25       10
    10     150        135        5              150      15       100
    11     150        135        0              250      25       100
    12     135        150        5              150      15       10
    13     150        135        5              250      15       100
    14     135        150        5              250      15       100
    15     150        135        0              150      25       10
    16     150        150        0              150      15       100
The results of the experiments (run in random order, but presented in standard
order) are shown in the table below. For each response, analyze the data and determine which factors are significant. Comment on the model fit.

19 S. Garge, (2007). Development of an inference-based control scheme for reactive extrusion processes, PhD Dissertation, University of Delaware.
    Run    Melting Energy (J)    Reaction Energy (J)
    1      1700                  1000
    2      1550                  300
    3      1300                  3700
    4      800                   800
    5      800                   100
    6      650                   0
    7      650                   800
    8      650                   0
    9      1100                  0
    10     1000                  600
    11     1650                  650
    12     650                   100
    13     1100                  2100
    14     650                   1400
    15     1300                  0
    16     950                   2150
Chapter 20
Application Case Studies III:
Statistics
20.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 856
20.2 Prussian Army Death-by-Horse kicks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 857
     20.2.1 Background and Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 857
     20.2.2 Parameter Estimation and Model Validation . . . . . . . . . . . . . . . . . . . 858
     20.2.3 Recursive Bayesian Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 860
            Motivation, Background and Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 860
            Theory: The Bayesian MAP Estimate . . . . . . . . . . . . . . . . . . . . . . . . 862
            Application: Recursive Bayesian Estimation Formula . . . . . . . . . . 864
            Application: Recursive Bayesian Estimation Results . . . . . . . . . . . 866
            Final Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 867
20.3 WW II Aerial Bombardment of London . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 868
20.4 US Population Dynamics: 1790-2000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 870
     20.4.1 Background and Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 870
     20.4.2 Truncated Data Modeling and Evaluation . . . . . . . . . . . . . . . . . . . . . 872
     20.4.3 Full Data Set Modeling and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 873
            Future Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 874
     20.4.4 Hypothesis Testing Concerning Average Population Growth Rate 876
20.5 Process Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 878
     20.5.1 Problem Definition and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 879
     20.5.2 Experimental Strategy and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 879
            Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 879
            Design and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 879
     20.5.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 880
            Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 883
            Confirmation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 889
20.6 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 889
     PROJECT ASSIGNMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 890
            1. Effect of Bayesian Prior Distributions on Estimation . . . . . . . . 890
            2. First Principles Population Dynamics Modeling . . . . . . . . . . . . . 890
            3. Experimental Design and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 891
            4. Process Development and Optimization . . . . . . . . . . . . . . . . . . . . 891
for how best to approach the problem; and what works in one case may not
necessarily work in another. But even art has its foundational principles, and
each art form its own peculiar set of tools. Whether it is capturing the detail
in a charcoal-on-paper portrait of a familiar face, the deep perspectives in an
oil-on-canvas painting of a vast landscape, or the rugged three-dimensional
contours of a sculpture, the problem at hand is what ultimately recommends
the tools. Such is the case with the three categories of problems selected for
discussion in this final chapter in the trilogy of case studies. They have been
selected to demonstrate the broad range of applications of the theory and
principles of the past few chapters.

The first problem is actually a pair of distinct but similarly structured
problems, staples of introductory probability and statistics courses that are
frequently used to demonstrate the powers of the Poisson distribution in
capturing the elusive character of rare events. The first, the famous von
Bortkiewicz data set on death-by-horse kicks in the 19th century Prussian
army, involves an after-the-fact analysis of unusual death; the second, an
analysis of the aerial bombardment of London in the middle of World War
II, is marked by a cannot-waste-a-minute urgency of the need to protect
the living. One is pure analysis (to which we have added a bit of a wrinkle);
the other is a brilliant use of analysis and hypothesis testing for practical
life-and-death decision making.

The second problem involves the complete 21 decades of US census data
from 1790 to 2000. By the standards of sheer volume, this is a comparatively
modest data set (especially when compared to the data sets in the first problem,
which are at least an order of magnitude larger). For a data set that could be
analyzed in an almost limitless number of ways, it is interesting, as we show
here, how a simple regression analysis, an examination of the residuals, and
other such investigations can provide glimpses (smudges?) of the fingerprints
left by history on a humble data set consisting of only 22 entries.

The third problem is more prosaic, coming from the no-nonsense world
of industrial manufacturing. It involves process optimization, using strategic
experimental designs and data analysis to meet difficult business objectives.
But what this problem lacks in the macabre drama of the first, or in the rich
history of the second, it tries to make up for in hard-headed practicality.

Which of these three sets of problems is the charcoal-on-paper portrait,
the oil-on-canvas landscape, or the rugged three-dimensional sculpture is left
to the reader's imagination.
20.1 Introduction
Having completed our intended course of discussion on descriptive, inferential, and experimental design aspects of statistics, the primary objective of
this chapter is to present a few real-life problems to demonstrate how the concepts and ideas discussed in the preceding chapters have been (and continue
to be) used to find appropriate solutions to important problems. The following
is a brief catalog of the problems selected for discussion in this chapter, along
with what aspects of statistics they illustrate:
1. The Prussian army death-by-horse kicks problem involves the analysis
of the truly rare events implied in the title. It illustrates probability
modeling, the characterization of a population using the concepts of
sampling and estimation, and illustrates probability model validation.
Because the data set is a 20-year record of events happening year-by-year, we add a wrinkle to this famous problem by investigating what
could have happened had recursive Bayesian estimation been used to
analyze the data, not all at once at the end of the 20-year period, but
year-to-year. The question, what is such a model useful for? is answered
by a complementary problem, similar in structure but different in detail.
2. The Aerial Bombardment of London in World War II problem provides
an answer to the question of practicality raised by the Poisson modeling and analysis of the Prussian army data. This latter problem picks
up where the former one left off, by demonstrating how a Poisson model
and an appropriately framed hypothesis test provided the basis for the
strategic deployment of scarce anti-aircraft resources. In the historical
context of the time, solving this latter problem was anything but an act
of mere intellectual curiosity.
3. The US Population dynamics problem illustrates the power of simple
regression modeling, and how it can be used judiciously to make predictions. It also illustrates how data analysis can be used almost forensically
to find hidden clues in data sets.

4. The Process Optimization problem illustrates the use of design of experiments (especially response surface designs) to find optimum conditions
for achieving manufacturing objectives in an industrial process.
20.2 Prussian Army Death-by-Horse kicks

20.2.1 Background and Data

1 von Bortkiewicz, (1898). Das Gesetz der Kleinen Zahlen, Leipzig, Teubner.
TABLE 20.1: Frequency distribution of Prussian army deaths by horse kicks

    No of Deaths    Number of occurrences of x deaths
    x               per unit-year (Total Frequency)
    0               109
    1               65
    2               22
    3               3
    4               1
    Total           200
    f(x) = θ^x e^{-θ} / x!    (20.1)

where the parameter θ is the mean number of deaths recorded per unit-year.
Observe that the phenomena underlying this problem fit those stated for
the Poisson random variable in Chapter 8: the variable of interest is the total
number of occurrences of a rare event; the events are occurring in a fixed
interval of time (and location); and they are assumed to occur at a uniform
average rate. The primary purpose here is twofold: to characterize this random
variable, X, by determining the defining population parameter (in this case,
the Poisson mean number of deaths per unit-year), and to confirm that the
model is appropriate.

The data shown above is clearly from an observational study. No one
designed the experiment, per se (how could one?); the deaths were simply
recorded each year for each cavalry unit as they occurred. Nevertheless, there
is no evidence to suggest that the horses and their victims came in contact in
any other way than randomly. It is therefore reasonable to consider the data
as a random sample from this population of 19th century cavalry units.
20.2.2 Parameter Estimation and Model Validation
If the data set is considered a random sample, then the maximum likelihood
estimate of the parameter, θ, is the sample average, which is obtained from
the data presented in Table 20.1 as:

    θ̂ = x̄ = (0 × 109 + 1 × 65 + 2 × 22 + 3 × 3 + 4 × 1)/200 = 122/200    (20.2)
           = 0.61                                                         (20.3)

and with sample variance s^2 = 0.611, the standard error of the estimate is

    SE(θ̂) = 0.041    (20.4)
From the point estimate, the predicted relative frequency distribution, f̂(x),
is therefore obtained according to:

    f̂(x) = f(x|θ = 0.61) = 0.61^x e^{-0.61} / x!    (20.6)
we have no evidence to support rejecting the null hypothesis (at the significance level of 0.05) and therefore conclude that the model provides an
adequate fit to the data. Note that the last two frequency groups corresponding to X = 3 and X = 4 had to be combined for the chi-squared test; and
even then, the expected frequency, nf̂ = 4.823, fell just short of the required
5. MINITAB identifies this and prints out a warning. However, this is not
enough to invalidate the test.

A graphical representation of the observed versus predicted frequency,
along with a bar graph of the individual contributions to the chi-squared
statistic from each frequency group, is shown in Fig 20.1.
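The goodness-of-fit computation can be reproduced with a few lines of code; the
sketch below (assuming the scipy library) lumps X ≥ 3 as in the text, and reduces
the degrees of freedom by one for the single estimated parameter:

    import numpy as np
    from scipy import stats

    observed = np.array([109, 65, 22, 4])            # x = 0, 1, 2, >= 3
    theta, n = 0.61, 200
    p = stats.poisson.pmf([0, 1, 2], theta)
    p = np.append(p, 1 - p.sum())                    # lump P(X >= 3)
    expected = n * p                                 # last entry ~ 4.823

    chi2 = ((observed - expected) ** 2 / expected).sum()
    pval = stats.chi2.sf(chi2, df=len(observed) - 1 - 1)
    print(expected.round(3), round(chi2, 3), round(pval, 3))

The large p-value confirms the conclusion stated above: there is no evidence against
the postulated Poisson model.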
20.2.3 Recursive Bayesian Estimation
FIGURE 20.1: Chi-Squared test results for Prussian army death by horse kicks data
and a postulated Poisson model. Top panel: Bar chart of Expected and Observed
frequencies; Bottom Panel: Bar chart of contributions to the Chi-squared statistic.
TABLE 20.3: Year-by-Year, Unit-by-Unit breakdown of Prussian army deaths data

    Year:      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
    Unit 1     1  0  0  0  0  0  1  0  1  0  1  1  1  0  1  1  0  0  2  0
    Unit 2     1  1  1  0  0  1  1  0  0  1  0  0  0  1  0  1  1  0  0  2
    Unit 3     0  0  3  2  2  1  0  1  2  0  1  0  0  0  2  0  2  1  0  0
    Unit 4     0  0  1  0  0  1  2  0  0  3  1  0  0  1  1  2  2  0  1  0
    Unit 5     1  2  0  1  1  0  0  1  1  0  0  2  0  0  0  1  0  0  0  0
    Unit 6     0  0  0  4  0  1  0  0  0  2  0  0  0  0  1  0  1  2  0  1
    Unit 7     0  1  1  0  2  1  1  3  0  1  0  1  0  0  1  2  0  1  0  0
    Unit 8     1  2  0  2  1  0  0  0  1  0  0  1  1  2  1  0  0  0  1  0
    Unit 9     0  0  0  0  1  0  0  0  1  0  0  2  2  1  0  0  1  0  0  1
    Unit 10    0  0  0  1  0  1  1  1  1  0  0  0  2  0  0  1  1  0  1  0
    Total      4  3  6  7 10  7  7  5  8  4  8  5  9  2  5  8  8  6  7  3
Postulating a gamma prior distribution for the unknown parameter θ, i.e.,

    f(θ) = (1/(b^a Γ(a))) θ^{a-1} e^{-θ/b}    (20.8)

we can then obtain the posterior distribution, f(θ|x1, x2, ..., xn), after combining this prior with the sampling distribution for a random sample drawn
from the Poisson distribution. From this posterior distribution, we will use
the maximum à-posteriori (MAP) estimate as our choice for the parameter
estimate.
Now, the sampling distribution for a random sample, X1, X2, ..., Xn, from
the Poisson distribution is given by:

    f(x1, x2, ..., xn|θ) = θ^{Σᵢ xᵢ} e^{-nθ} / (x1! x2! ... xn!)    (20.9)
By Bayes' theorem, using the given prior, the posterior distribution is:

    f(θ|x1, x2, ..., xn) = C [1/(b^a Γ(a) x1! x2! ... xn!)] θ^{(a-1+Σᵢxᵢ)} e^{-θ(n+1/b)}    (20.10)

Differentiating the logarithm of this posterior distribution with respect to θ yields

    d ln f / dθ = (a - 1 + Σᵢxᵢ)/θ - (n + 1/b)    (20.11)

which, upon equating to zero and solving, gives the result:

    θ̂_MAP = (Σⁿᵢ₌₁ xᵢ + (a - 1)) / (n + 1/b)    (20.12)
Of course, depending on the values specified for a and b in the prior distribution, Eq (20.12) will return different estimates for the unknown parameter.
In particular, observe that for a = 1 and b = ∞, the result is precisely the
same as the sample average. This is one of the main criticisms of Bayesian estimation: that the subjectivity inherent in the choice of the prior distribution
introduces bias into the estimate.

However, it can be shown that this bias is traded off for a smaller parameter
estimate variance. Furthermore, when used recursively, whereby the posterior
distribution at the current time is used as the prior distribution in the next
round, the bias introduced at the beginning by the original prior distribution
progressively washes out with each iteration. Thus, in return for the initial
bias, this recursive strategy allows one to carry out estimation sequentially
without having to wait for the full data set to be completely accumulated,
obtaining more reliable estimates along the way.
FIGURE 20.2: Initial prior distribution, a Gamma (2,0.5), used to obtain a Bayesian
estimate for the Poisson mean number of deaths per unit-year parameter.
For the gamma prior shown in Fig 20.2, with a = 2 and b = 0.5, i.e.,

    f(θ) = 4θe^{-2θ}    (20.13)

applying Eq (20.12) after k years of data (n = 10k observations, arriving in
yearly batches of 10, one per unit) gives the MAP estimate after year k as:

    θ̂(k) = (Σ¹⁰ᵏᵢ₌₁ xᵢ + a - 1) / (10k + 1/b)    (20.14)

If the average of the 10 observations recorded in year k + 1 is defined as:

    x̄(k+1) = (1/10) Σ^{10k+10}_{i=10k+1} xᵢ    (20.15)

then, because Eq (20.14) implies

    (10k + 1/b) θ̂(k) = Σ¹⁰ᵏᵢ₌₁ xᵢ + a - 1    (20.18)

and since Σ^{10k+10}_{i=1} xᵢ = Σ¹⁰ᵏᵢ₌₁ xᵢ + 10x̄(k+1), the estimate can be
updated recursively from one year to the next, as shown in Eq (20.21) below.
TABLE 20.4: Recursive (yearly) Bayesian estimates of the mean number of
deaths per unit-year

    After year j    θ̂(j)      After year j    θ̂(j)
    1               0.4167     11              0.6250
    2               0.3636     12              0.6148
    3               0.4375     13              0.6364
    4               0.5000     14              0.6056
    5               0.5962     15              0.5987
    6               0.6129     16              0.6111
    7               0.6250     17              0.6221
    8               0.6098     18              0.6209
    9               0.6304     19              0.6250
    10              0.6078     20              0.6089
    θ̂(k+1) = θ̂(k) + 10[x̄(k+1) - θ̂(k)] / (10k + 10 + 1/b)    (20.21)

(This expression is reminiscent of the expression for the recursive least squares
estimate given in Eq (16.190) in Chapter 16.)
Application: Recursive Bayesian Estimation Results

Beginning with θ̂(1) = 0.4167 obtained above, the recursive expressions in
Eqs (20.14) and (20.21) may now be used along with the supplied data; the result is the
sequence of MAP estimates, θ̂(j), after each jth year, shown in Table 20.4 for
j = 2, 3, ..., 20.
A plot of these recursive estimates is shown in Fig 20.3 as the solid line,
with the dashed line representing the single estimate (0.61) obtained earlier
using the entire 20-year data all at once. Observe now that an argument could
have been made for stopping the study sometime after the 5th or 6th year since,
by then, the recursive estimate has essentially and observably settled down
to within a small tolerance of the final value of 0.61. To be sure that enough
time would have elapsed to ascertain that the value has truly settled down,
the year 7 or 8, or even up to year 10, would all be good recommendations
for the stopping year. The point is that this recursive method would have
provided a stable estimate close to the final one long before the 20 years had
elapsed. While the observed convergence to the maximum likelihood estimate
demonstrates the washing out of the initial prior distribution, it will still
be interesting to determine the final posterior distribution and compare it to
the original prior distribution.
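The entire recursion can be reproduced with a few lines of plain Python; the sketch
below uses the yearly totals from Table 20.3 and the Gamma(2, 0.5) prior (a = 2,
b = 0.5):

    a, b = 2.0, 0.5
    yearly_totals = [4, 3, 6, 7, 10, 7, 7, 5, 8, 4,
                     8, 5, 9, 2, 5, 8, 8, 6, 7, 3]

    theta = (yearly_totals[0] + a - 1) / (10 + 1 / b)   # after year 1: 0.4167
    print(f"year  1: {theta:.4f}")
    for k, total in enumerate(yearly_totals[1:], start=1):
        xbar = total / 10.0                             # year-(k+1) average
        theta += 10 * (xbar - theta) / (10 * k + 10 + 1 / b)
        print(f"year {k + 1:2d}: {theta:.4f}")

Running this recovers the column of estimates in Table 20.4, ending at
θ̂(20) = 0.6089.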
FIGURE 20.3: Recursive Bayesian estimates using yearly data sequentially, compared
with the standard maximum likelihood estimate, 0.61 (dashed line).

It can be shown (and this is left as an exercise to the reader) that the final
posterior distribution is:

    f(θ|x1, x2, ..., xn) = C1 θ^{123} e^{-202θ}    (20.22)

with

    C1 = 202^{124} / Γ(124)    (20.23)

i.e., a Gamma(124, 1/202) distribution.
A plot of this distribution along with the prior Gamma(2, 0.5) distribution is
shown in Fig. 20.4. The key characteristic of this posterior distribution is that
it is very sharply concentrated around its mean value of 0.614, and mode of
0.609, both of which are practically the same as the maximum likelihood value
of 0.61 obtained earlier. This posterior distribution stands in very sharp contrast to the very broad prior distribution, demonstrating how the information
obtained recursively from the accumulating data had reduced the uncertainty
expressed in the initial prior distribution.
Final Remarks

The reader will be forgiven for asking: what is the point of this entire
exercise? After all, the soldiers whose somewhat undignified deaths (undignified, that is, for professional soldiers) have contributed to the data will not
be coming back as a result of this exercise. Furthermore, it is not as if (at
least as far as we know) the original analysis led to a change in the Prussian
army policy that yielded improved protection for the soldiers against such
deaths. A Poisson model fit the data well; the model parameter has been
estimated and the model validated.
FIGURE 20.4: Final posterior distribution (dashed line) along with initial prior distribution (solid line).
20.3 WW II Aerial Bombardment of London
At a critical point in World War II, after London had suffered a devastating
period of sustained aerial bombardment, the British government had to make
some strategic decisions about where to deploy limited anti-aircraft artillery
for maximum effectiveness. How British researchers approached and solved
this problem provides yet another example of how random phenomena are
subject to rigorous analysis, and illustrates how probability and statistics can
be used effectively for solving important real-life problems.
In his 1946 paper2 about this problem, R. D. Clarke examined the distribution
of flying-bomb hits over an area of south London that had been divided into
576 small regions of equal area, recording the number of regions suffering x
hits (Table 20.5). The mean number of hits per region provides the estimate:

    θ̂ = 0.932    (20.24)

and hence the predicted relative frequency distribution:

    f̂(x) = θ̂^x e^{-θ̂} / x!    (20.25)

2 Clarke, R. D. (1946). An application of the Poisson Distribution, J. Inst. Actuaries,
72, 48-52.
This is where the validated model becomes useful, since we can use it to
compute this particular probability as

    P(X ≥ 7 | θ = 0.93) = 0.000054    (20.26)

We have retained all these decimal places to make a point. Cast in the form of
a hypothesis test, this statement declares that, at the 0.05 significance level, for
any region to receive 7 or more hits is either the absolute rarest of rare events
(about one in 20,000 tries), or else the region in fact must have been deliberately
targeted. The anti-aircraft artillery were therefore deployed in this location,
and the rest is history.
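This tail probability is easily verified; a one-line sketch, assuming the scipy
library:

    from scipy import stats
    print(stats.poisson.sf(6, 0.93))   # P(X >= 7 | theta = 0.93), about 5.4e-05

The survival function at 6 gives P(X ≥ 7), recovering the value in Eq (20.26).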
20.4 US Population Dynamics: 1790-2000

20.4.1 Background and Data
TABLE 20.6: US Population (to the nearest million) from 1790-2000

    Census Year    Population (millions)
    1790           4
    1800           5
    1810           7
    1820           10
    1830           13
    1840           17
    1850           23
    1860           31
    1870           40
    1880           50
    1890           63
    1900           76
    1910           92
    1920           106
    1930           123
    1940           132
    1950           151
    1960           179
    1970           203
    1980           227
    1990           249
    2000           281
20.4.2 Truncated Data Modeling and Evaluation

If the census years are first transformed to the coded variable

    x = (Year - 1780)/10    (20.28)

which essentially assigns the natural numbers 1, 2, etc., to the census years, so
that the first year, 1790, is 1; the second, 1800, is 2; etc., a regression analysis
of the truncated data (only up to x = 19, or 1970) in MINITAB produces the
following results. The regression model itself is:

    ŷt = 6.14 - 1.86x + 0.633x^2    (20.29)
20.4.3 Full Data Set Modeling and Evaluation

Repeating the entire exercise, this time using the full census data set,
produces the following result. This time, the model is:

    ŷ = 7.92 - 2.47x + 0.668x^2    (20.30)

where we note that the model coefficients have not changed too drastically.
The rest of the MINITAB output is shown below.
Regression Analysis: Population versus Xn, Xn2

The regression equation is
Population = 7.92 - 2.47 Xn + 0.668 Xn2

Predictor    Coef       SE Coef    T        P
Constant     7.916      2.101      3.77     0.001
Xn           -2.4735    0.4209     -5.88    0.000
Xn2          0.66763    0.01777    37.57    0.000

S = 2.99130   R-Sq = 99.9%   R-Sq(adj) = 99.9%
Once again, the parameter estimates are seen to be significant; and the R^2
and R^2(adj) values have even improved slightly. The ANOVA table (not shown)
does not show anything out of the ordinary. A plot of the data, the regression
model fit, along with both the 95% confidence interval and the 95% prediction
interval, is shown in Fig 20.5.

This figure seems to show that the fit is particularly good, with very little
uncertainty around the model prediction, as implied by the very tight confidence and prediction intervals. However, MINITAB flagged two residuals as
unusually large: those for the 1940 and 1950 census years (see Fig 20.6).
FIGURE 20.5: Quadratic regression model fit to US Population data, along with both
the 95% confidence interval and the 95% prediction interval.
FIGURE 20.6: Standardized residual plots for the full-data-set regression model:
residuals versus observation order (with the 1940 and 1950 observations flagged), and
the normal probability plot of the residuals.
Setting x = 23, the value corresponding to the 2010 census year, in Eq (20.30)
yields:

    ŷ(2010) = 7.92 - 2.47(23) + 0.668(23)^2 ≈ 304    (20.31)

a point estimate of 304 million, along with the indicated 95% prediction interval. We personally believe that this probably underestimates what the true
2010 census result will be. The potential unaccounted-for phenomena that are
likely to affect this prediction include, but are not limited to:

- The increasingly complex immigration patterns over the past 20 years;
- The changing mean rate of reproduction among recent immigrants and
  among long-term citizens;
- The influence of medical advances on life expectancy.

The reader is encouraged to think of any other potential factors that may
contribute to rendering the prediction inaccurate.
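The quadratic fit and the 2010 extrapolation can be reproduced as follows (a
minimal sketch, assuming numpy):

    import numpy as np

    pop = np.array([4, 5, 7, 10, 13, 17, 23, 31, 40, 50, 63,
                    76, 92, 106, 123, 132, 151, 179, 203, 227, 249, 281])
    x = np.arange(1, 23)                    # coded census years: 1790 -> 1, ...

    b2, b1, b0 = np.polyfit(x, pop, deg=2)  # quadratic fit, Eq (20.30)
    print(b0, b1, b2)                       # approx. 7.92, -2.47, 0.668
    print(b0 + b1 * 23 + b2 * 23**2)        # 2010 prediction: approx. 304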
20.4.4 Hypothesis Testing Concerning Average Population Growth Rate

If P(Y) denotes the population recorded in census year Y, the percent average
relative growth rate over the preceding decade is:

    [P(Y) - P(Y - 10)] / P(Y - 10) × 100%    (20.32)
TABLE 20.7: Percent average relative population growth rate for each census
year from 1800-2000, divided into three 70-year periods

    Census Year    Average Rel. GrowthRate (%)    Period
    1800           25.0000                        1
    1810           40.0000                        1
    1820           42.8571                        1
    1830           30.0000                        1
    1840           30.7692                        1
    1850           35.2941                        1
    1860           34.7826                        1
    1870           29.0323                        2
    1880           25.0000                        2
    1890           26.0000                        2
    1900           20.6349                        2
    1910           21.0526                        2
    1920           15.2174                        2
    1930           16.0377                        2
    1940           7.3171                         3
    1950           14.3939                        3
    1960           18.5430                        3
    1970           13.4078                        3
    1980           11.8227                        3
    1990           9.6916                         3
    2000           12.8514                        3
FIGURE 20.7: Percent average relative population growth rate in the US for each
census year from 1800-2000, divided into three equal 70-year periods. Period 1:
1800-1860; Period 2: 1870-1930; Period 3: 1940-2000.
FIGURE 20.8: Normal probability plot for the residuals from the ANOVA model for
percent average relative population growth rate versus period, with Period 1: 1800-1860;
Period 2: 1870-1930; Period 3: 1940-2000.
perhaps driven by the baby boom. The intermediate year 1945 (during WW
II) and the census year 1960 are marked on the graph for reference.

A formal one-way ANOVA test of equality of the average relative growth
rates for these 3 periods shows the following not-too-surprising result:
Results for: USPOPULATION.MTW

One-way ANOVA: GrowthRate versus Period

Source    DF    SS        MS       F        P
Period    2     1631.9    816.0    32.00    0.000
Error     18    458.9     25.5
Total     20    2090.9
20.5 Process Optimization

20.5.1 Problem Definition and Background
This final problem involves improving the overall performance of a two-stage
commercial coating process.3 The primary material used in the coating
process is produced in the first stage, where manufacturing yield (measured
in %) is the key response variable of interest. In the second stage, other additives are compounded with the primary material and the coating process
completed; the primary response variable is adhesion, measured in grams.

To meet customer specifications and remain profitable requires yields of
91% or greater and adhesion greater than 45 grams. However, the process
consistently failed to meet these objectives, and an experimental program
was launched with the aim of finding process variable settings at which the
specified objectives would be met.
20.5.2 Experimental Strategy and Results

Planning

With the response variables identified as y1, Yield (%), and y2, Adhesion
(grams), a thorough consideration of all aspects of the process led to the
following list of seven potential factors, i.e., independent process variables that
could potentially affect yield and adhesion:

1. Amount of additive
2. Raw material supplier
3. Reactor configuration
4. Reactor level
5. Reactor pressure
6. Reactor temperature
7. Reaction time
As a result, the following overall strategy was devised: first, a set of screening
experiments will be performed to determine which of these seven variables are
important factors; next, a set of optimization studies will be carried out to
determine the best settings for the important factors; and finally, the optimum
setting will be verified in a set of confirmation experiments.

These considerations led to the choice of a 2^{7-3}_IV fractional factorial design
for the screening, followed by a response surface design for optimization.

3 Adapted from an example used in an E.I. duPont de Nemours and Company Central
Research & Development training course; the original source is unknown.
TABLE 20.8: Response surface design and experimental results for the coating
process study

    Additive    Time    Temperature    Yield    Adhesion
    0           20      100            68       3
    70          20      100            51       40
    35          40      100            75       31
    0           60      100            81       10
    70          60      100            65       48
    35          20      140            80       38
    0           40      140            68       24
    35          40      140            82       37
    35          40      140            87       41
    35          40      140            87       40
    35          40      140            82       40
    35          40      140            85       42
    35          40      140            85       42
    70          40      140            75       44
    35          60      140            92       41
    0           20      180            40       37
    70          20      180            75       31
    35          40      180            77       44
    0           60      180            50       40
    70          60      180            90       39
20.5.3 Analysis

With the design and data in a MINITAB worksheet, the data analysis is carried out with the sequence: Stat > DOE > Response Surface >
Analyze Response Surface Design, which opens a self-explanatory dialog box. Upon selecting the appropriate options, the following results are obtained, first for Yield:
Response Surface Regression: Yield versus Additive, Time, Temperature

The analysis was done using coded units.

Estimated Regression Coefficients for Yield

Term                       Coef        SE Coef    T          P
Constant                   84.5455     0.6964     121.403    0.000
Additive                   4.9000      0.6406     7.649      0.000
Time                       6.4000      0.6406     9.991      0.000
Temperature                -0.8000     0.6406     -1.249     0.240
Additive*Additive          -12.8636    1.2216     -10.530    0.000
Time*Time                  1.6364      1.2216     1.340      0.210
Temperature*Temperature    -8.3636     1.2216     -6.847     0.000
Additive*Time              0.7500      0.7162     1.047      0.320
Additive*Temperature       13.5000     0.7162     18.849     0.000
Time*Temperature           -0.2500     0.7162     -0.349     0.734

S = 2.02574   PRESS = 184.619
R-Sq = 98.92%   R-Sq(pred) = 95.13%   R-Sq(adj) = 97.94%
Estimated Regression Coefficients for Yield using data in uncoded units

Term                       Coef
Constant                   7.87273
Additive                   -0.517792
Time                       -0.00102273
Temperature                1.11864
Additive*Additive          -0.0105009
Time*Time                  0.00409091
Temperature*Temperature    -0.00522727
Additive*Time              0.00107143
Additive*Temperature       0.00964286
Time*Temperature           -3.12500E-04
(The ANOVA table, not shown, displays the typical breakdown of the
sources of variability and establishes that the composite linear, square, and
interaction terms are all significant.)
The corresponding results for Adhesion are as follows:

Response Surface Regression: Adhesion versus Additive, Time, Temperature

The analysis was done using coded units.

Estimated Regression Coefficients for Adhesion

Term                       P
Constant                   0.000
Additive                   0.000
Time                       0.000
Temperature                0.000
Additive*Additive          0.000
Time*Time                  0.577
Temperature*Temperature    0.030
Additive*Time              0.241
Additive*Temperature       0.000
Time*Temperature           0.425
In terms of coded variables, the fitted model for Yield is:

    ŷ1 = 84.55 + 4.90x1 + 6.40x2 - 0.80x3 - 12.86x1^2 + 1.64x2^2 - 8.36x3^2
         + 0.75x1x2 + 13.50x1x3 - 0.25x2x3    (20.33)

with a corresponding coded-variable model, Eq (20.34), for Adhesion. The
coded variables themselves are defined as:

    x1 = (Additive - 35)/35        (20.35)
    x2 = (Time - 40)/20            (20.36)
    x3 = (Temperature - 140)/40    (20.37)
FIGURE 20.9: Standardized residual plots for the Yield response surface model: versus
fitted value, and normal probability plot.
FIGURE 20.10: Standardized residual plots for the Adhesion response surface model:
versus fitted value, and normal probability plot.
FIGURE 20.11: Response surface and contour plots for Yield as a function of
Additive and Temperature (with Time held at 60.00).
FIGURE 20.12: Response surface and contour plots for Adhesion as a function of
Additive and Temperature (with Time held at 60.00).
FIGURE 20.13: Overlaid contours for Yield and Adhesion showing the feasible region
for the desired optimum. The planted flag indicates the optimum values of the responses
(Yield = 93.3601; Adhesion = 45.5183) along with the corresponding settings of the
factors Additive (48.3926) and Temperature (150.963), with Time held at 60.00, that
achieve this optimum.
optimum might lie. Observe that while the yield response shows the existence
of a maximum, the adhesion response shows a saddle point.

At this point, several options are available for determining the optimum
settings for these factors: the calculus method, by taking derivatives in Eqs
(20.33) and (20.34) (after setting x2 to its maximum value of 1) and solving
simultaneously for x1 and x3, subject to the constraints in the objectives; or
by using various graphically based options available in MINITAB. Since two
different responses are involved, to take advantage of the MINITAB options,
it is necessary to overlay the two contours for Yield and Adhesion to see the
region where the objectives are met simultaneously. The MINITAB contour
overlay option, when given the desired values of yield, 91 < y1 < 100, and
desired values for adhesion, 45 < y2 < 80 (the value of 80 is simply a high
enough upper limit), produces the overlaid contour plots shown in Fig 20.13; it
indicates the feasible region as the intersection of the two contours. MINITAB
has a built-in response optimizer that can be used to find the optimum; it also
has a plant-the-flag option that allows the user to roam over the feasible
region in the overlaid contour plot with the computer mouse, while the values
of the responses at the particular location visited literally scroll by on the
screen. This is another option for finding the optimum.
While the reader is encouraged to explore all the other options, what we
show here in Fig 20.13 is the MINITAB flag planted at the optimum values
found by this exploratory plant-the-flag option. The optimum responses are:

    y1 = 93.36%; y2 = 45.52    (20.40)

achieved at the factor settings:

    Additive = 48.39; Temperature = 150.96 (with Time held at 60)    (20.41)
Confirmation

A final confirmation set of 5 experimental runs, consisting of four 2^2 factorial
experiments run in a small region around these optimum settings (46 < Additive <
50; 140 < Temperature < 160, with Time set at 60 mins), plus one run at the
prescribed optimum itself, resulted in products all with acceptable yield and
adhesion. (Results not shown.)
20.6 Summary and Conclusions

The basic premise of Chapters 12-19 is that whenever variability and uncertainty are intrinsic to a problem, statistics, building on probability, provides
a consistent set of tools for handling such problems systematically. However,
using simple, tailor-made textbook examples to demonstrate how various statistical concepts (sampling, estimation, hypothesis testing, regression,
experimental design and analysis) are applied is one thing; solving real-life
problems with these statistical techniques is another. Real-life problems are
never ideal; also, solving them typically requires choosing the appropriate
combination of these tools and using them appropriately. This chapter therefore has been devoted to using three classes of real-life problems as a capstone
demonstration of statistics in practice. The first category of problems required
estimating population parameters from samples, carrying out hypothesis tests
about these populations, and using the thus-validated population models to
solve non-trivial problems. With the second problem we demonstrated the
power of simple regression modeling, and a forensic application of hypothesis testing to detect hidden structure in the data; but the problem also served
to illustrate that there is significant latitude in carrying out data analysis,
since the problem could have been handled several different ways (see Project
Assignment 2).
Ironically, the final problem, on the application of design of experiments
to industrial process optimization, undoubtedly the most practical of the
collection, is the one whose structure is closest to textbook form: a fractional
factorial design to identify significant factors, followed by a response surface
design to develop a quadratic response model used for optimization, capped
off by factorial confirmation experiments. But the sense of how long it would
have taken to solve this problem (if at all) without the systematic experimental
design strategy discussed, and how much money (and effort) was saved as a
consequence, impossible to convey adequately in the presentation, are well-appreciated by practitioners of the art.
Taken together then, these case studies illustrate the many faces of real-life
problems to which statistics has been successfully applied. The project
assignments below are offered as a way to broaden the reader's perspective of
the themes illustrated in this chapter.

This chapter joins Chapters 7 and 11 to complete this book's trilogy of
chapter-length case studies; it also concludes Part IV. The remainder of the
book, Part V, is devoted to a hand-selected trio of special topics, each with
roots in probability and statistics, but all of which have since evolved into
legitimate subject matters in their own rights. These topics are therefore applications of probability and statistics, but in a much grander sense.
PROJECT ASSIGNMENTS

1. Effect of Bayesian Prior Distributions on Estimation

Just how much of an effect does the prior distribution used for recursive
Bayesian estimation have on the results? Choose two different gamma pdfs
as prior distributions for the unknown Poisson parameter, θ, and repeat the
recursive Bayesian estimation portion of the Prussian army data analysis case
study, using the data in Table 20.3. For each prior distribution,

- Obtain year-by-year estimates; compare them to the results presented
  in Table 20.4; plot them as in Fig 20.3 with the maximum likelihood
  estimate.
- Obtain an explicit expression for the final posterior distribution; plot
  the prior and the final posterior distributions as in Fig 20.4.

Write a report on your analysis, discussing the effects of the prior distributions
you chose on the recursive parameter estimation process.
2. First Principles Population Dynamics Modeling

Consult references on the mathematical modeling of biological populations
(e.g., Brauer and Castillo-Chavez, (2001)4), and use these concepts to develop
an alternative model to represent the US Population data in Table 20.6. The
following two things are required:

- Use only data up to and including 1970 to develop the model; validate
  the model by using it to predict the 1980, 1990 and 2000 census results.
- Use your validated model to predict the 2010 census.

4 Brauer, F. and C. Castillo-Chavez (2001). Mathematical Models in Population Biology
and Epidemiology, Springer-Verlag, NY.
3. Experimental Design and Analysis

[Figure: a paper prototype showing the design variables: rotor radius Rr and
rotor width Rw; body width Bw and body length Bl; tail length Tl and tail
width Tw; along with tape strips and a paper clip.]
- Conduct a series of experiments that will ultimately lead to a mathematical model and an optimal design.
- Predict the maximum flight time and perform experiments to confirm
  this prediction.

Write a report summarizing your design at the prototype stage, and the
analysis leading to the optimum design. Discuss your analysis methods and
show the final design, the results of the model predictions, and the confirmation of your predictions.
Part V

Applications

Dealing with Random Variability in Practice
Chapter 21
Reliability and Life Testing
21.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 900
21.2 System Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 901
     21.2.1 Simple Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 901
            Series Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 901
            Parallel Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 903
            Components and Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 905
     21.2.2 Complex Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 906
            Series-Parallel Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 906
            k-of-n Parallel System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 907
            Systems with Cross-links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 909
21.3 System Lifetime and Failure-Time Distributions . . . . . . . . . . . . . . . . . . . . . . . 910
     21.3.1 Characterizing Time-to-Failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 911
            The Survival Function, S(t) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 911
            The Hazard Function, h(t) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 911
     21.3.2 Probability Models for Distribution of Failure Times . . . . . . . . . . . 913
21.4 The Exponential Reliability Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 914
     21.4.1 Component Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 914
     21.4.2 Series Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 915
     21.4.3 Parallel Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 916
     21.4.4 m-of-n Parallel Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 917
21.5 The Weibull Reliability Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 917
21.6 Life Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 918
     21.6.1 The Exponential Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 919
            Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 920
            Precision of Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 920
            Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 920
     21.6.2 The Weibull Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 922
21.7 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 922
     REVIEW QUESTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 923
     EXERCISES AND APPLICATION PROBLEMS . . . . . . . . . . . . . . . 925
The aphorism "Nothing lasts forever" (or any of its sundry variations) has
long served philosophers, ancient and modern, as a concise way to convey the
transient and ephemeral nature of the world around us. Specifically for material
things, however, the issue is never so much about not lasting forever,
which is certain; it is more about how long they will last, which is uncertain.
These manufactured and engineered products consist of components with finite
functioning lifetimes, and the failure of any of the individual constituent
components can affect the performance of the whole system.
21.1 Introduction
Engineered and natural systems (chemical processes, mechanical equipment,
electrical devices, even the human body) consist of individual
units connected in a logical fashion for achieving specific overall system goals.
A basic principle underlying such systems is that how the overall system
performs depends on the individual components' performance and on how the
components are connected. Plant equipment does fail, as do mechanical and
electrical systems; automobiles break down; and human beings fall sick and
eventually die. While the issues of safety and the consequences of system failure typically dominate most discussions about system performance, of equal
importance is the issue of how reliable these various systems and their constituent components are. For how long will the car start every morning? How
long can the entire refinery operate before we need to shut it down for maintenance? How long will the new dishwasher last? These are all issues of reliability,
a concept deeply influenced by the variability and uncertainty surrounding such
questions. The subject matter of reliability therefore relies heavily on probability theory and statistics. The following is one version of a formal definition:
Reliability is the probability that a system or component will perform its
required functions satisfactorily, under specified conditions, for a specified
period of time.

All the qualifiers italicized in this definition are important. For example, what
is satisfactory for laboratory use may be inadequate for the harsh commercial
environment. Accordingly, if T is the time until the entity in question fails, its
reliability function is defined as:

    R(t) = P(T > t)    (21.1)
21.2 System Reliability

21.2.1 Simple Systems
[FIGURE 21.1: Reliability block diagrams. Top panel: a series configuration of
components C1, C2, ..., Cn-1, Cn; bottom panel: a parallel configuration of the
same components. An additional block diagram shows a more complex series-parallel
arrangement of components C1 through C6.]
Series Systems

Consider the system depicted in the top panel of Fig 21.1, where the components are connected in series and the reliability of component Ci is Ri. If
each component operates mutually independently of the others, by which we
mean that the performance of one component has no effect on the performance
of any other component, then:

1. If any component fails, the entire system fails; consequently,
2. The system reliability is obtained as

    Rs = ∏ⁿᵢ₌₁ Ri    (21.2)
       = R1 R2 ... Rn    (21.3)

Example 21.1: RELIABILITY OF A SERIES SYSTEM
Consider a system consisting of a series arrangement of four identical
components, each with reliability 0.98; the system reliability is:

    Rs4 = 0.98^4 = 0.922    (21.4)

If two more identical components are added in series, the system reliability becomes

    Rs6 = 0.98^6 = 0.922 × 0.98^2 = 0.886    (21.5)

The system with the higher number of components in series is therefore
seen to be much less reliable.
Parallel Systems

Now consider the system depicted in the bottom panel of Fig 21.1, where the
components are arranged in parallel, and again, the reliability of component
Ci is Ri. Observe that in this case, if one component fails, the entire system
does not necessarily fail. In the simplest case, the system fails when all n
components fail. In the special k-of-n case, the system will function if at
least k of the n components function. Let us consider the simpler case first.

In the case when the system fails only if all n components fail, then Rs
is the probability that at least one component functions, which is equivalent to 1 - P(no component functions). Now, let Fi be the unreliability of
component i, the probability that the component does not function; then by
definition,

    Fi = 1 - Ri    (21.6)

If Fs is the system unreliability, i.e., the probability that no component in
the system functions, then by independence,

    Fs = ∏ⁿᵢ₌₁ Fi    (21.7)

so that the system reliability, Rs = 1 - Fs, is given by:

    Rs = 1 - ∏ⁿᵢ₌₁ (1 - Ri)    (21.8)

For parallel systems, therefore, we have the product law of unreliabilities expressed in Eq (21.7), from which Eq (21.8) follows. Specific cases of Eq (21.8)
can be informative, as the next example illustrates.
Example 21.2: RELIABILITY OF 2-COMPONENT PARALLEL SYSTEM
Obtain an explicit expression for the reliability of a system consisting of
a parallel arrangement of two components, C1 and C2, with respective
reliabilities, R1 and R2. Explain in words what the expression for the
system reliability means in terms of the status of each component.

Solution:
From Eq (21.8), we have, for the two-component system,

    Rs = 1 - (1 - R1)(1 - R2) = R1 + R2 - R1R2    (21.9)

This expression may be rearranged into two equivalent forms:

    Rs = R1 + R2(1 - R1) = R1 + R2F1    (21.10)
    Rs = R2 + R1(1 - R2) = R2 + R1F2    (21.11)

Eq (21.10) states that the system functions if (a) C1 functions regardless
of the status of component C2 (with a probability R1), or (b) C2 functions
when C1 fails, with probability R2F1. Eq
(21.11) expresses the mirror-image circumstance. Thus, these two equivalent expressions show how, in this parallel arrangement, one component
serves as a backup for the other.
For a parallel arrangement of two identical components, each with reliability
0.98, the system reliability is:

    Rs = 1 - (0.02)^2 = 0.9996    (21.12)

When two more components are added in parallel, the system reliability
becomes

    Rs = 1 - (0.02)^4 = 0.99999984    (21.13)

where it is necessary to retain so many decimal places to see that the
system does not quite possess absolutely perfect reliability, but it is very
close.

Thus, we see that by adding more components in parallel, the system
reliability is improved substantially, a reverse of the case with the series
arrangement.

Again, from Eq (21.8) and the laws of multiplication, the order in which the
components are arranged in the parallel configuration is immaterial to the
value of Rs.
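The two product laws translate directly into code; a minimal sketch in plain
Python:

    from math import prod

    def series_reliability(R):
        return prod(R)                          # product law of reliabilities

    def parallel_reliability(R):
        return 1 - prod(1 - r for r in R)       # product law of unreliabilities

    R = [0.98] * 4
    print(series_reliability(R))                # 0.922, as in Eq (21.4)
    print(parallel_reliability([0.98, 0.98]))   # 0.9996, as in Eq (21.12)

Such helper functions also make it easy to explore how quickly series reliability
degrades, and parallel reliability improves, as components are added.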
Components and Modules

If a box drawn around a set of component blocks in the system's representation has a single line going into the box, and a single line coming out of
it, the collection of components is called a module. For example, the entire
collection of components in the series arrangement in Fig 21.1 constitutes a
module. Of course, several smaller modules can be created from this larger
one by drawing the module box to contain fewer components. A single component is the smallest (simplest) module. Observe that the entire collection of
components making up any system is itself a module.
21.2.2 Complex Systems

Series-Parallel Configurations

For the complex configuration shown in Fig 21.1, in which components C3,
C4 and C5 constitute a parallel module that is in series with components C1,
C2 and C6, the reliability of the parallel module is

    R345 = 1 - (1 - R3)(1 - R4)(1 - R5)    (21.14)

so that the system reliability is:

    Rs = R1 R2 R345 R6    (21.15)
       = R1 R2 R6 [1 - (1 - R3)(1 - R4)(1 - R5)]    (21.16)
[Figure: Block diagrams of a density measurement system consisting of a sampling
pump (R1 = 0.9997), a solenoid valve (R2 = 0.9989), and a densitometer
(R3 = 0.9996), first arranged in series, and then with a second, redundant solenoid
valve in parallel with the first.]

For the all-series arrangement, the system reliability is

    Rs = 0.9997 × 0.9989 × 0.9996 = 0.9982    (21.17)

while with the redundant solenoid valve, it improves to

    Rs = 0.9997 × [1 - (1 - 0.9989)^2] × 0.9996 = 0.9993    (21.18)
k-of-n Parallel Systems

When the n parallel components are identical, each with reliability R, the
probability that exactly k of them function is given by the binomial probability:

    f(k) = [n! / (k!(n - k)!)] R^k (1 - R)^{n-k}    (21.19)

so that the reliability of a system requiring at least k of the n components
to function is:

    Rs = Σⁿᵢ₌ₖ [n! / (i!(n - i)!)] R^i (1 - R)^{n-i}    (21.20)

Again, note that because all components Ci are identical, with identical reliabilities, which of the n belongs to the functioning group of k is immaterial.
In the case where

1. The reliabilities might be different, or
2. Specific components Ci: i = 1, 2, ..., (k - 1), with corresponding reliabilities, Ri, are required to function, along with at least one of the
   remaining (n - k + 1) components,

then, for such a k-out-of-n system, the system reliability is:

    Rs = [∏ᵏ⁻¹ᵢ₌₁ Ri] [1 - ∏ⁿᵢ₌ₖ (1 - Ri)]    (21.21)
[Figure: A process flow diagram showing Valve 1 (Air-to-Open), Valve 2
(Air-to-Open), Pump 2, and components C1 through C4.]
    Rs = 0.9940149    (21.22)

just a hair over 99.4%, the reliability of the module of 6 required high-end
workstations. Thus, the combination of the extra high-end workstation plus the 3 back-up low-end ones has the net effect of essentially
preserving the reliability of the mandatory module of 6.
21.3 System Lifetime and Failure-Time Distributions

21.3.1 Characterizing Time-to-Failure
From the definition in Eq (21.1), we know that system or component reliability has to do with the probability of the entity in question remaining in
service beyond a given time, t. The so-called system lifetime (or component lifetime) is therefore a random variable, T, having a pdf, f(t), sometimes
known as the time-to-failure (or failure-time) distribution.

We now recall the discussion in Section 4.5 of Chapter 4 and note that
even though, as a pdf, f(t) can be studied in its own right, there are other
even more relevant ways of characterizing the random variable, T, in addition
to what f(t) provides. These functions were introduced in Chapter 4, but are
re-visited here in their more natural setting.
The Survival Function, S(t)

The survival function was defined in Chapter 4 as

    S(t) = P(T > t) = 1 - F(t)    (21.31)

But in the specific case where f(t) is a failure-time distribution, the cdf, F(t),
translates to the probability that a component or system fails before T = t,
making F(t) the complement of the reliability function (as already implied, of
course, by Eq (21.31)). Thus, in lifetime studies, the cdf, F(t), is also the same
as the system unreliability, something we had alluded to earlier in dealing with
the parallel system configuration (see Eq (21.6)), but which was presented
simply as a definition then.
The Hazard Function, h(t)

This function, defined as:

    h(t) = f(t)/R(t) = f(t)/[1 - F(t)]    (21.33)

is the instantaneous failure rate, or simply, the failure rate. Recall from Chapter
4 that h(t)dt is the probability of failure in the interval (t, t + dt), given survival
until time t, in precisely the same way that f(x)dx is the probability of a
continuous random variable, X, taking on values in the interval (x, x + dx).

The relationship between the hazard function and several other functions
is of some importance in the study of component and system lifetimes. First,
it is related to the reliability function as follows: from the definition of R(t)
as 1 - F(t), taking first derivatives yields

    R'(t) = dR(t)/dt = -f(t)    (21.34)

so that, from Eq (21.33),

    h(t) = -R'(t)/R(t) = -(d/dt)[ln R(t)]    (21.35)

from which it follows that

    ln R(t) = -∫₀ᵗ h(τ)dτ    (21.36)

or

    R(t) = exp[-∫₀ᵗ h(τ)dτ]    (21.37)
The typical hazard function (or equivalently, failure rate) curve is shown
in Fig 21.6. This is the so-called bathtub curve for representing the failure
characteristics of many realistic systems, including human mortality. Before
discussing the characteristics of this curve, it is important, first, to clear up a
popular misconception. The rate in failure rate is not with respect to time;
rather, it is the proportion (or percentage) of the components surviving until
time t that are expected to fail in the infinitesimal interval (t, t + Δt). This
rate is comparable to the rate in interest rate in finance, which refers not
to time, but to the proportion of principal borrowed.

The failure rate curve is characterized by 3 distinct parts:

1. Initial Period: t ≤ t0, characterized by a relatively high failure rate
   that decreases as a function of time. This is the early failure period
   where inferior items in the population fail quickly. This is also known
   as the infant mortality period, the analogous characteristic in human
   populations.

2. Normal Period: t0 ≤ t ≤ t1, characterized by a constant failure rate.
   This is the period of useful life of many products where failure is due to
   purely random chance, not systematic problems.

3. Final Period: t ≥ t1, characterized by an increasing failure rate primarily
   attributable to wear-out (the human population analog is old-age
   mortality).
FIGURE 21.6: Typical failure rate (hazard function) curve showing the classic three
distinct characteristic periods in the lifetime distributions of a population of items
In light of such characteristics, manufacturers often improve product reliability by (i) putting their batch of manufactured products through an initial "burn-in" period of pre-release operation until time t₀, to weed out the inferior items, and (ii) where possible, replacing (or at least recommending replacement of) components at t₁, to avoid failure due to wear-out. For example, this is the rationale behind the 90,000-mile timing belt replacement recommendation for some automobiles.
21.3.2

During the normal period of constant failure rate, h(t) = λ, Eq (21.37) yields the exponential pdf as the failure-time distribution:

f(t) = λe^{−λt}    (21.38)

with the corresponding reliability function, from Eq (21.36):

R(t) = e^{−λt}    (21.39)
This reliability function is valid in the normal period of the product lifetime,
the middle section of the failure rate curve.
During the initial and final periods of product lifetimes, the failure rates are not constant: decreasing in one and increasing in the other. A more appropriate failure-rate function is

h(t) = βλ(λt)^{β−1};  t > 0    (21.40)

a very general failure rate function: when β < 1 it represents a failure rate that decreases with time, the so-called decreasing failure rate (DFR) model; β > 1 represents an increasing failure rate (IFR) model; and when β = 1, the failure rate is constant at λ. This expression therefore covers all three periods of Fig 21.6.
The corresponding pdf, f(t), for this general hazard function is obtained from Eq (21.37) as

f(t) = βλ(λt)^{β−1} e^{−(λt)^β}    (21.41)

which is the Weibull pdf, with the corresponding reliability function:

R(t) = e^{−(λt)^β}    (21.42)
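Since Eqs (21.36)-(21.42) are the computational workhorses of what follows, a quick numerical sanity check may be helpful. The following Python sketch (with illustrative parameter values, not from the text) integrates the Weibull hazard function numerically and confirms that the exponential of the negative cumulative hazard reproduces the closed-form reliability function:

import numpy as np

# A minimal numerical check of Eqs (21.36) and (21.40)-(21.42) for a
# hypothetical Weibull component; beta and lam are illustrative only.
beta, lam = 2.0, 0.5          # IFR case (beta > 1)

t = np.linspace(1e-6, 10.0, 20001)
h = beta * lam * (lam * t) ** (beta - 1)          # Eq (21.40)
R_closed = np.exp(-(lam * t) ** beta)             # Eq (21.42)

# cumulative trapezoidal integral of h from 0 to t
H = np.concatenate(([0.0], np.cumsum(0.5 * (h[1:] + h[:-1]) * np.diff(t))))
R_numeric = np.exp(-H)                            # Eq (21.36)

print("max |R_numeric - R_closed| =", np.abs(R_numeric - R_closed).max())
# For beta = 1 this collapses to the exponential model, R(t) = exp(-lam*t),
# consistent with Eqs (21.38) and (21.39).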
21.4
21.4.1
The probability that the same item will remain in service for at least 5,000 hours is

Rᵢ(5000) = e^{−(0.02/1000)×5000} = e^{−0.1} = 0.905    (21.46)
and the mean time between failures is

θ = 1/λ = 1000/0.02 = 5 × 10⁴ hrs    (21.47)

which is constant.
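These two computations are easy to reproduce; the following Python fragment (using the failure rate assumed in the computation above, λ = 0.02 per 1000 hours) confirms both numbers:

import math

lam = 0.02 / 1000.0            # failures per hour

# Reliability at t = 5000 hours, Eq (21.46): R(t) = exp(-lam*t)
print(math.exp(-lam * 5000))   # -> 0.9048... (approximately 0.905)

# Mean time between failures, Eq (21.47): theta = 1/lam
print(1.0 / lam)               # -> 50000.0 hours, i.e., 5 x 10^4 hrs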
21.4.2
Series Configuration
For n such components connected in series, the system reliability is:

R_s(t) = ∏_{i=1}^{n} e^{−λᵢt} = e^{−(∑_{i=1}^{n} λᵢ)t} = e^{−λ_s t}    (21.48)

where

λ_s = ∑_{i=1}^{n} λᵢ    (21.49)
The component MTBFs are given by

θᵢ = 1/λᵢ    (21.50)

so that the system MTBF is

θ_s = 1/λ_s = 1/(λ₁ + λ₂ + ⋯ + λₙ)    (21.51)

or, in terms of the component MTBFs,

1/θ_s = 1/θ₁ + 1/θ₂ + ⋯ + 1/θₙ    (21.52)

In the special case where the n components are identical, each with failure rate λ and MTBF θ, these results reduce to:

R_s(t) = e^{−nλt}    (21.53)

λ_s = nλ    (21.54)

θ_s = θ/n    (21.55)
i.e., the system failure rate is n times the component failure rate, and the
MTBF is 1/n times that of the component.
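These series-configuration formulas translate directly into code; here is a minimal sketch, assuming hypothetical component failure rates (per hour):

import math

lam = [2e-5, 5e-5, 1e-5]                 # hypothetical component failure rates
thetas = [1.0 / l for l in lam]          # component MTBFs, Eq (21.50)

lam_s = sum(lam)                         # Eq (21.49): system failure rate
theta_s = 1.0 / lam_s                    # Eq (21.51): system MTBF

def R_series(t):
    return math.exp(-lam_s * t)          # Eq (21.48)

print(lam_s, theta_s, R_series(1000.0))
# Eq (21.52): reciprocal MTBFs add for a series system
assert abs(1.0 / theta_s - sum(1.0 / th for th in thetas)) < 1e-15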
21.4.3
Parallel Configuration
For n such components connected in parallel, the system reliability is:

R_s(t) = 1 − ∏_{i=1}^{n} (1 − e^{−λᵢt})    (21.56)

which is not the reliability function for an exponential failure-time distribution. In the special case where the component failure rates are identical, this expression simplifies to:

R_s(t) = 1 − (1 − e^{−λt})ⁿ    (21.57)
In general, the expression for the MTBF is difficult to derive, but it can be shown that it is given by:

θ_s = θ(1 + 1/2 + ⋯ + 1/n)    (21.58)

and the system failure rate by:

λ_s = 1/θ_s = 1/[θ(1 + 1/2 + ⋯ + 1/n)]    (21.59)
Some important implications of these results for the parallel system configuration are as follows (see also the short computational sketch after this list):
1. The MTBF for a system of n identical components in parallel is the indicated series sum to n terms multiplied by the individual component MTBF. Keep in mind that this assumes that each defective component is replaced when it fails (otherwise n cannot remain constant).
2. In going from a single component to two in parallel, the MTBF increases by 50% (from θ to 1.5θ), not 100%.
3. The law of diminishing returns is evident in Eq (21.59): as far as the MTBF for a parallel system configuration is concerned, the incremental benefit accruing from adding one more component to the system goes to zero as n → ∞.
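The diminishing returns are easy to see numerically. A quick illustration, assuming identical exponential components with a hypothetical MTBF of θ = 1000 hours:

theta = 1000.0

for n in (1, 2, 3, 5, 10, 20):
    # Eq (21.58): parallel-system MTBF grows only as the harmonic sum
    theta_s = theta * sum(1.0 / i for i in range(1, n + 1))
    print(n, round(theta_s, 1))
# n = 2 gives 1500.0 (a 50% gain over one component); by n = 20 each
# additional component adds only theta/20 = 50 hours.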
21.4.4
For the m-of-n parallel system of identical exponential components, the failure-time distribution is the gamma (Erlang) pdf:

f(t) = [λᵏ t^{k−1}/Γ(k)] e^{−λt}    (21.60)

with

k = (n − m) + 1    (21.61)
21.5
As noted earlier, the exponential reliability model is only valid for constant
failure rates. When failure rates are time dependent, the Weibull model is
more appropriate. Unfortunately, the Weibull model, being quite a bit more
complicated than the exponential model, does not lend itself as easily to the
sort of closed form analysis presented above for the exponential counterpart.
The component reliability is given from Eq (21.42) by:

Rᵢ(t) = e^{−(λᵢt)^β}    (21.62)
and when the failure rate exponent β is assumed to be the same for n components connected in series, the resulting system reliability is:

R_s(t) = ∏_{i=1}^{n} e^{−(λᵢt)^β} = e^{−(∑_{i=1}^{n} λᵢ^β)t^β} = e^{−(λ_s t)^β}    (21.63)

where:

λ_s = (∑_{i=1}^{n} λᵢ^β)^{1/β}    (21.64)
Thus, the failure-time distribution for the series system is also Weibull, and the system MTBF is given by:

θ_s = (1/λ_s)Γ(1 + 1/β) = Γ(1 + 1/β)/(∑_{i=1}^{n} λᵢ^β)^{1/β}    (21.66)
For the parallel configuration of such components, the system reliability is:

R_s(t) = 1 − ∏_{i=1}^{n} [1 − e^{−(λᵢt)^β}]    (21.67)
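The Weibull series results are just as easy to evaluate as the exponential ones. A sketch, with hypothetical scale parameters and a shared shape parameter β:

import math

beta = 1.5
lam = [2e-4, 3e-4, 1e-4]                 # hypothetical component scale parameters

lam_s = sum(l ** beta for l in lam) ** (1.0 / beta)      # Eq (21.64)

def R_series(t):
    return math.exp(-(lam_s * t) ** beta)                # Eq (21.63)

theta_s = math.gamma(1.0 + 1.0 / beta) / lam_s           # Eq (21.66)

print(lam_s, R_series(1000.0), theta_s)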
21.6
Life Testing
The experimental procedure for determining component and system reliability and lifetimes parallels the procedure for statistical inference discussed in Part IV: it involves selecting an appropriate random sample of components and testing them under prescribed conditions. The relevant data are the times-to-failure observed for individual components of the system. Such experiments are usually called life tests and the general procedure is known as life testing. There are several different types of life tests, a few of the most common of which are listed below:
1. Replacement tests: where each failing component is replaced by a new one immediately upon failure;
2. Nonreplacement tests: where a failing component is not replaced;
3. Truncated tests: where, because the mean lifetime is so long that testing to failure is impractical, uneconomical, or both, the test is stopped (truncated) after (i) a fixed pre-specified time, or (ii) the first r < n failures;
4. Accelerated tests: where high-reliability components are tested under conditions far more severe than normal, in order to accelerate component failure and thereby reduce test time and the total number of components to be tested. The true natural reliability is extracted from such accelerated tests via standard analysis tools calibrated for the implied time-compression.
Once more, we caution the reader that the ensuing abbreviated discussion is meant to be merely illustrative, nowhere near the fuller, more comprehensive discussion of the fundamental principles and results that are available in such book-length treatments as that in Nelson (2003).
21.6.1

As shown earlier in Section 21.4, the exponential model is the most appropriate lifetime model during the useful life period. The main feature of the life tests for this model is that n components are life-tested independently and testing is discontinued after r ≤ n have failed. The experimental result is the set of observed failure times: t₁ ≤ t₂ ≤ t₃ ≤ ⋯ ≤ t_r, where tᵢ is the failure time of the ith component to fail.
Statistical inference in this case involves the usual problems: estimation of the key population parameter, θ = 1/λ, of the exponential failure-time
distribution, the mean component lifetime; and hypothesis testing about the
parameter estimate, but with a twist.
Estimation
It can be shown that an unbiased estimate for θ is given by:

θ̂ = T_r/r    (21.68)

where T_r is the accumulated life until the rth failure, given by:

T_r = ∑_{i=1}^{r} tᵢ + (n − r)t_r    (21.69)
for non-replacement tests. The first term is the total lifetime of the r failed components; the second term is the lower bound on the remaining lifetime of the surviving (n − r) components. Note that for these non-replacement tests, if r = n, then θ̂ is exactly equal to the mean of the observed failure times.
For replacement tests,

T_r = nt_r    (21.70)

From here, the failure rate is estimated as

λ̂ = 1/θ̂    (21.71)

and the reliability function as

R̂(t) = e^{−t/θ̂}    (21.72)
It can be shown that these estimates are biased, but the bias diminishes as
the sample size n increases.
Precision of Estimates
The statistic

W_r = 2T_r/θ    (21.73)

can be shown to possess a χ²(2r) distribution, from which the following (1 − α) × 100% confidence interval for θ is obtained:

2T_r/χ²_{1−α/2}(2r) < θ < 2T_r/χ²_{α/2}(2r)    (21.75)
TABLE 21.1: Summary of H₀ rejection conditions for the test of hypothesis based on an exponential model of component failure-time

For general α:

Testing Against    Reject H₀ if:
Ha: θ < θ₀         w_r < χ²_α(2r)
Ha: θ > θ₀         w_r > χ²_{1−α}(2r)
Ha: θ ≠ θ₀         w_r < χ²_{α/2}(2r) or w_r > χ²_{1−α/2}(2r)
Hypothesis Testing
To test the null hypothesis

H₀: θ = θ₀

against the usual triplet of alternatives:

Ha: θ > θ₀
Ha: θ < θ₀
Ha: θ ≠ θ₀

again, we follow the principles discussed in Chapter 15. We use the test statistic W_r defined in Eq (21.73) and its sampling distribution, the χ²(2r) distribution, to obtain the usual rejection criteria, shown for this specific case in Table 21.1, where w_r is the specific value of the statistic obtained from experimental data, i.e.,

w_r = 2T_r/θ₀    (21.76)
Even though all these closed-form results are available, none of these statistical inference exercises are conducted by hand any longer. As with the
examples discussed in Chapters 14 and 15, computer programs are routinely
used for such data analysis. Nevertheless, we use the next example to illustrate
some of the mechanics behind the computations.
Example 21.7: LIFE TESTS FOR ENERGY-SAVING LIGHT
BULBS
To characterize the lifetime of a new brand of energy-saving light bulbs, a sample of 10 was tested in a specially designed facility where they could be left on continuously and monitored electronically to record the precise number of hours until burn-out. The experimental design calls for halting the test immediately after 8 of the 10 light bulbs have burned
out. The result, in thousands of hours, arranged in increasing order is
as follows:
(1.599, 3.380, 5.068, 8.478, 8.759, 9.256, 11.475, 14.382)
i.e., the first light bulb to burn out did so after 1,599 hours, the next after 3,380 hours, and the 8th and final one after 14,382 hours. Obtain an estimate of the mean lifetime and test the hypothesis that it is 12,000 hours against the alternative that it is not.
Solution:
For this problem,

T_r = 62.397 + (10 − 8) × 14.382 = 91.161    (21.77)

so that the estimated mean lifetime (in thousands of hours) is

θ̂ = 91.161/8 = 11.40    (21.78)

The specific value of the test statistic, from Eq (21.76) with θ₀ = 12, is

w_r = (2 × 91.161)/12 = 15.194    (21.79)

And from the chi-square distribution, we obtain χ²_{0.025}(16) = 6.91 and χ²_{0.975}(16) = 28.8. And now, since 15.194 does not lie in the rejection region, we find no evidence to reject the null hypothesis. We therefore conclude that it seems reasonable to assume that the true mean lifetime of this new brand of light bulb is 12,000 hours as specified.
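For readers who prefer to verify these mechanics computationally, the following Python sketch (using scipy.stats for the chi-square quantiles) reproduces the numbers in Example 21.7:

import scipy.stats as st

# The eight recorded burn-out times, in thousands of hours, from the example.
times = [1.599, 3.380, 5.068, 8.478, 8.759, 9.256, 11.475, 14.382]
n, r = 10, 8                      # 10 bulbs tested; test truncated at 8 failures

T_r = sum(times) + (n - r) * times[-1]        # Eq (21.69): accumulated life
theta_hat = T_r / r                           # Eq (21.68): estimated mean lifetime
w_r = 2 * T_r / 12.0                          # Eq (21.76) with theta_0 = 12

# Two-sided rejection boundaries at alpha = 0.05; chi-square with 2r = 16 df
lo = st.chi2.ppf(0.025, 2 * r)                # ~6.91
hi = st.chi2.ppf(0.975, 2 * r)                # ~28.8

print(round(T_r, 3), round(theta_hat, 2), round(w_r, 3))   # 91.161, 11.4, 15.194
print("reject H0" if (w_r < lo or w_r > hi) else "fail to reject H0")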
21.6.2

Recall that for the Weibull model the mean lifetime is given by:

E(T) = (1/λ)Γ(1 + 1/β)    (21.80)
Life testing is aimed at acquiring data from which the population parameters, λ and β, will be estimated. Unfortunately, unlike with the exponential model, estimating these Weibull parameters can be tedious and difficult, requiring either numerical methods or old-fashioned graphical techniques that are based on many simplifying approximations. Even more so than with the relatively simpler exponential model case, computer software must be employed for carrying out parameter estimation and hypothesis tests for the Weibull model. Additional details lie outside the intended scope of this chapter but are available in the book by Nelson (2003), which is highly recommended to the interested reader.
21.7
The exposure to the topic of reliability and life testing provided in this chapter was designed to serve two purposes. First is the general purpose of Part V: to showcase, no matter how briefly, some substantial subject matters that are based entirely on applications of probability and statistics. Second is the specific purpose of illustrating how the reliability and the lifetimes of components and systems are characterized and analyzed. The scope of coverage was deliberately limited, but still with the objective of providing enough material such that the reader can develop a sense of what these studies entail. We presented reliability, for a component or a system, as a probability: the probability that the component or system functions as desired, for at least a specified period of time. The techniques discussed for determining system reliability given component reliabilities and system configuration produced some interesting results, two of which are summarized below:
Product Law of Reliabilities: For systems consisting of n components connected in series, each with reliability Rᵢ, the system reliability, R_s, is a product of the component reliabilities; i.e., R_s = ∏_{i=1}^{n} Rᵢ. Since 0 < Rᵢ < 1, the reliability of a system of series components therefore diminishes as the number of components increases.

Product Law of Unreliabilities: When the n components of a system are arranged in parallel, the system unreliability, (1 − R_s), is a product of the component unreliabilities; i.e., 1 − R_s = ∏_{i=1}^{n} (1 − Rᵢ). Thus, the reliability of a system of parallel components improves as the number of components increases; the additional components simply act as redundant backups.
Computing the reliabilities of more complex systems requires reducing such systems to a collection of simple modules and, in the case of systems with cross-links, choosing a keystone component and invoking the theorem of total probability.
As far as specific models of failure times are concerned, we focused only on the exponential and Weibull models, the two most widely used in practice. In discussing failure-time distributions and their characteristics, we were able to revisit some of the special lifetime distributions presented earlier in Chapter 4 (especially the survival function and the hazard function) here in their more natural habitats.
While reliability analysis depends entirely on probability, not surprisingly, life testing, the experimental determination of component and system reliability characteristics, relies on statistical inference: estimation and hypothesis testing. How these ideas are used in practice is illustrated further with the end-of-chapter exercises and problems.
REVIEW QUESTIONS
1. What is the definition of the reliability of a component or a system?
2. What are the two factors that determine the overall reliability of a system consisting of several components?
3. What is the defining problem of system reliability?
4. In terms of system configuration, what is a simple system as opposed to a complex system?
5. What is the product law of reliabilities, and to which system configuration does it apply?
6. Why is system reliability for a series configuration independent of the order in which the components are arranged?
7. What is the product law of unreliabilities, and to which system configuration does it apply?
8. As n, the number of components in a series configuration, increases, what happens to R_s, the system reliability? Does it increase or decrease?
9. As n, the number of components in a parallel configuration, increases, what happens to R_s, the system reliability? Does it increase or decrease?
10. What is a module?
11. What is the analysis technique for determining the reliability of series-parallel systems?
12. What is a k-of-n parallel system?
13. Why is it more complicated than usual to determine the reliability of systems with cross-links?
14. What special component designation is needed in analyzing the reliability of systems with cross-links?
15. What is the survival function, S(t), and how is it related to the cumulative distribution function, F(t)?
16. In lifetime studies, the cumulative distribution function, F(t), is the same as what system characteristic?
17. What is the hazard function, h(t), and how is it related to the standard pdf, f(t)?
18. Why is the failure rate (hazard function) curve known as the "bathtub" curve?
19. The "rate" in the failure rate is not with respect to time; it is with respect to what?
20. What are the three distinct parts of the hazard function (failure rate) curve?
21. What is the distribution of failure times for random chance failure, with constant failure rate, λ?
22. What is the reliability function for components or systems with exponential failure-time distributions?
23. What is the definition of mean-time-between-failures (MTBF)?
24. What is a decreasing failure rate (DFR) as opposed to an increasing failure rate (IFR) model?
25. What is the reliability function for components or systems with the Weibull failure-time distribution?
26. What is the failure-time distribution for a series configuration of n components, each with an exponential failure-time distribution?
27. What is the relationship between the MTBF of a series configuration of n components, each with exponential failure-time distributions, and the MTBFs of the components?
28. In what way is the law of diminishing returns manifested in the MTBF for a parallel configuration of n identical systems with exponential reliability?
29. What is the MTBF for an m-of-n parallel system with exponential component reliabilities?
30. What is the failure-time distribution for a series configuration of n components, each with a Weibull failure-time distribution?
31. What is life testing?
32. What is a replacement test, a non-replacement test, a truncated test, or an accelerated test?
33. In life testing, what is the accumulated life until the rth failure?
34. What test statistic is used in life testing with the exponential model? What is its sampling distribution?
[Figure for an end-of-chapter exercise: block diagram with components G1, RF1, G2, RF2, G3, and water supply W.]
[Figure for an end-of-chapter exercise: power plant system with pumps P1, P2, P3 feeding heat exchangers HX1, HX2, HX3 in a 2-out-of-3 configuration.]
(ii) If the power plant were redesigned such that only one of the heat exchangers is
required, by how much will the system reliability increase?
21.4 Rs , the reliability of the system shown below, was obtained in the text using
component C3 as the keystone.
FIGURE 21.9: Fluid flow system with a cross-link (from Fig 21.5)
(i) Choose C2 as the keystone and obtain R_s again. Compare your result with Eq (21.30) in the text.
(ii) Now choose C1 as the keystone and repeat (i). Compared with the derivation required in (i), which keystone choice led to a more straightforward analysis?
(iii) Given specific component reliabilities for the system as: R₁ = 0.93; R₂ = 0.99; R₃ = 0.93; R₄ = 0.99, where Rᵢ represents the reliability of component Cᵢ, compare the reliability of the system with and without the cross-link and comment on how the presence of the cross-link affects this specific system's reliability.
21.5 An old-fashioned fire alarm system consists of a detector, D, and an electrically operated bell, B. The system works as follows: if a fire is detected, a circuit is completed and the electrical signal reaching the bell will cause it to ring. The reliability of the detector is 0.9 and that of the bell is 0.995.
(i) What is the reliability of the complete fire alarm system?
(ii) If another identical detector and bell combination is installed in standby, by how much will the reliability of the new augmented fire alarm system improve?
(iii) It has been recommended, as a cost-saving measure, to purchase for the back-
[Figure: fire alarm system with detector D and bell B, plus backup detector D2 and backup bell B2.]
[Figure for an end-of-chapter exercise: system with components S1, C1, HX1 and S2, C2, HX2.]
21.8 Pottmann et al. (1996) presented the following simplified block diagram for the mammalian blood pressure control system. The baroreceptors are themselves systems of pressure sensors, and the sympathetic and parasympathetic systems are separate control systems that are part of the nervous system. These subsystems are not entirely all mutually independent; for the purposes of this problem, however, they can be considered as such. Consider an experimental rat for which the indicated
[Figure: simplified block diagram of the blood pressure control system, showing the pressure setpoint, sympathetic and parasympathetic systems, heart, cardiovascular system, carotid sinus pressure, heart rate, and Type I and Type II baroreceptors.]
preventing accidents; the higher the safety system reliability, the lower the risk, but it will never be zero. Second, attaining high system reliability is not cheap, whether it is realized with individual components of high reliability, or by multiple redundancies. Last, but not least, even though high reliability is expensive, the repercussions of a single safety catastrophe with this manufacturing process are enormous in financial terms, besides the lingering effects of bad publicity that can take decades to overcome, if ever.
Engineers designing a safety system for a specific plant were therefore faced with a difficult optimization problem: balancing the high cost of a near-perfect system against the enormous financial repercussions of a single catastrophic safety event. But an optimum solution can be obtained as follows.
Let C₀ be the cost of a reference system with mediocre reliability of 0.5 (i.e., a system with a 50% probability of failure). For the particular process in question, the total cost of installing a system with reliability R_s is given by:

𝒞_R = C₀R_s/(1 − R_s)    (21.81)

Similarly, if C_F is the cost of a single catastrophic failure, the expected cost of failure is:

𝒞_F = C_F(1 − R_s)    (21.82)

where, of course, (1 − R_s) is the probability of system failure. Note that, as expected, this is a monotonically (specifically, linearly) decreasing function of system reliability.
We may now observe that the ideal system will be one with a reliability that
minimizes the total expected costs, achieving a high enough reliability to reduce the
risk of failure, but not so much that the cost of reliability is prohibitive.
(i) Determine such a reliability, R*, by minimizing the objective

Φ = 𝒞_F + 𝒞_R = C_F(1 − R_s) + C₀R_s/(1 − R_s)    (21.83)

the total expected cost; show that the desired optimum reliability is given by:

R* = 1 − √(C₀/C_F) = 1 − √ρ    (21.84)

where ρ = C₀/C_F is the ratio of the base reference cost of reliability to the cost of failure. Discuss why this result makes sense by examining the prescribed R* as a function of the ratio of the two costs, C₀ and C_F.
(ii) For a particular system where C₀ = $20,000 and for which C_F = $500 million, determine R*, and the cost of the safety system whose reliability is R*.
(iii) If the cost of the catastrophe were to double, determine the new value of R*, and the corresponding cost of the recommended safety system.
(iv) If a single composite system with the reliability R* determined in (ii) above is unavailable, and only a system with reliability 0.85 is available, how many of the available systems will be required to achieve the desired reliability? How should these be configured?
21.10 For certain electronic components, survival beyond an initial period from t = 0 to t = τ is most crucial because thereafter, the failure rate becomes virtually negligible. For such cases, the hazard function (i.e., the failure rate) may be approximated as follows:

h(t) = α(1 − t/τ),  0 < t < τ;  h(t) = 0 otherwise    (21.85)

Obtain an expression for f(t), the failure-time distribution, and the corresponding cumulative distribution, F(t). From these results show that for such electronic components, the reliability function is given by:

R(t) = e^{−αt(1 − t/2τ)}

during the initial period, 0 < t < τ, and that thereafter (for t > τ), it is:

R(t) = e^{−ατ/2}
21.11 The time-to-failure, T, of an electronic component is known to be an exponentially distributed random variable with pdf

f(t) = λe^{−λt},  0 < t < ∞;  f(t) = 0 elsewhere

where the failure rate λ = 0.075 per 100 hours of operation.
(i) If the component reliability function Rᵢ(t) is defined as

Rᵢ(t) = P(T > t)    (21.86)

the probability that the component functions at least up until time t, obtain an explicit expression for Rᵢ(t) for this electronic component.
(ii) A system consisting of two of such components in parallel functions if at least one of them functions; again assuming that both components are identical, find the system reliability R_p(t) and compute R_p(1000), the probability that the system survives at least 1,000 hours of operation.
21.12 The failure times (in hours) for 15 electronic components are given below:

337.0   408.9   290.5   183.4   219.7
174.2   739.2   330.8   900.4   102.2
 36.7   731.4   348.6    73.5    44.7
(i) First confirm that the data are reasonably exponentially distributed and then obtain an estimate of the mean lifetime.
(ii) The company that manufactures the electronic components claims that the mean lifetime is 400 hours. Test this hypothesis against the alternative that the mean lifetime is lower. What is the conclusion of this test?
(iii) Using the estimated mean lifetime to determine the exponential population mean failure rate, λ, compute the probability that a system consisting of two of these components in parallel functions beyond 400 hours.

21.13 Refer to Problem 21.12. This time, consider that the life test was stopped, by design, after 500 hours. Repeat the entire problem and compare the results. How close to the full-data results are the results from the truncated test?
Chapter 22
Quality Assurance and Control
22.1 Introduction
22.2 Acceptance Sampling
  22.2.1 Basic Principles
    Basic Characteristics of Sampling Plans
  22.2.2 Determining a Sampling Plan
    The Operating Characteristic (OC) Curve
    Approximation Techniques
    Characteristics of the OC Curve
    Other Considerations
22.3 Process and Quality Control
  22.3.1 Underlying Philosophy
  22.3.2 Statistical Process Control
  22.3.3 Basic Control Charts
    The Shewhart Xbar Chart
    The S-Chart
    Variations to the Xbar-S Chart: Xbar-R, and I & MR Charts
    The P-Chart
    The C-Chart
  22.3.4 Enhancements
    Motivation
    Western Electric Rules
    CUSUM Charts
    EWMA Charts
22.4 Chemical Process Control
  22.4.1 Preliminary Considerations
  22.4.2 Statistical Process Control (SPC) Perspective
  22.4.3 Engineering/Automatic Process Control (APC) Perspective
  22.4.4 SPC or APC
    When SPC is More Appropriate
    When APC is More Appropriate
22.5 Process and Parameter Design
  22.5.1 Basic Principles
  22.5.2 A Theoretical Rationale
22.6 Summary and Conclusions
REVIEW QUESTIONS
PROJECT ASSIGNMENTS
  1. Tracking the Dow
  2. Diabetes and Process Control
  3. C-Chart for Sports Team
Mass production, a uniquely 20th century invention, unquestionably transformed industrial productivity by making possible the manufacture of large quantities of products in a relatively short period of time. But making a product faster and making lots of it does not necessarily mean much if the product is not made well. If anything, making the product well every time, and all the time, became a more challenging endeavor with the advent of mass production. It is only natural, therefore, that assuring the quality of mass-produced goods has since become an integral part of any serious manufacturing enterprise. At first, acceptance sampling was introduced by customers to protect themselves from inadvertently receiving products of inferior quality and only discovering these defective items afterwards. Before a manufactured lot is accepted by the consumer, the strategy calls for a sample to be tested first, with the results of the test serving as a rational basis for deciding whether to accept the entire lot or to reject it. Producers later incorporated acceptance sampling into their product release protocols to prevent sending out products of inferior quality. However, such an after-the-fact strategy was soon recognized as inefficient and, in the long run, too expensive. The subsequent evolution of quality assurance through process and quality control (where the objective is to identify causes of poor quality and correct them during production) to the total quality management philosophy of zero defects (which requires, in addition, the design of processes and process operating parameters to minimize the effect of uncontrollable factors on product quality) was rapid and inevitable.
A complete and thorough treatment of quality assurance and control requires more space than a single chapter can afford. As such, our objective in this chapter is more modestly set at providing an overview of the key concepts underlying three primary modes of quality assurance. We discuss first acceptance sampling, from the consumer's as well as the producer's perspectives; we then discuss in some detail process and quality control, where the focus is on the manufacturing process itself. This discussion covers the usual terrain of statistical process control charts, but adds a brief section on engineering/automatic process control, comparing and contrasting the two philosophies. The final overview of Taguchi methods is quite brief, providing only a flavor of the concepts and ideas.
22.1
Introduction
22.2
Acceptance Sampling
22.2.1
Basic Principles
22.2.2
Determining a Sampling Plan
Let the random variable X represent the number of defective items found in a sample of size n, drawn from a population of size N whose true but unknown fraction of defectives is θ. Clearly, from Chapter 8, we know that X is a hypergeometric random variable with pdf:

f(x) = C(Nθ, x) C(N − Nθ, n − x)/C(N, n)    (22.1)

where C(a, b) denotes the binomial coefficient, "a choose b."
At the most basic level, the problem is, in principle, that of estimating θ from sample data, and testing the hypothesis:

H₀: θ ≤ θ₀    (22.3)
Ha: θ > θ₀    (22.4)
The probability of accepting the lot, given a true defective fraction θ, is then:

P(A|θ) = P(X ≤ c) = ∑_{x=0}^{c} f(x)    (22.5)

= ∑_{x=0}^{c} C(Nθ, x) C(N − Nθ, n − x)/C(N, n)    (22.6)
FIGURE 22.1: OC Curve for a lot size of 1000, sample size of 32 and acceptance
number of 3: AQL is the acceptance quality level; RQL is the rejection quality level.
1. Clearly, if θ = 0 (no defective item in the lot), the probability of acceptance is 1; i.e.,

P(A|0) = 1    (22.7)

2. As θ increases, P(A|θ) decreases; in particular, if θ = 1,

P(A|1) = 0    (22.8)
If we now define:

P(A|θ₀) = 1 − α    (22.9)
P(A|θ₁) = β    (22.10)

where α is the producer's risk, and β, the consumer's risk, and if we retain the definitions given above for θ₀, the AQL, and θ₁, the RQL, then the following two equations:

1 − α = ∑_{x=0}^{c} f(x|θ₀) = ∑_{x=0}^{c} C(Nθ₀, x) C(N − Nθ₀, n − x)/C(N, n)    (22.11)

β = ∑_{x=0}^{c} f(x|θ₁) = ∑_{x=0}^{c} C(Nθ₁, x) C(N − Nθ₁, n − x)/C(N, n)    (22.12)
locate two points on the OC curve such that (i) there is a probability α of rejecting a lot whose true defective fraction, θ, is less than the AQL, θ₀; and (ii) a probability β of accepting (more precisely, not rejecting) a lot whose true defective fraction, θ, is higher than the RQL, θ₁. These two equations therefore allow simultaneous consideration of both risks.
Given N, θ₀ and θ₁, the only unknowns in these equations are n and c; x is an index that runs from 0 to c. The simultaneous solution of the equations produces the sampling plan. In general, there are no closed-form analytical solutions to these equations; they must be solved numerically with the computer. If the specified values for θ₀, θ₁, α, and β are reasonable, such that a feasible solution of an (n, c) pair exists, the obtained solution can then be used to generate the OC curve for the specific problem at hand. Otherwise, the specified parameters will have to be adjusted until a feasible solution can be found.
Thus, to generate a sampling plan, one must specify four parameters: (i) θ₀, the acceptable quality level (AQL); (ii) θ₁, the rejectable quality level (RQL); along with (iii) α, the producer's risk, and (iv) β, the consumer's risk. The resulting sampling plan is the pair of values n and c, used as follows: the number of samples to take from the lot and test is prescribed as n; after testing, x, the number of defectives found in the sample, is compared with c; if x ≤ c, the lot is accepted; if not, the lot is rejected.
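Evaluating P(A|θ) in Eq (22.6) exactly is straightforward with a statistics library. A minimal sketch, using scipy.stats and the plan parameters of the OC curve in Fig 22.1 (N = 1000, n = 32, c = 3); any other plan can be substituted:

from scipy.stats import hypergeom

N, n, c = 1000, 32, 3

def prob_accept(theta):
    """P(X <= c) for X ~ Hypergeometric(N, N*theta, n); Eq (22.6)."""
    n_defective = round(N * theta)
    return hypergeom.cdf(c, N, n_defective, n)

for theta in (0.0, 0.04, 0.1, 0.2, 0.4):
    print(theta, round(prob_accept(theta), 3))
# P(A|0) = 1 and P(A|theta) decreases toward 0 as theta grows,
# consistent with Eqs (22.7) and (22.8).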
Approximation Techniques
Before presenting an example to illustrate these concepts, we note that the computational effort involved in solving Eqs (22.11) and (22.12) can be reduced significantly by employing well-known approximations to the hypergeometric pdf. First, recall that as N → ∞ the hypergeometric pdf tends to the binomial pdf, so that for large N, Eq (22.6) becomes

P(A|θ) ≈ ∑_{x=0}^{c} C(n, x) θˣ(1 − θ)^{n−x}    (22.13)
which is significantly less burdensome to use, especially for large N. This is the binomial approximation OC curve.
When the sample size, n, is large, and hence Eq (22.13) itself becomes tedious, or when the quality assessment is based not on the binary "go/no go" attribute of each tested item but on the number of defects per item, the Poisson alternative to Eq (22.13) is used. This produces the Poisson OC curve.
Recall that as n → ∞ and θ → 0, but in such a way that nθ = λ, the binomial pdf tends to the Poisson pdf; then under these conditions, Eq (22.13) becomes

P(A|θ) ≈ ∑_{x=0}^{c} (nθ)ˣ e^{−nθ}/x!    (22.14)
Acceptance Sampling by Attributes
Measurement type: Go/no go
Lot quality in proportion defective
Lot size: 1000
Use binomial distribution to calculate probability of acceptance

Acceptable Quality Level (AQL)            0.04
Producer's Risk (Alpha)                   0.05
Rejectable Quality Level (RQL or LTPD)    0.2
Consumer's Risk (Beta)                    0.1

Generated Plan(s)
Sample Size           32
Acceptance Number      3
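This generated plan can be reproduced with a simple search. The following sketch (not MINITAB's actual algorithm) looks for the smallest n, with matching c, such that the binomial-approximation versions of Eqs (22.11) and (22.12) both hold:

from scipy.stats import binom

AQL, alpha = 0.04, 0.05     # theta_0 and producer's risk
RQL, beta = 0.20, 0.10      # theta_1 and consumer's risk

def find_plan(max_n=500):
    for n in range(1, max_n + 1):
        # smallest c with P(X <= c | theta_0) >= 1 - alpha
        c = int(binom.ppf(1 - alpha, n, AQL))
        # check the consumer's-risk requirement at theta_1
        if binom.cdf(c, n, RQL) <= beta:
            return n, c
    return None

print(find_plan())    # -> (32, 3), matching the generated plan above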
If the AQL and RQL specifications are changed, the sampling plan will change, as will the OC curve. For example, if in Example 22.1 the AQL is changed to 0.004 (only 4 items in 1000 are allowed to be defective for the lot to be acceptable) and the RQL changed to 0.02 (if 20 items or more in 1000 are defective, the lot will be unacceptable), then the sampling plan changes to n = 333 while the acceptance number remains at 3; i.e., 333 samples will be selected for inspection and the lot will be accepted only if the number of defectives found in this sample is 3 or fewer. The resulting OC curve is shown in Fig 22.2, where the reader should pay close attention to the scale on the x-axis, compared to that in Fig 22.1.
Characteristics of the OC Curve
Upon some reflection, we see that the shape of the OC curve is actually indicative of the power of the sampling plan to discriminate between "good" and "bad" lots. The steeper the OC curve, the better it is at separating good lots from bad; and the larger the sample size, n, the steeper the OC curve. (Compare, for example, Figs 22.1 and 22.2 on the same scale.)
The shape of the ideal OC curve is a perfect narrow rectangle around the value of the AQL, θ₀: every lot with θ ≤ θ₀ will be accepted and every lot with θ > θ₀ rejected. However, this is unrealistic for many reasons, not least of all being that to obtain such a curve will require almost 100% sampling. The reverse S-shaped curve is more common and more practical. Readers familiar with the theory of signal processing, especially filter design, may recognize the similarity between the OC curve and the frequency characteristics of filters: the ideal OC curve corresponds to a notch filter, while the typical OC curves correspond to low-pass, first-order filters with time constants of varying magnitude.
FIGURE 22.2: OC Curve for a lot size of 1000, generated for a sampling plan for an AQL = 0.004 and an RQL = 0.02, leading to a required sample size of 333 and acceptance number of 3. Compare with the OC curve in Fig 22.1.

Finally, we note that the various discussions of "Power and Sample Size" in Chapter 15 could have been framed in terms of the OC curve; and in fact many textbooks do so.
Other Considerations
There are other issues associated with acceptance sampling, such as Average Outgoing Quality (AOQ), Average Total Inspection (ATI), and the development of acceptance plans for continuous measures of quality and the concept of the acceptance region; these will not be discussed here, however. It is important for the reader to recognize that although important from a historical perspective, acceptance sampling is not considered to be very cost-effective as a quality assurance strategy from the perspective of the producer. It does nothing about the process responsible for making the product and has nothing to say about the capability of the process to meet the customer's quality requirements. It is an after-the-fact, post-production strategy that cannot be the primary tool in the toolbox of a manufacturing enterprise that is serious about producing good-quality products.
22.3
Process and Quality Control

22.3.1
Underlying Philosophy

22.3.2
Statistical Process Control
μ, exactly equals the desired target value, μ₀ (or the historical, long-term average value).
At the most fundamental level, therefore, statistical process control involves taking a representative process variable whose value, Y, is either a direct measure of the product quality of interest, or at least related to it, and assessing whether the observed value, y, is stable and not significantly different from μ₀. Because of the inherent variability associated with sampling, and also with the determination of the measured value itself, this problem requires probability and statistics. In particular, observe that one can pose the question, "Is y significantly different from μ₀?", in the form of the following hypothesis test:

H₀: μ_Y = μ₀
Ha: μ_Y ≠ μ₀    (22.15)

a problem we are very familiar with solving, provided an appropriate probability model is available for Y. In this case, we do not reject the null hypothesis, at the significance level of α, if (Y_L ≤ Y ≤ Y_U), where the values Y_L and Y_U at the rejection boundary are determined from the sampling distribution such that:

P(Y_L ≤ Y ≤ Y_U) = 1 − α    (22.16)
This equation is the foundation of one of the iconic characteristics of SPC; it suggests a convenient graphical technique involving 3 lines:
1. A center line for μ₀, the desired target for the random variable, Y;
2. An Upper Control Limit (UCL) line for Y_U; and
3. A Lower Control Limit (LCL) line for Y_L,
on which each acquired value of Y is plotted. Observe then that a point falling outside of these limits signals the rejection of the null hypothesis in favor of the alternative, at the significance level of α, indicating an "out-of-control" status. A generic SPC chart of this sort is shown in Fig 22.3, where the sixth data point is out of limits. Points within the control limits are said to show variability attributable to "common cause" effects; "special cause" variability is considered responsible for points falling outside the limits.
In traditional SPC, when an "out-of-control" situation is detected, the recommended corrective action is to find and eliminate the problem, the practical implementation of which is obviously process-specific, so that the instruction cannot be more specific than this. But in the discrete-parts manufacturing industry, there is significant cost associated with finding and correcting problems. There is therefore significant incentive to minimize false out-of-control alarms.
Finally, before beginning the discussion of specific charts, we note that the nature of the particular quantity Y that is of interest in any particular case clearly determines the probability model underlying the chart, which in turn determines how Y_U and Y_L are computed. The ensuing discussion of various SPC charts is from this perspective.

FIGURE 22.3: A generic SPC chart for the generic process variable Y indicating a sixth data point that is out of limits.
22.3.3
Basic Control Charts

Control charts are graphical (visual) means of monitoring process characteristics. They typically consist of two plots: one for monitoring the mean value of the process variable in question; the other for monitoring the variability, although the chart for the mean customarily receives more attention. These charts are nothing but graphical means of carrying out the hypothesis tests H₀: Process Mean = Target; Process Variability = Constant, versus the alternative Ha: Process Mean ≠ Target, and/or Process Variability ≠ Constant. In practice, these tests are implemented in real time by adding each new set of process/product data as they become available. Modern implementations involve displays on computer screens that are updated at fixed intervals of time, with alarms sounding whenever an alarm-worthy event occurs.
It is important to stress that the control limits indicated in Fig 22.3 are not specification limits; these control limits arise strictly from the sampling distribution of the process variable, Y, and are indicative of the typical variability intrinsic to the process. The control limits enable us to determine if the observed variability is in line with what is typical. In the language of the quality movement, these control limits therefore constitute the "voice of the process." Specification limits, on the other hand, have nothing to do with the process; they are specified by the customer, independent of the process, and therefore constitute what is known as the "voice of the customer."
A few of the various charts that exist for various process and product variables and attributes are now discussed.
The Shewhart Xbar Chart
By far the oldest, most popular, and most recognizable control chart is the Shewhart chart, named for Walter A. Shewhart (1891-1967), the Bell Labs physicist and engineer credited with pioneering industrial statistical quality control. In its most basic form, it is a chart used to track the sample mean, X̄, of a process or product variable: for example, the mean outer diameter of ball bearings; the mean length of 6-inch nails; the mean liquid volume of 12-ounce cans of soda; the mean Mooney viscosity of several samples of an elastomer, etc. The generic variable Y in this case is X̄, the sample mean of the process measurements.
The data requirement is as follows: a random sample, X₁, X₂, . . . , Xₙ, is obtained from the process in question, from which the average, X̄, and standard deviation, S_X, are computed. The probability model underlying the Shewhart chart is the Gaussian distribution, justified as follows. There are many instances where the variable of interest, X, is itself approximately normally distributed, in which case X̄ ~ N(μ₀, σ²_X̄); but even when X is not normally distributed, for most random variables, N(μ₀, σ²_X̄) is a reasonable approximate distribution for X̄, given a large enough sample (as a result of the Central Limit Theorem).
With this sampling distribution for X̄, we are able to compute the following probability:

P(−3σ_X̄ < X̄ − μ₀ < 3σ_X̄) = 0.9974    (22.17)

providing the characteristic components of the Shewhart chart: the control limits are 3σ_X̄ to each side of the target value μ₀ on the center line; and the confidence level is (1 − α) × 100% = 99.7%. The bounds are therefore commonly known as "3-sigma limits." The α-risk of false out-of-control alarms is thus very low, at 0.003.
Example 22.2: X-BAR CONTROL CHART FOR 6-INCH NAILS
Every five minutes, a random sample of 3 six-inch nails is selected from a manufactured lot and measured for conformance to the specification. The data in Table 22.1 is a record of the measurements determined over the first hour of a shift. Obtain an X-bar chart and identify whether or not the manufacturing process is in control.
Solution:
The points to be plotted are the averages of the three samples corresponding to each sample time; the center line is the target specification of 6 inches. To obtain the control limits, however, observe that we have not been given the process standard deviation. This is obtained from the data set itself, assuming that the process is in control.
Computer programs such as MINITAB can be used to obtain
the desired X-bar chart. Upon entering the data into a worksheet in MINITAB, the sequence Stat > Control Charts > Variables Charts for Subgroups > X-bar opens a self-explanatory dialog where the problem characteristics are entered. The result is the chart shown in Fig 22.4. Observe that the entire collection of data, the twelve average values, all fall within the control limits, implying that the process appears to be in control.

FIGURE 22.4: The X-bar chart for the average length measurements for 6-inch nails determined from samples of three measurements obtained every 5 mins.
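The limit computations behind Fig 22.4 are easy to sketch directly. The following Python fragment uses the conventional range-based estimate of the process standard deviation (σ̂ = R̄/d₂, with the standard constant d₂ = 1.693 for subgroups of three); only samples 5-12 of Table 22.1 are recoverable here, so the computed limits differ slightly from MINITAB's:

import math

subgroups = [
    [6.01, 6.17, 5.90], [5.86, 6.03, 5.93], [6.17, 5.81, 5.95],
    [6.10, 6.09, 5.95], [6.06, 6.06, 6.07], [6.09, 6.07, 6.07],
    [6.00, 6.10, 6.03], [5.98, 6.01, 5.97],
]
target, n = 6.0, 3
d2 = 1.693                      # E(R) = d2 * sigma for subgroups of size 3

Rbar = sum(max(s) - min(s) for s in subgroups) / len(subgroups)
sigma_hat = Rbar / d2           # estimated process standard deviation
halfwidth = 3 * sigma_hat / math.sqrt(n)

print("UCL =", round(target + halfwidth, 4), " LCL =", round(target - halfwidth, 4))
# With all twelve subgroups (Rbar = 0.1667), this reproduces the MINITAB
# limits UCL = 6.1706 and LCL = 5.8294 shown in Fig 22.4.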
The S-Chart
The objective of the original Shewhart chart is to determine the status of the process with respect to X̄, the mean value of the process/product variable of interest. But this is not the only process/product characteristic of interest. The average, X̄, may remain on target while the variability may have changed. There are cases of practical importance where the variability is the primary variable of interest, especially when we are concerned with detecting if the process variability has changed significantly.
Under these circumstances, the variable of interest, Y, is now S_X, the sample standard deviation, determined from the same random sample of size n used to obtain X̄. The probability model is obtained from the fact that, for a sample size of n,

S_X = √[∑_{i=1}^{n}(Xᵢ − X̄)²/(n − 1)]    (22.18)
TABLE 22.1: Length measurements (in inches) for samples of 6-inch nails

Sample    Length (in)
5         6.01, 6.17, 5.90
6         5.86, 6.03, 5.93
7         6.17, 5.81, 5.95
8         6.10, 6.09, 5.95
9         6.06, 6.06, 6.07
10        6.09, 6.07, 6.07
11        6.00, 6.10, 6.03
12        5.98, 6.01, 5.97
where σ²_X is the inherent process variance. It can be shown, first, that

E(S_X) = c(n)σ_X    (22.20)

where the sample-size-dependent constant c(n) is given by

c(n) = √[2/(n − 1)] Γ(n/2)/Γ((n − 1)/2)    (22.21)

It can also be shown that the standard deviation of S_X is

σ_{S_X} = σ_X√[1 − c²(n)]    (22.23)

so that:

P[(c(n)σ_X − 3σ_{S_X}) < S_X < (c(n)σ_X + 3σ_{S_X})] ≈ 0.99    (22.24)
from which the control limits for the S-chart are obtained as:

UCL = σ_X[c(n) + 3√(1 − c²(n))]    (22.25)

LCL = σ_X[c(n) − 3√(1 − c²(n))]    (22.26)

In practice, σ_X is unknown and is estimated from S̄, the average of the subgroup standard deviations,

S̄ = (1/k)∑_{i=1}^{k} Sᵢ    (22.27)

using σ̂_X = S̄/c(n); the control limits then become:

UCL = S̄[1 + (3/c(n))√(1 − c²(n))]    (22.28)

LCL = S̄[1 − (3/c(n))√(1 − c²(n))]    (22.29)
Whenever the computed LCL is negative, it is set to zero, for the obvious reason that the standard deviation is non-negative. Finally, these somewhat intimidating-looking computations are routinely carried out by computer programs. For example, the S-chart for the data used in Example 22.2 is shown here in Fig 22.5. It is obtained from MINITAB using the sequence Stat > Control Charts > Variables Charts for Subgroups > S; it shows that the process variability is itself reasonably steady.
It is typical to combine the X-bar and S charts to obtain the Xbar-S chart. This composite chart allows one to confirm that the process is both on-target (indicated by the Xbar component) and stable (indicated by the S component). It is possible for the process to be stable and on-target (the preferred state); stable but not on-target; not stable and not on-target; and, less likely (but not impossible), on-target but not stable. The combination Xbar-S chart allows the determination of which of these four possible states best describes the process.
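The constant c(n) in Eq (22.21) (usually tabulated as "c4" in SPC references) and the S-chart limits of Eqs (22.28)-(22.29) are easily computed; a minimal sketch, checked against the S̄ = 0.0873 reported in Fig 22.5:

import math

def c_n(n):
    # Eq (22.21)
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

def s_chart_limits(s_bar, n):
    w = (3.0 / c_n(n)) * math.sqrt(1.0 - c_n(n) ** 2)
    ucl = s_bar * (1.0 + w)                # Eq (22.28)
    lcl = max(0.0, s_bar * (1.0 - w))      # Eq (22.29); negative LCL set to zero
    return lcl, ucl

print(s_chart_limits(0.0873, 3))           # -> (0.0, ~0.2242), matching Fig 22.5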
Variations to the Xbar-S Chart: Xbar-R, and I & MR Charts

FIGURE 22.5: The S-chart for the 6-inch nails process data of Example 22.2.

Sometimes the process data sample size is not large enough to provide reasonable estimates of the standard deviation, S. In such cases, the sample range, R (the difference between the lowest and the highest ranked observations in the sample), is used as a measure of process variability. This gives rise
to the R-chart by itself, or the Xbar-R chart when combined with the Xbar chart. The same principles discussed previously apply: the chart is based on a probability model for the random variable, R; its expected value and its theoretical variance are used to obtain the control limits.
In fact, because the data for the process in Example 22.2 involve samples of size n = 3, it is questionable whether this sample size is sufficient for obtaining reliable estimates of the sample standard deviation, S. When n < 8, it is usually recommended to use the R chart instead.
The combination Xbar-R chart for the nails data of Example 22.2 is shown here in Fig 22.6. It is obtained from MINITAB using the sequence Stat > Control Charts > Variables Charts for Subgroups > Xbar-R. The most important points to note are: (i) the chart still indicates that the process variability is reasonably steady; however, (ii) the nominal value for R is almost twice that for S (this is expected, given the definition of the range R and its relation to the standard deviation, S); by the same token, the control limits are also wider (approximately double); nevertheless, (iii) the general characteristics of the R component of the chart are not much different from those of the S-chart obtained earlier and shown in Fig 22.5. Thus, in this case, the S and R variables show virtually the same characteristics.
FIGURE 22.6: The combination Xbar-R chart for the 6-inch nails process data of Example 22.2.

In many cases, especially common in chemical processes, only individual measurements are available at each sampling time. Under these circumstances, with sample size n = 1, one can certainly plot the individual measurements against the control limits, so that this time the variable Y is the actual
process measurement X, not the average; but with no other means available
for estimating intrinsic variability, it is customary to use the moving range,
defined as:

MRᵢ = |Xᵢ − Xᵢ₋₁|    (22.30)
the difference between consecutive observations, as a measure of the variability. This combination gives rise to the I and MR chart ("Individual and Moving Range"). The components of this chart are also determined using the same principles as before: the individual samples are assumed to come from a Gaussian distribution, providing the probability model for the "I" chart, from which the control limits are obtained, given an estimate of process variability, σ. Upon assuming that individual observations are mutually independent, the expected value and theoretical variance of the moving range are used to obtain the control limits for the "MR" chart. We shall return shortly to the issue of the independence assumption. For now, we note once more that the required computations are easily carried out with computer programs. The following example illustrates the I and MR chart for a polymer process.
Example 22.3: CONTROL CHART FOR ELASTOMER PROCESS
Ogunnaike and Ray (1994)¹ presented, in Chapter 28, hourly lab measurements of Mooney viscosity obtained for a commercial elastomer manufactured in a continuous process. The data set is reproduced here in Table 22.2. If the desired target Mooney viscosity value for this product is 50.0, determine whether or not the process is stable and on target.

¹B.A. Ogunnaike and W.H. Ray (1994). Process Dynamics, Modeling, and Control, Oxford University Press, NY.
TABLE 22.2: Hourly Mooney viscosity data

Time (hrs)   Mooney Viscosity      Time (hrs)   Mooney Viscosity
  1              49.8                14             50.3
  2              50.1                15             49.8
  3              51.1                16             50.8
  4              49.3                17             48.7
  5              49.9                18             50.4
  6              51.1                19             50.8
  7              49.9                20             49.6
  8              49.8                21             49.9
  9              49.7                22             49.7
 10              50.8                23             49.5
 11              50.7                24             50.5
 12              50.0                25             50.8
 13              50.0
FIGURE 22.7: The combination I-MR chart for the Mooney viscosity data.
Solution:
Because we only have individual observations at each sampling time, this calls for an I and MR chart. Such a chart is obtained from MINITAB using the sequence Stat > Control Charts > Variables Charts for Individuals > I-MR. The result is shown in Fig 22.7, which indicates that the process is in statistical control.
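Although the text relies on MINITAB, the I-chart limits are simple enough to sketch directly: the conventional moving-range estimate σ̂ = M̄R/d₂ (with d₂ = 1.128 for moving ranges of two consecutive points) reproduces the limits displayed in Fig 22.7:

data = [49.8, 50.1, 51.1, 49.3, 49.9, 51.1, 49.9, 49.8, 49.7, 50.8,
        50.7, 50.5, 50.0, 50.3, 49.8, 50.8, 48.7, 50.4, 50.8, 49.6,
        49.9, 49.7, 49.5, 50.5, 50.8]
center = 50.0         # the desired target is used as the center line
d2 = 1.128            # E(MR) = d2 * sigma for ranges of two points

mr = [abs(a - b) for a, b in zip(data[1:], data[:-1])]   # Eq (22.30)
mr_bar = sum(mr) / len(mr)                               # -> 0.725
sigma_hat = mr_bar / d2

print("MRbar =", round(mr_bar, 3))
print("I-chart limits:", round(center - 3 * sigma_hat, 3),
      round(center + 3 * sigma_hat, 3))
# -> 48.072 and 51.928, matching the LCL and UCL shown in Fig 22.7.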
The P-Chart
When the characteristic of interest is the proportion of defective items in a sample, the appropriate chart is known as the P-chart. If X is the random variable representing the number of defective items in a random sample of size n, then we know from Chapter 8 that X possesses a binomial distribution, in this case,

f(x) = C(n, x) θˣ(1 − θ)^{n−x}    (22.31)

where θ is the true but unknown population proportion of defectives. The maximum likelihood estimator of θ is:

P̂ = X/n    (22.32)
so that

E(P̂) = θ    (22.35)

σ²_P̂ = θ(1 − θ)/n    (22.36)

With data from k samples, each of size n, the long-term average proportion of defectives is:

p̄ = (1/k)∑_{i=1}^{k} pᵢ    (22.37)
These results can be used to construct the P-chart to monitor the proportion of defectives in a manufacturing process. The center line is the traditional long-term average, p̄, and the 3-sigma control limits are:

UCL = p̄ + 3√[p̄(1 − p̄)/n]    (22.39)

LCL = p̄ − 3√[p̄(1 − p̄)/n]    (22.40)

Once again, negative values of LCL are set to zero. The following example illustrates the P-chart.
Example 22.4: IDENTIFYING SPECIAL CAUSE IN MECHANICAL PENCIL PRODUCTION
A mechanical pencil manufacturer takes a sample of 10 every shift and tests the lead release mechanism. The pencil is marked defective if the lead release mechanism does not function as prescribed. Table 22.3 contains the results from 10 consecutive shifts during a certain week in the summer; it shows the sample size, the number of defective pencils identified, and the proportion defective. Obtain a control chart for the data and assess whether or not the manufacturing process is in control.
Solution:
The P-chart obtained from these data using MINITAB (sequence: Stat > Control Charts > Attributes Charts > P) is shown in Fig 22.8, where one point, the entry from the 9th shift, falls outside of the UCL. The MINITAB output is as follows:

Test Results for P Chart of Ndefective
TEST 1. One point more than 3.00 standard deviations from center line.
Test Failed at points: 9
TABLE 22.3: Number and proportion of defective mechanical pencils

Shift   Sample Size   Number Defective   Proportion Defective
 1          10               0                  0.0
 2          10               0                  0.0
 3          10               0                  0.0
 4          10               0                  0.0
 5          10               2                  0.2
 6          10               2                  0.2
 7          10               1                  0.1
 8          10               0                  0.0
 9          10               4                  0.4
10          10               0                  0.0
FIGURE 22.8: P-chart for the data on defective mechanical pencils: note the 9th
observation that is outside the UCL.
Upon further review, it was discovered that shift 9 (Friday morning) was when a set of new high school summer interns was being trained on how to run parts of the manufacturing process; the mistakes made were promptly rectified and the process returned to normal by the end of shift 10.
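The P-chart limits behind Fig 22.8 follow directly from Eqs (22.37)-(22.40); a quick sketch using the defect counts of Table 22.3:

import math

defectives = [0, 0, 0, 0, 2, 2, 1, 0, 4, 0]
n = 10

p_bar = sum(defectives) / (n * len(defectives))          # Eq (22.37): 0.09
half = 3 * math.sqrt(p_bar * (1 - p_bar) / n)            # Eqs (22.39)-(22.40)
ucl, lcl = p_bar + half, max(0.0, p_bar - half)

print(p_bar, round(ucl, 4), lcl)      # -> 0.09, 0.3615, 0 (matching Fig 22.8)
for i, x in enumerate(defectives, start=1):
    if x / n > ucl:
        print("shift", i, "is out of control")           # flags shift 9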
The C-Chart
When the process/product characteristic of interest is the number of defects per item (for example, the number of "inclusions" on a glass sheet of given area, as introduced in Chapter 1 and revisited several times in ensuing chapters), the appropriate chart is the C-chart. This chart, like the others, is developed on the basis of the appropriate probability model, which, in this case, is the Poisson model. This is because X, the random variable representing the number of defects per item, is Poisson-distributed, with pdf

f(x) = e^{−λ}λˣ/x!    (22.41)
=
=
(22.42)
(22.43)
(22.44)
= i=1
k
from where the 3-sigma control limits are obtained as:

$$UCL = \bar{c} + 3\sqrt{\bar{c}} \tag{22.45}$$

$$LCL = \bar{c} - 3\sqrt{\bar{c}} \tag{22.46}$$

setting LCL = 0 in place of negative values.
An example C-chart for the inclusions data introduced in Chapter 1 (Table
1.2) is shown in Fig 22.9, obtained from MINITAB using the sequence Stat
> Control Charts > Attributes Charts > C. The lone observation of 5
inclusions (in the 33rd sample) is flagged as out of limit; otherwise, the process
seems to be operating in control, with an average number of inclusions of
approximately 1, and an upper limit of 4 (see Eq (22.45) above).
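The same computation is even simpler for the C-chart, since Eqs (22.45) and (22.46) need only the average count. In the sketch below the counts are hypothetical stand-ins, not the actual inclusions data of Table 1.2:

    import numpy as np

    # C-chart limits (Eqs (22.45)-(22.46)) for illustrative count data.
    counts = np.array([0, 1, 2, 1, 0, 1, 5, 1, 0, 2, 1, 0])  # defects per item
    c_bar = counts.mean()                  # center line

    ucl = c_bar + 3 * np.sqrt(c_bar)       # Poisson: variance equals the mean
    lcl = max(c_bar - 3 * np.sqrt(c_bar), 0.0)

    flagged = np.where(counts > ucl)[0] + 1
    print(f"c-bar={c_bar:.3f}, UCL={ucl:.3f}, LCL={lcl}, flagged: {flagged}")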
FIGURE 22.9: C-chart for the inclusions data presented in Chapter 1, Table 1.2, and
discussed in subsequent chapters ($\bar{c}$ = 1.017, UCL = 4.042, LCL = 0): note the 33rd
observation that is outside the UCL; otherwise, the process appears to be operating in
statistical control.
If we recall the discussion in Chapter 15, especially Example 15.15, we note
that this C-chart is nothing but a visual, graphical version of the hypothesis
test carried out in that example. We concluded then that the process was on
target (at that time, at the 95% confidence level); we reach the same conclusion
with this chart, at the 99% confidence level.
22.3.4 Enhancements
Motivation
Basic SPC charts, as originally conceived, needed enhancing for several
reasons, the three most important being:
1. Sensitivity to small shifts;
2. Serial correlation;
3. Multivariate data.
It can be truly challenging for standard SPC charts to detect small changes.
This is because of the very low $\alpha$-risk ($\alpha$ = 0.003, compared with the $\alpha$ = 0.05 used
for hypothesis tests) chosen to prevent too many false out-of-control alarms. The
natural consequence is that the $\beta$-risk of failing to identify an out-of-control
situation increases. To illustrate, consider the Mooney viscosity data shown
in Table 22.2; if a step increase in Mooney viscosity of 0.7 occurs after sample
15 and persists, a sequence plot of both the original data and the shifted data
is shown in Fig 22.10, where the shift is clear. An I-chart for the shifted data
is shown in Fig 22.11.
959
51.5
51.0
Data
50.5
50.0
49.5
49.0
10
12 14
Index
16
18
20
22
24
FIGURE 22.10: Time series plot of the original Mooney viscosity data of Fig 22.7 and
Table 22.2, and of the shifted version showing a step increase of 0.7 after sample 15.
Even after specifying the population standard deviation as $\sigma$ = 0.5 (less than the value of approximately 0.62 used for the
original data in Fig 22.7), this chart is unable to detect the shift.
Techniques employed to improve the sensitivity to small changes include
the Western Electric Rules², the CUSUM (Cumulative Sum) chart, and the
EWMA (Exponentially Weighted Moving Average) chart, all of which will be
discussed shortly.
The issue of serial correlation is often a key characteristic of industrial
chemical processes where process dynamics are significant. Classical SPC assumes no serial correlation in process data, and that mean shifts occur only
due to infrequent special causes. The most direct way to handle this type of
process variability is Engineering/Automatic Process Control. We will also
introduce this briefly.
Finally, industrial processes are intrinsically multivariable, so that the data
used to track process performance come from several process variables, sometimes numbering in the hundreds. These process measurements are such that
if, for example, $y_1 = \{y_{11}, y_{12}, \ldots, y_{1n}\}$ represents the sequence of observations for one variable, say reactor temperature, and there are others just like it,
$y_k = \{y_{k1}, y_{k2}, \ldots, y_{kn}\}$, sequences from other variables, $k = 2, 3, \ldots, m$ (say
reactor pressure, agitator amps, catalyst flow rate, etc.), then the sequences $y_j$
and $y_\ell$ with $j \neq \ell$ are often highly correlated. Besides this, it is also impossible
to visualize the entire data collection properly with the usual single, individual-variable SPC charts. Special multivariate techniques for dealing with this
type of process variability will be discussed in Chapter 23.
²Western Electric Company (1956), Statistical Quality Control Handbook, 1st Edition,
Indianapolis, Indiana.
FIGURE 22.11: I-chart for the shifted Mooney viscosity data. Even with $\sigma$ = 0.5, it
is not sensitive enough to detect the step change of 0.7 introduced after sample 15.
CUSUM Charts

The CUSUM chart is based on the cumulative sum of the deviations of the observations from the target value, $\mu_0$:

$$S_n = \sum_{i=1}^{n}(X_i - \mu_0) \tag{22.47}$$
This quantity has the following distinctive characteristics: (i) random variations around the target manifest as a random walk, an accumulation of
small, zero-mean, random errors; on the other hand, (ii) if there is a shift in the
mean value, no matter how slight, and it persists, this event will eventually
translate to a noticeable change in character: an upward trend for a positive
shift, or a downward trend for a negative shift. As a result of the persistent
accumulation, the slope of these trends will be related to the magnitude of
the change.
CUSUM charts, of which there are two types, are based on the probabilistic
characterization of the random variable, Sn . The one-sided CUSUM charts are
plotted in pairs: an upper CUSUM to detect positive shifts (an increase in the
process variable value), and the lower CUSUM to detect negative shifts. The
control limits, UCL and LCL, are determined in the usual fashion on the basis
of the appropriate sampling distribution (details not considered here). This
version is usually preferred because it is easier to construct and to interpret. It
is also possible to obtain a single two-sided CUSUM chart. Such a chart uses
a so-called V-mask instead of the typical 3-sigma control limits. While the
intended scope of this discussion does not extend beyond this brief overview,
additional details regarding the CUSUM chart are available, for example, in
Page (1961)³ and Lucas (1976)⁴.
Fig 22.12 shows the two one-sided CUSUM charts corresponding directly to the I-chart of Fig 22.11, with the standard deviation specified
as 0.5, and the target as 50. (The chart is obtained from MINITAB with
the sequence: Stat > Control Charts > Time-Weighted Charts > CUSUM.)
The upper CUSUM for detecting positive shifts is represented with dots;
the lower CUSUM with diamonds; and the non-conforming data with squares.
Note that very little activity is manifested in the lower CUSUM. This is in
contrast to the upper CUSUM, where the influence of the introduced step
change is identified after sample 18, barely three samples after its introduction. Where the I-chart based on individual observations is insensitive to such
small changes, the amplification effect of the error accumulation implied in Eq
(22.47) has made this early detection possible.
For the sake of comparison, Fig 22.13 shows the corresponding one-sided
CUSUM charts for the original Mooney viscosity data, using the same characteristics as the CUSUM charts in Fig 22.12; no point is identified as non-conforming, consistent with the earlier analysis of the original data.
EWMA Charts

Rather than plot the individual observation $X_i$, or the cumulative sum
shown in Eq (22.47), consider instead the following variable, $Z_i$, defined by:

$$Z_i = wX_i + (1-w)Z_{i-1}; \quad 0 \le w \le 1 \tag{22.48}$$

³E.S. Page (1961).
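Although the full discussion of the EWMA control limits falls outside the excerpted material, the recursion of Eq (22.48) is simple to implement; the time-varying limit used below is the standard EWMA variance result, which produces the "staircase" visible in Fig 22.14. The data are again simulated surrogates:

    import numpy as np

    # EWMA chart sketch: Eq (22.48) with w = 0.2, as in Fig 22.14.
    rng = np.random.default_rng(2)
    target, sigma, w = 50.0, 0.5, 0.2
    x = target + sigma * rng.standard_normal(25)
    x[15:] += 0.7                       # step increase of 0.7 after sample 15

    z = target                          # initialize Z_0 at the target
    for i, xi in enumerate(x, start=1):
        z = w * xi + (1 - w) * z        # Eq (22.48)
        # limits widen toward +/- 3*sigma*sqrt(w/(2-w)) = +/- 0.5 as i grows
        lim = 3 * sigma * np.sqrt((w / (2 - w)) * (1 - (1 - w) ** (2 * i)))
        if abs(z - target) > lim:
            print(f"shift signalled at sample {i}")
            break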
FIGURE 22.12: Two one-sided CUSUM charts for the shifted Mooney viscosity data.
The upper chart uses dots; the lower chart uses diamonds; the non-conforming points are
represented with squares. With the same $\sigma$ = 0.5, the step change of 0.7 introduced
after sample 15 is identified after sample 18. Compare with the I-chart in Fig 22.11.
FIGURE 22.13: Two one-sided CUSUM charts for the original Mooney viscosity data
using the same characteristics as those in Fig 22.12. The upper chart uses dots; the
lower chart uses diamonds; there are no non-conforming points.
FIGURE 22.14: EWMA chart for the shifted Mooney viscosity data, with w = 0.2.
Note the "staircase" shape of the control limits for the earlier data points. With the same
$\sigma$ = 0.5, the step change of 0.7 introduced after sample 15 is detected after sample 18.
The non-conforming points are represented with squares. Compare with the I-chart
in Fig 22.11 and the CUSUM charts in Fig 22.12.
The corresponding EWMA chart for the original Mooney viscosity data is shown in Fig 22.15; as expected, no
non-conforming points are identified.
22.4 Process Control

22.4.1 Preliminary Considerations

For chemical processes, where what is to be controlled are process variables such as temperature, pressure, flow rate, and liquid level, in addition
to product characteristics such as polymer viscosity, co-polymer composition,
mole fraction of light material in an overhead distillation product, etc., the
concept of process control takes on a somewhat different meaning. Let us
begin by representing the variable to be controlled as:
$$y(k) = \eta(k) + e(k) \tag{22.50}$$

where $\eta(k)$ is the true but unknown value of the process variable; $e(k)$ is noise,
consisting of measurement error, usually random, and other unpredictable
components; and $k = 1, 2, 3, \ldots$ is a sampling time index.
The objective in chemical process control is to maintain the process variable as close as possible to its desired target value $y_d(k)$, in the face of possible systematic variations in the true value $\eta(k)$, and inherent random variations in the observed measurements $y(k)$.
FIGURE 22.15: The EWMA chart for the original Mooney viscosity data using the
same characteristics as in Fig 22.14. There are no non-conforming points.
There are two fundamentally different
perspectives on this problem, each leading naturally to a different approach
philosophy:
1. The Statistical Process Control (SPC) perspective; and
2. The Engineering (or Automatic) Process Control (APC) perspective.
22.4.2 The SPC Perspective

The SPC techniques discussed previously take the following view of the
control problem: $\eta(k)$ tends to be constant, on target, with infrequent abrupt
shifts in value; the shifts are attributable to special causes to be identified
and eliminated; and the noise term, $e(k)$, tends to behave like independently and
identically distributed zero-mean, Gaussian random variables. Finally, taking
control action is costly and should be done only when there is sufficient evidence in support of this need for action.
Taken to its logical conclusion, the implication of such a perspective is the
following approach philosophy: observe each $y(k)$ and analyze for infrequent
shifts; take action only if a shift is detected with a pre-specified degree of
confidence. It is therefore not surprising that the tools of SPC are embodied
in the charts presented above: Shewhart, CUSUM, EWMA, etc. But it must
be remembered that the original applications were in the discrete-parts manufacturing industry, where the assumptions regarding the problem components
as viewed from the SPC perspective are more likely to be valid.
22.4.3 The APC Perspective
The alternative APC perspective is that, left unattended, $\eta(k)$ will wander
and not remain on-target because of frequent, persistent, unavoidable and unmeasured/unmeasurable external disturbances that often arise from unknown
sources; that regardless of underlying statistics, the contribution of the randomly varying noise term, $e(k)$, to the observation, $y(k)$, is minor compared to
the contributions due to natural process dynamics and uncontrollable external
disturbance effects. For example, the effect of outside temperature variations
on the temperature profile in a refinery's distillation column that rises 130
ft into the open air will swamp any variations due to random thermocouple
measurement errors. Finally, there is little or no cost associated with taking
control action. For example, in response to an increase in the summer afternoon temperature, it is not costly to open a control valve to increase the
cooling water flow rate to a distillation column's condenser; what is costly
is not increasing the cooling and thereby allowing expensive light overhead
material to be lost in the vent stream.
The natural implication of such a perspective is that every observed deviation of each $y(k)$ from the desired $y_d(k)$ is considered significant; as a result,
control action is implemented automatically at every sampling instant,
$k$, according to a pre-designed control equation,

$$u(k) = f(\epsilon(k)) \tag{22.51}$$

where

$$\epsilon(k) = y_d(k) - y(k) \tag{22.52}$$

is the feedback error indicating the discrepancy between the observation
$y(k)$ and its desired target value, $y_d(k)$, and $f(\cdot)$ is a control law that is based
on specific design principles.
For example, the standard (continuous-time) Proportional-Integral-Derivative (PID) controller operates according to

$$u(t) = K_c\left[\epsilon(t) + \frac{1}{\tau_I}\int_0^t \epsilon(v)\,dv + \tau_D\frac{d\epsilon(t)}{dt}\right] \tag{22.53}$$

where $K_c$, $\tau_I$ and $\tau_D$ are controller parameters chosen to achieve desired controller performance (see, for example, Ogunnaike and Ray (1994), referenced
earlier).
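A discrete-time approximation of Eq (22.53) (the "velocity form," obtained with a sampling interval Δt) shows how Eqs (22.51)-(22.53) are actually implemented; everything in the usage loop below, including the first-order process model, is a hypothetical stand-in chosen purely for illustration:

    # Velocity-form discrete PID: one control move from recent feedback errors.
    def pid_step(e, e1, e2, u1, Kc, tau_i, tau_d, dt):
        du = Kc * ((e - e1) + (dt / tau_i) * e + (tau_d / dt) * (e - 2 * e1 + e2))
        return u1 + du

    # usage: drive a stand-in first-order process toward a set point of 50
    y, u, e1, e2 = 48.0, 0.0, 0.0, 0.0
    for _ in range(30):
        e = 50.0 - y                         # feedback error, Eq (22.52)
        u = pid_step(e, e1, e2, u, Kc=0.8, tau_i=5.0, tau_d=0.0, dt=1.0)
        e1, e2 = e, e1
        y = 0.7 * y + 0.3 * (48.0 + u)       # hypothetical process dynamics
    print(round(y, 2))                       # close to the 50.0 target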
The engineering/automatic control philosophy, therefore, is to transfer variability from where it will "hurt" to where it will not, and as quickly as
possible. This is to be understood in the sense that variability transmitted to
the process variable, $y(k)$ (for example, the distillation column's temperature
profile), if allowed to persist, will adversely affect ("hurt") product quality; by
adjusting the manipulated variable $u(k)$ (for example, cooling water flow rate)
in order to restore $y(k)$ to its desired value, the variability that would have
been observed in $y(k)$ is thus transferred to $u(k)$, which usually is not a problem. (Variations in cooling water flow rate are typically not much of a concern.)
1. With a simple process model for $y(k)$ (Eq (22.54)), and with $e(k) \sim N(0, \sigma^2)$, the minimum variance control law is shown
to be equivalent to the Shewhart charting paradigm;
2. If, instead of the zero-mean Gaussian noise model for $e(k)$ above, the
disturbance model is

$$d(k) = d(k-1) + e(k) - \theta e(k-1) \tag{22.55}$$

and the process model is first order,

$$\eta(k) = a_1\eta(k-1) + b_1 u(k-1) + d(k) \tag{22.56}$$

or second order,

$$\eta(k) = a_1\eta(k-1) + a_2\eta(k-2) + b_1 u(k-1) + b_2 u(k-2) + d(k) \tag{22.57}$$

where the disturbance model is as in Eq (22.55) above, then the minimum variance controller is shown to be exactly the discrete-time version
of the PID controller shown in Eq (22.53).
Additional details concerning these matters lie outside the intended scope
of this chapter and the interested reader may consult the indicated reference.
One aspect of that discussion that will be summarized here briefly concerns
deciding when to choose one approach or the other.
22.4.4 SPC or APC

The following three basic process and control attributes play central roles
in choosing between the SPC approach and the APC approach:
1. Sampling Interval: This refers to how frequently the process output variable is measured; it is a process attribute best considered relative to the
natural process response time. If a process with natural response time on
the order of hours is sampled every minute, the dynamic characteristics
of the process will be evident in the measurements and such measurements will be correlated in time. On the other hand, data sampled every
hour from a process with a natural response time of minutes will most
likely not show any dynamic characteristics and the observations are
more likely to be uncorrelated.
2. Noise (or disturbance) character: This refers to the random or unexplainable part of the data. Is this due mainly to purely random variation
with an occasional special cause shift, or is it due to systematic drifts
and frequent special cause shifts?
3. Cost of implementing control action: Are control adjustments costly or
mostly cost-free? This is the difference between shutting down a wafer-polishing machine to adjust its settings versus opening and closing a
control valve on a cold water supply line to a reactor's cooling jacket.
22.5 Process and Parameter Design

If acceptance sampling is taking action to rectify quality issues after manufacturing is complete, and process control is taking action during manufacturing, process and parameter design is concerned with taking action preemptively, before manufacturing. According to this paradigm, wherever possible
(and also economically feasible), operating parameters should be chosen to
minimize the effects of uncontrollable factors that influence product quality
variability. (If it is avoidable and the cost is not prohibitive, why expose a
distillation column to the elements?)
22.5.1 Basic Principles

The primary ideas and concepts, due to Genichi Taguchi (born 1924),
a Japanese engineer and statistician, involve using design of experiments to
improve process operation ahead of time, by selecting those process parameters
and operating conditions that are most conducive to robust process operation.
With this paradigm, process variables are classified as follows:
1. Response variables: The variables of interest; quality indicators;

2. Control factors: The variables that affect the response; their levels can
be decided by the experimenter, or process operator. (The reader familiar with chemical process control will recognize these variables as the
manipulated variables);

3. Noise factors: The variables that affect the response but are not controllable. (Again, readers familiar with chemical process control will recognize these as the disturbance variables.)
Taguchi techniques are concerned with answering the question:
At what level of the control factors is the process response least susceptible to the effect of the noise factors?
22.5.2 A Theoretical Rationale

Let Y represent the random variable of interest (the quality characteristic),
and let $y^*$ denote its desired target value; define

$$D = Y - y^* \tag{22.58}$$

as the deviation of this random variable from the target. D is itself a random
variable with expected value, $\omega$, i.e.,

$$E(D) = \omega \tag{22.59}$$

$$\omega = E(D) = E(Y) - y^* = \mu_Y - y^* \tag{22.60}$$

Define the expected squared deviation,

$$L(y) = E(D^2) \tag{22.61}$$

⁶Taguchi, G., and S. Konishi, (1987). Orthogonal Arrays and Linear Graphs, Dearborn,
MI: ASI Press.
⁷Ross, P.J., (1996). Taguchi Techniques for Quality Engineering, McGraw Hill, NY.
as the loss incurred by Y deviating from the desired target (in this case, a
squared-error, or quadratic, loss function). By introducing Eq (22.58) into Eq
(22.61), we obtain:

$$L(y) = E\left[(Y - y^*)^2\right] = E\left[(Y - \mu_Y) + (\mu_Y - y^*)\right]^2 = \sigma_Y^2 + \underbrace{\omega^2}_{\text{Bias}} \tag{22.62}$$
Traditional SPC and/or engineering APC is concerned with, and can only
deal with, the second term, by attempting to drive $\mu_Y$ to $y^*$ and hence eliminate the bias. Even if this can be achieved perfectly, the first term, $\sigma_Y^2$, still
persists. Alternatively, this decomposition shows that the minimum achievable value for the loss function (achieved when $\mu_Y = y^*$) is $\sigma_Y^2$. Thus, even
with a perfect control scheme, $L(y) = \sigma_Y^2$. Taguchi's methodology is aimed at
finding parameter settings that minimize this first term, by design. When the
process is therefore operated at these optimum conditions, we can be sure that
the minimum loss achieved with effective process control is the best possible.
Without this, the manufacturer, even with the best control system, will incur
product quality losses that cannot be compensated for in any other way.
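The decomposition in Eq (22.62) is easy to confirm numerically; in this sketch the target and the distribution of Y are arbitrary choices:

    import numpy as np

    # Check L(y) = sigma_Y^2 + omega^2 (Eq (22.62)) with simulated data.
    rng = np.random.default_rng(3)
    y_target = 75.0
    Y = rng.normal(loc=76.0, scale=2.0, size=200_000)  # mu_Y = 76, sigma_Y = 2

    loss = np.mean((Y - y_target) ** 2)                # E[(Y - y*)^2]
    decomp = Y.var() + (Y.mean() - y_target) ** 2      # sigma_Y^2 + bias^2
    print(round(loss, 2), round(decomp, 2))            # both approximately 5.0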
22.6 Summary and Conclusions

As with the previous chapter, this chapter has also been primarily concerned with showcasing one more application of probability and statistics that
has evolved into a full-scale subject matter in its own right. In this particular
case, it is arguable whether the unparalleled productivity enjoyed by modern manufacturing would be possible without the tools of quality assurance and
control, making the subject matter of this chapter one of the most influential
applications of probability and statistics of the last century.
The presentation in three distinct units was a deliberate attempt to place
in some historical perspective the techniques that make up the core of quality
assurance and control. Yet, fortuitously (or not), this demarcation also happens
to coincide precisely with where along the manufacturing process time-line the
technique in question is applicable. Thus, what we discussed first, acceptance
sampling, with its post-production focus and applicability, is almost entirely
a thing of the past (at least as a stand-alone quality assurance strategy). Our
subsequent discussion of process control, the "during-production" strategy,
covered the historical and modern incarnations of control charts, augmented
with a brief summary of the automatic control paradigm, which included
a discussion of how the two apparently opposite philosophies are really two
perspectives of the same problem. In the final unit, we provided only the
briefest of glances at the Taguchi techniques of parameter design, the pre-production strategy, choosing instead to emphasize the basic principles and
rationale behind the techniques. Regardless of the successes of pre-production
designs, however, process control will forever be an intrinsic part of manufacturing; the implementation ideas and techniques may advance, but the basic
concept of monitoring process performance and making real-time, in-process
adjustments to maintain control in the face of unavoidable, unpredictable, and
potentially destabilizing variability will always be a part of modern manufacturing.
We note, in closing, that because these quality control techniques arose
from industrial needs, and were therefore developed exclusively for industrial
manufacturing processes, they are so completely enmeshed with industrial
practice that acquiring a true practical appreciation outside of the industrial
environment is virtually impossible. To approximate the industrial experience
of applying these techniques (especially to experience, first-hand, the real-time, sequential-in-time data structure that is intrinsic to these methods),
we offer a few project assignments here in place of the usual exercises and
application problems.
REVIEW QUESTIONS
1. To what objective is the subject matter of quality assurance and control devoted?
2. What characteristic of manufacturing processes makes quality assurance and control mandatory?
3. What are the three problems associated with assuring the quality of mass-produced products? Which one did traditional quality assurance focus on?
PROJECT ASSIGNMENTS
1. Tracking the Dow
Even though many factors (some controllable, some not; some known, some
unknown) contribute to the daily closing value of the Dow Jones Industrial Average
(DJIA) index, it has been suggested that, at a very basic level, the change in closing
value from day to day in this index is distributed approximately as a zero-mean
random variable. Precisely what the distribution ought to be remains a matter of
some debate.

Develop an SPC chart to track $\Delta(k) = \xi(k) - \xi(k-1)$, where $\xi(k)$ is the closing
value of the Dow average on day $k$, and $\xi(k-1)$ is the previous day's value. From
historical data during a typical period when the markets could be considered stable, determine base values for $\mu$ and $\sigma$, or else assume that $\mu = 0$ theoretically, so
that the chart will be used to identify any systemic departure from this postulated
central value. Use the value estimated for $\sigma$ (and a postulated probability model)
to set the control limits objectively; track $\Delta(k)$ for 2 months with this chart. Should
any point fall out of limits during this period, determine assignable causes where
possible, or postulate some. Present your results in a report.
Here are some points to consider about this project:

1. The I & MR and other similar charts are based on an implicit Gaussian
distribution assumption. There have been arguments that financial variables
such as $\Delta(k)$ are better modeled by the heavier-tailed Cauchy distribution (see
Tanaka-Yamawaki (2003)⁸ and Problem 18.16 in Chapter 18). Consider this
in setting control limits and in using the limits to decide which deviations
are to be considered as indicative of out-of-control fluctuations in the Dow
average.

2. A possible alternative to consider is the EWMA chart, which, as a moving
average, is more likely to dampen excessive, but still typical, fluctuations.
3. The DJIA is not the only index of the financial markets; in fact, many analysts
argue that it is too narrow, and that the Standard and Poor's (S&P) 500 index
provides a better gauge of the market. If time permits, consider a second
chart simultaneously for the S&P 500 and assess whether or not the two
indexes exhibit similar characteristics.

⁸Mieko Tanaka-Yamawaki, (2003). "Two-phase oscillatory patterns in a positive feedback
agent model," Physica A, 324, 380–387.
Chapter 23
Introduction to Multivariate Analysis
23.1 Multivariate Probability Models

23.1.1 Basic Concepts
A multivariate probability model is the joint pdf of the multivariate random variable (or random vector) $\mathbf{X} = (X_1, X_2, \ldots, X_n)$. Conceptually, it is
a direct extension of the single-variable pdf, $f(x)$. As defined in Chapter 5,
each component $X_i$ of the random vector is itself a random variable with its
own marginal pdf, $f_i(x_i)$; and in the special case when these component random variables are independent, the joint pdf is obtained as a product of these
individual marginal pdfs, i.e.,

$$f(\mathbf{x}) = \prod_{i=1}^{n} f_i(x_i) \tag{23.1}$$
as we saw in Chapters 13 and 14 while discussing sampling and estimation theory. In general, however, these constituent elements Xi are not independent,
and the probability model will be more complex.
Here are some practical examples of multivariate random variables:
1. Student SAT scores: The SAT score for each student is a triplet of numbers: V , the score on the verbal portion, Q, the score on the quantitative
portion, and W , the score on the writing portion. For a population of
students, the score obtained by any particular student is therefore a
three-dimensional random variable with components X1 , X2 and X3
representing the individual scores on the verbal, quantitative and writing portions of the test respectively. Because students who do well in
the verbal portion also tend to do just as well in the writing portion, X1
will be correlated with X3 . The total score, T = X1 + X2 + X3 is itself
also a random variable.
2. Product quality characterization of glass sheets: Consider the case where
the quality of manufactured glass sheets sold specifically into certain
markets is characterized in terms of two quantities: X1, representing the
number of inclusions found on the sheet (see Chapter 1); X2, representing warp, the extent to which the glass sheet is not flat (measured as
an average curvature angle). This product quality variable is therefore a
two-dimensional random variable. Note that one of the two components
is discrete while the other is continuous.
3. Market survey: Consider a market evaluation of several new products
against their respective primary incumbent competitors: each subject
participating in the market survey compares two corresponding products
and gives a rating of +1 to indicate a preference for the new challenger, 0 if
indifferent, and −1 if the incumbent is preferred. The result of the market
survey for each new product is the three-dimensional random variable
survey for each new product is the three-dimensional random variable
with components X1 , the number of preferences for the new product,
X2 , the number of indierents, and X3 , the number of preferences for
the incumbent.
The multinomial model is an example of a multivariate probability model; it
was presented in Chapter 8, as a direct extension of the binomial probability
model. We now present some other important multivariate probability models.
A more complete catalog is available in Kotz et al. (2000)1
23.1.2 The Multivariate Normal Distribution

The $p$-dimensional (multivariate) normal random vector $\mathbf{X}$, denoted $\mathbf{X} \sim MN_p(\boldsymbol{\mu}, \mathbf{\Sigma})$, has the joint pdf

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}|\mathbf{\Sigma}|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\} \tag{23.2}$$

¹S. Kotz, N. Balakrishnan, and N. L. Johnson, (2000). Continuous Multivariate Distributions, Wiley, New York.
where $\boldsymbol{\mu}$ and $\mathbf{\Sigma}$ are, respectively, the mean vector and the covariance matrix
of the random vector, $\mathbf{X}$, defined by:

$$\boldsymbol{\mu} = E(\mathbf{X}) \tag{23.3}$$

$$\mathbf{\Sigma} = E\left[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^T\right] \tag{23.4}$$

In the bivariate case ($p = 2$),

$$\boldsymbol{\mu} = \begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix} \tag{23.5}$$

$$\mathbf{\Sigma} = \begin{pmatrix}\sigma_1^2 & \sigma_{12}\\ \sigma_{21} & \sigma_2^2\end{pmatrix} \tag{23.6}$$

with $\rho = \sigma_{12}/(\sigma_1\sigma_2)$ as the correlation coefficient (Eq (23.7)), and the pdf reduces to

$$f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\{-U\} \tag{23.8}$$

with

$$U = \frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2}\right] \tag{23.9}$$
Fig 23.1 shows plots of the bivariate Gaussian distribution for $\rho = 0$ (top) and
$\rho = 0.9$ (bottom), respectively. When $\rho = 0$, the two random variables are
uncorrelated and the distribution is symmetric in all directions; when $\rho = 0.9$,
the random variables are strongly positively correlated and the distribution is
narrow and elongated along the diagonal.
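The bivariate expressions of Eqs (23.8)-(23.9) can be checked against the general form of Eq (23.2); the following sketch does so at one arbitrary point (all numerical values are illustrative):

    import numpy as np

    # Bivariate Gaussian pdf (Eqs (23.8)-(23.9)) vs the general Eq (23.2).
    def bvn_pdf(x1, x2, mu1, mu2, s1, s2, rho):
        u = ((x1 - mu1) ** 2 / s1 ** 2
             - 2 * rho * (x1 - mu1) * (x2 - mu2) / (s1 * s2)
             + (x2 - mu2) ** 2 / s2 ** 2) / (2 * (1 - rho ** 2))
        return np.exp(-u) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho ** 2))

    mu = np.array([0.0, 0.0])
    s1, s2, rho = 1.0, 1.0, 0.9
    Sigma = np.array([[s1 ** 2, rho * s1 * s2], [rho * s1 * s2, s2 ** 2]])
    x = np.array([0.5, 0.4])

    d = x - mu   # Eq (23.2) with p = 2
    general = np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / (
        2 * np.pi * np.sqrt(np.linalg.det(Sigma)))
    print(np.isclose(bvn_pdf(*x, *mu, s1, s2, rho), general))   # True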
981
U 0
X2
X1
U 0.9
X2
X1
FIGURE 23.1: Examples of the bivariate Gaussian distribution where the two random
variables are uncorrelated ($\rho = 0$) and strongly positively correlated ($\rho = 0.9$).
23.1.3 The Wishart Distribution

Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be independent $MN_p(\mathbf{0}, \mathbf{\Sigma})$ random vectors, and consider the $p \times p$ random matrix

$$\mathbf{V} = \sum_{i=1}^{n}\mathbf{X}_i\mathbf{X}_i^T \tag{23.10}$$

$\mathbf{V}$ is said to have a Wishart distribution, $W_p(\mathbf{\Sigma}, n)$, with $n$ degrees of freedom, and pdf

$$f(\mathbf{V}) = \frac{|\mathbf{V}|^{(n-p-1)/2}\exp\left\{-\frac{1}{2}Tr(\mathbf{\Sigma}^{-1}\mathbf{V})\right\}}{2^{np/2}|\mathbf{\Sigma}|^{n/2}\Gamma_p(n/2)} \tag{23.11}$$

where $\Gamma_p(\cdot)$ is the (multivariate) Gamma function, $|\cdot|$ indicates the matrix determinant, and
$Tr(\cdot)$ the trace of the matrix.
The Wishart distribution is a multivariate generalization of the $\chi^2$ distribution. For the single variable case, where $p = 1$ and $\Sigma = 1$, the Wishart
distribution reduces to the $\chi^2(n)$ distribution. The expected value of $\mathbf{V}$ is:

$$E(\mathbf{V}) = n\mathbf{\Sigma} \tag{23.12}$$
The role played by the $\chi^2$ distribution in univariate
statistical analysis is played by the Wishart distribution in multivariate analysis. Additional information about this distribution is available in Mardia, et
al. (1979)².
23.1.4 Hotelling's T² Distribution

Let the vector $\mathbf{x}$ be a realization of a $p$-variate Gaussian random variable, $MN_p(\boldsymbol{\mu}, \mathbf{\Sigma})$, and let $\mathbf{S}$ be the sample covariance matrix obtained from
$n$ samples of the $p$ elements of this vector; i.e.,

$$\mathbf{S} = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T \tag{23.13}$$

where

$$\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i \tag{23.14}$$

so that $(n-1)\mathbf{S} \sim W_p(\mathbf{\Sigma}, n-1)$. The statistic

$$T^2 = n(\bar{\mathbf{x}} - \boldsymbol{\mu})^T\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) \tag{23.15}$$

is said to have a Hotelling $T^2$ distribution; it is related to the $F$-distribution according to

$$\frac{n-p}{p(n-1)}T^2 \sim F(p, n-p) \tag{23.16}$$
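A brief sketch of Eqs (23.13)-(23.16) in action for simulated data (the dimension, sample size and hypothesized mean are arbitrary; scipy supplies the F-distribution tail area):

    import numpy as np
    from scipy import stats

    # Hotelling T^2 for a simulated sample against a hypothesized mean mu0.
    rng = np.random.default_rng(5)
    p, n = 3, 40
    mu0 = np.zeros(p)
    X = rng.multivariate_normal(mu0, np.eye(p), size=n)

    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                 # Eq (23.13)
    T2 = n * (xbar - mu0) @ np.linalg.solve(S, xbar - mu0)   # Eq (23.15)

    F = T2 * (n - p) / (p * (n - 1))            # Eq (23.16): F(p, n-p)
    print(round(T2, 3), round(1 - stats.f.cdf(F, p, n - p), 3))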
23.1.5 The Wilks Lambda Distribution

Let $\mathbf{U}$ and $\mathbf{V}$ be two independent random matrices with
respective Wishart distributions, $W_p(\mathbf{I}, n)$ and $W_p(\mathbf{I}, m)$; i.e., each associated
covariance matrix is the identity matrix, $\mathbf{I}$. Then the ratio $\Lambda$, defined as:

$$\Lambda = \frac{|\mathbf{U}|}{|\mathbf{U}| + |\mathbf{V}|} \tag{23.17}$$

has the Wilks $\Lambda(p, n, m)$ distribution. It can be shown (see the Mardia et
al. reference (Reference 2)) that this distribution can be obtained as the
distribution of a product of independent beta $B(\alpha, \beta)$ random variables $\lambda_i$,
where

$$\alpha = \frac{m+i-p}{2}; \quad \beta = \frac{p}{2} \tag{23.18}$$

for $m \ge p$; i.e., if $\lambda_i \sim B(\alpha, \beta)$, then

$$\prod_{i=1}^{m}\lambda_i \sim \Lambda(p, n, m) \tag{23.19}$$
What the $F$-distribution is to the Student's $t$-distribution in univariate analysis, the Wilks $\Lambda$ distribution is to Hotelling's $T^2$ distribution.
23.1.6 The Dirichlet Distribution

The Dirichlet distribution is the multivariate generalization of the beta distribution. Its pdf is of the form

$$f(\mathbf{x}) = \frac{\Gamma(\alpha)}{\prod_{i=1}^{k+1}\Gamma(\alpha_i)}\left(\prod_{i=1}^{k}x_i^{\alpha_i-1}\right)\left(1-\sum_{i=1}^{k}x_i\right)^{\alpha_{k+1}-1}; \quad x_1 + x_2 + \cdots + x_k < 1 \tag{23.23}$$

with $\alpha = \sum_{i=1}^{k+1}\alpha_i$. The means, variances and covariances of its components are:

$$E(X_i) = \frac{\alpha_i}{\alpha} \tag{23.24}$$

$$Var(X_i) = \frac{\alpha_i(\alpha-\alpha_i)}{\alpha^2(\alpha+1)} \tag{23.25}$$

$$Cov(X_i, X_j) = \frac{-\alpha_i\alpha_j}{\alpha^2(\alpha+1)} \tag{23.26}$$
23.2 Multivariate Data

Data from $m$ variables, each sampled $n$ times, may be assembled into an $m \times n$ data matrix in which row $i$ contains the $n$ observations of variable $X_i$:

$$\begin{pmatrix}\mathbf{X}_1\\ \mathbf{X}_2\\ \vdots\\ \mathbf{X}_m\end{pmatrix} = \begin{pmatrix}x_{11} & x_{12} & \ldots & x_{1n}\\ x_{21} & x_{22} & \ldots & x_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ x_{m1} & x_{m2} & \ldots & x_{mn}\end{pmatrix} \tag{23.27}$$
23.3 Principal Components Analysis

23.3.1 Data Preconditioning
PCA is scale-dependent, with larger numerical values naturally accorded
more importance, whether deserved or not. To eliminate any such undue influence (especially those arising from differences in the units in which different
variables are measured), each data record can be mean-centered and scaled
(i.e., normalized) prior to carrying out PCA. But this is problem dependent.

Let the original data set consist of $n$ column vectors $x_1, x_2, \ldots, x_n$, each
containing $m$ samples, giving rise to the raw data matrix. If the mean and
standard deviation for column $i$ are $\bar{x}_i$ and $s_i$, then each variable $i = 1, 2, \ldots, n$
is normalized as follows:

$$\tilde{x}_i = \frac{x_i - \bar{x}_i}{s_i} \tag{23.28}$$

The resulting $m \times n$ pre-treated data matrix, $\mathbf{X}$, consists of columns of data
each with zero mean and unit variance.
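In code, the normalization of Eq (23.28) amounts to two lines per column; here is a minimal sketch (the function name and test matrix are ours):

    import numpy as np

    # Autoscaling (Eq (23.28)): mean-center and scale each column.
    def autoscale(X_raw):
        xbar = X_raw.mean(axis=0)           # column means
        s = X_raw.std(axis=0, ddof=1)       # column standard deviations
        return (X_raw - xbar) / s, xbar, s

    X_raw = np.array([[1.0, 200.0], [2.0, 240.0], [3.0, 190.0], [4.0, 250.0]])
    X, xbar, s = autoscale(X_raw)
    print(X.mean(axis=0).round(12), X.std(axis=0, ddof=1).round(12))  # 0s, 1s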
Problem Statement

We begin by contemplating the possibility of an orthogonal decomposition
of $\mathbf{X}$ by expanding in a set of $n$-dimensional orthonormal basis vectors, $\mathbf{p}_1$,
$\mathbf{p}_2$, $\mathbf{p}_3$, $\ldots$, $\mathbf{p}_n$, according to:

$$\mathbf{X} = \mathbf{t}_1\mathbf{p}_1^T + \mathbf{t}_2\mathbf{p}_2^T + \cdots + \mathbf{t}_n\mathbf{p}_n^T \tag{23.29}$$

in such a way that

$$\mathbf{p}_i^T\mathbf{p}_j = \begin{cases}1 & i = j\\ 0 & i \neq j\end{cases} \tag{23.30}$$

along with

$$\mathbf{t}_i^T\mathbf{t}_j = \begin{cases}\lambda_i & i = j\\ 0 & i \neq j\end{cases} \tag{23.31}$$

The expansion is to be truncated after $k < n$ terms so that, if the residual

$$\sum_{i=k+1}^{n}\mathbf{t}_i\mathbf{p}_i^T \tag{23.33}$$

contains only random noise, such an expansion would have then provided a $k$-dimensional reduction of the data. It implies that $k$ components are sufficient
to capture all the useful variation contained in the data matrix.

The basis vectors, $\mathbf{p}_1$, $\mathbf{p}_2$, $\mathbf{p}_3$, $\ldots$, $\mathbf{p}_k$, are commonly referred to as loading
vectors; they provide an alternative coordinate system for viewing the data.
We now seek the $\mathbf{t}_i$ and $\mathbf{p}_i$ vectors that minimize the squared norm of the
appropriate residual matrix, $\Phi_n$ or $\Phi_m$. Which matrix is appropriate depends on the
dimensionality of the vector over which we are carrying out the minimization.

First, to determine each $n$-dimensional vector $\mathbf{p}_i$, since

$$\Phi_n = (\mathbf{X}-\mathbf{t}_1\mathbf{p}_1^T-\mathbf{t}_2\mathbf{p}_2^T-\cdots-\mathbf{t}_k\mathbf{p}_k^T)^T(\mathbf{X}-\mathbf{t}_1\mathbf{p}_1^T-\mathbf{t}_2\mathbf{p}_2^T-\cdots-\mathbf{t}_k\mathbf{p}_k^T) \tag{23.37}$$

we may differentiate with respect to the $n$-dimensional vector $\mathbf{p}_i^T$, obtaining:

$$\frac{\partial\Phi_n}{\partial\mathbf{p}_i^T} = -2\mathbf{t}_i^T(\mathbf{X}-\mathbf{t}_1\mathbf{p}_1^T-\mathbf{t}_2\mathbf{p}_2^T-\cdots-\mathbf{t}_k\mathbf{p}_k^T) \tag{23.38}$$

which, upon setting to the $n$-dimensional row vector of zeros, $\mathbf{0}^T$, and simplifying, yields

$$\mathbf{t}_i^T\mathbf{X} - \lambda_i\mathbf{p}_i^T = \mathbf{0}^T \tag{23.39}$$

where the simplification arises from the orthogonality requirements on $\mathbf{t}_i$ (see
Eq (23.31)). The solution is:

$$\mathbf{p}_i^T = \frac{\mathbf{t}_i^T\mathbf{X}}{\lambda_i} \tag{23.40}$$
Note that this result is true for all values of i, and is independent of k; in
other words, we would obtain precisely the same result regardless of the chosen
truncation. This property is intrinsic to PCA and common with orthogonal
decompositions; it is exploited in various numerical PCA algorithms. The real
challenge at the moment is that Eq (23.40) requires knowledge of ti , which
we currently do not have.
Next, to determine the $m$-dimensional vectors $\mathbf{t}_i$, we work with $\Phi_m$,

$$\Phi_m = (\mathbf{X}-\mathbf{t}_1\mathbf{p}_1^T-\cdots-\mathbf{t}_k\mathbf{p}_k^T)(\mathbf{X}-\mathbf{t}_1\mathbf{p}_1^T-\cdots-\mathbf{t}_k\mathbf{p}_k^T)^T \tag{23.41}$$

and differentiate with respect to the $m$-dimensional vector $\mathbf{t}_i$ to obtain:

$$\frac{\partial\Phi_m}{\partial\mathbf{t}_i} = -2(\mathbf{X}-\mathbf{t}_1\mathbf{p}_1^T-\cdots-\mathbf{t}_k\mathbf{p}_k^T)\mathbf{p}_i \tag{23.42}$$

Setting this to zero and invoking the orthonormality of the $\mathbf{p}_i$ vectors (Eq (23.30)) yields

$$\mathbf{t}_i = \mathbf{X}\mathbf{p}_i \tag{23.43}$$

Substituting Eq (23.43) into Eq (23.40) then gives

$$\mathbf{p}_i^T = \frac{\mathbf{p}_i^T\mathbf{X}^T\mathbf{X}}{\lambda_i} \tag{23.45}$$

or

$$\mathbf{p}_i^T\mathbf{X}^T\mathbf{X} = \lambda_i\mathbf{p}_i^T \tag{23.46}$$

i.e.,

$$(\mathbf{X}^T\mathbf{X} - \lambda_i\mathbf{I})\mathbf{p}_i = \mathbf{0} \tag{23.47}$$
This equation is immediately recognized as the eigenvalue-eigenvector equation, with the following implications:
The loading vectors $\mathbf{p}_i$ are therefore eigenvectors of the matrix $\mathbf{X}^T\mathbf{X}$, with the $\lambda_i$ as the
corresponding eigenvalues. The scores follow from Eq (23.43); collecting the $k$ retained
score and loading vectors columnwise gives

$$[\mathbf{t}_1\;\mathbf{t}_2\;\cdots\;\mathbf{t}_k] = \mathbf{X}\,[\mathbf{p}_1\;\mathbf{p}_2\;\cdots\;\mathbf{p}_k] \tag{23.48}$$

or, simply,

$$\mathbf{T} = \mathbf{X}\mathbf{P} \tag{23.49}$$
as the principal component transform of the data matrix X. The k transformed variables T are called the principal components scores.
The corresponding inverse transform is obtained from Eq (23.29) as:

$$\hat{\mathbf{X}} = \mathbf{T}\mathbf{P}^T \tag{23.50}$$

with $\hat{\mathbf{X}} = \mathbf{X}$ only if $k = n$; otherwise $\hat{\mathbf{X}}$ is a cleaned-up, lower-rank version
of $\mathbf{X}$. The difference,

$$\mathbf{E} = \mathbf{X} - \hat{\mathbf{X}} \tag{23.51}$$

is the residual matrix; it represents the residual information contained in the
portion of the original data associated with the $(n - k)$ loading vectors that
were excluded from the transformation.
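The chain from the eigenvalue equation (23.47) through Eqs (23.49)-(23.51) translates directly into a few lines of linear algebra. In this sketch the data are simulated, and a plain eigendecomposition of XᵀX stands in for the specialized numerical PCA algorithms alluded to above:

    import numpy as np

    # PCA via the eigendecomposition of X'X (Eqs (23.47), (23.49)-(23.51)).
    rng = np.random.default_rng(6)
    X = rng.standard_normal((50, 2)) @ np.array([[1.0, 0.8], [0.0, 0.3]])
    X = X - X.mean(axis=0)                  # mean-center first

    lam, P = np.linalg.eigh(X.T @ X)        # eigenvalues/vectors, Eq (23.47)
    order = np.argsort(lam)[::-1]           # sort in descending order
    lam, P = lam[order], P[:, order]

    k = 1                                   # retain k principal components
    T = X @ P[:, :k]                        # Eq (23.49): the scores
    X_hat = T @ P[:, :k].T                  # Eq (23.50): rank-k reconstruction
    E = X - X_hat                           # Eq (23.51): residual matrix
    print(np.cumsum(lam) / lam.sum())       # cumulative contributions (see Eq (23.56) below)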
23.3.2 Properties of the Eigenvalues

The eigenvalues $\lambda_i$ of the matrix $\mathbf{R} = \mathbf{X}^T\mathbf{X}$ have several important properties, among them:

1. The sum of the eigenvalues equals the trace of $\mathbf{R}$:

$$Tr(\mathbf{R}) = \sum_{i=1}^{n}\lambda_i \tag{23.54}$$

2. The product of the eigenvalues equals the determinant of $\mathbf{R}$:

$$|\mathbf{R}| = \prod_{i=1}^{n}\lambda_i \tag{23.55}$$

3. If, therefore, the matrix $\mathbf{X}^T\mathbf{X} = \mathbf{R}$ is of rank $r < n$, then only $r$ eigenvalues are non-zero; the rest are zero, and the determinant will be zero.
The matrix will therefore be singular (and hence non-invertible).

4. When an eigenvalue is not precisely zero, just small, the cumulative contribution of its corresponding principal component to the overall variation in the data will likewise be small. Such a component may then be
ignored. Thus, by defining the cumulative contribution of the $j$th principal component as:

$$\kappa_j = \frac{\sum_{i=1}^{j}\lambda_i}{\sum_{i=1}^{n}\lambda_i} \tag{23.56}$$

one may choose $k$ such that $\kappa_{k+1}$ does not add much beyond $\kappa_k$.
Typically a plot of $\lambda_j$ versus $j$, known as a Scree plot, shows a "knee"
at or after the point $j = k$ (see Fig 23.3).
23.3.3 Illustrative Example
Even to veterans of multivariate data analysis, the intrinsic linear combinations of variables can make PCA and its results somewhat challenging to
interpret. This is in addition to the usual plotting and visualization challenges
arising from the inherent multidimensional character of such data analysis.
The problem discussed here has therefore been chosen specifically to demonstrate what principal components, scores and loadings mean in real applications, but in a manner somewhat more transparent to interpret⁸.

Problem Statement and Data

The problem involves 100 samples obtained from 16 variables, $Y_1, Y_2, \ldots, Y_{16}$,
to form a 100 × 16 raw data matrix, a plot of which is shown in Fig 23.2. The
primary objective is to analyze the data to see if the dimensionality could
be reduced from 16 to a more manageable number, and to see if there are
any patterns to be extracted from the data. Before going on, the reader is
encouraged to take some time and examine the data plots for any visual clues
regarding the characteristics of the complete data set.

⁸The …
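Readers who wish to experiment can build a data set in the same spirit as this example; the construction below (the split between linear and quadratic columns, the signs, and the noise level) is our own invention, not the actual data behind Fig 23.2:

    import numpy as np

    # A 100 x 16 data set with pure linear and quadratic modes plus noise.
    rng = np.random.default_rng(7)
    m = 100
    x = np.linspace(-1.0, 1.0, m)           # the hidden independent variable

    cols = []
    for j in range(16):
        if j < 10:                          # ten columns carry the linear mode
            base = x if j < 8 else -x       # two of them enter negatively
        else:                               # six columns carry the quadratic mode
            base = -(x ** 2) if j < 14 else x ** 2
        cols.append(base + 0.1 * rng.standard_normal(m))
    X = np.column_stack(cols)
    print(X.shape)                          # (100, 16)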
FIGURE 23.2: Plot of the 16 variables in the illustrative example data set.
The MINITAB output (eigenanalysis of the correlation matrix) includes the loadings of the 16 variables on the first six principal components:

Variable   PC1     PC2     PC3     PC4     PC5     PC6
y1         0.273   0.005  -0.015   0.760   0.442   0.230
y2         0.002  -0.384   0.151  -0.253   0.396  -0.089
y3         0.028   0.411   0.006  -0.111   0.125   0.151
y4        -0.016   0.381   0.155   0.239   0.089  -0.833
y5         0.306  -0.031  -0.180  -0.107   0.407   0.077
y6        -0.348   0.001   0.055   0.119   0.009   0.172
y7         0.285   0.002  -0.247   0.312  -0.631   0.054
y8         0.351  -0.018  -0.030  -0.029   0.019   0.007
y9         0.326   0.027  -0.014  -0.252  -0.100  -0.025
y10        0.346  -0.020  -0.066  -0.063  -0.028  -0.037
y11        0.019   0.427  -0.002  -0.098   0.042   0.238
y12        0.344  -0.025   0.000  -0.117   0.029  -0.198
y13        0.347  -0.014   0.011  -0.193  -0.027  -0.047
y14        0.036   0.427   0.066  -0.116   0.124   0.070
y15       -0.007  -0.412   0.023   0.142  -0.014  -0.219
y16       -0.199   0.030  -0.920  -0.068   0.175  -0.175
These results are best appreciated graphically. First, Fig 23.3 is a Scree
plot, a straightforward plot of the eigenvalues in descending order. The primary characteristic of this plot is that it shows graphically how many eigenvalues (and hence principal components) are necessary to capture most of the
variability in the data. This particular plot shows that after the first two components, not much else is important. The actual numbers in the eigenanalysis
table show that almost 80% of the variability in the data is captured by the
first two principal components; the third principal component contributes less
than 5%, following the 30% contribution from the second principal component.
This is reflected in the very sharp "knee" at the point k + 1 = 3 in the Scree
plot. The implication therefore is that the information contained in the 16
variables can be represented quite well using two principal components, PC1
and PC2, shown in the MINITAB output table.
If we now focus on the first two principal components and their associated
scores and loadings, the first order of business is to plot these to see what
insight they can offer into the data. The individual scores and loading plots
are particularly revealing for this particular problem. Fig 23.4 shows such a
plot for the first principal component. It is important to remember that the
scores indicate what the new data will look like in the transformed coordinates.
FIGURE 23.3: Scree plot showing that the first two components are the most important.
In this case, the top panel of Fig 23.4 indicates that in the direction of the first principal
component, the data set is essentially linear with a positive slope. The loading
plot indicates in what manner this component is represented in (or contributes
to) each of the original 16 variables.
The corresponding plot for the second principal component is shown in Fig
23.5, where we observe another interesting characteristic: the top panel (the
scores) indicates that in the direction of the second principal component, the
data is a downward-pointing quadratic; the bottom panel, the loadings associated with each variable, indicates how this quadratic component contributes
to each variable.

Taken together, these plots indicate that the data consist of only two primary modes: PC1 is linear, and the more dominant of the two; the other, PC2,
is quadratic. Furthermore, the loadings associated with PC1 indicate that the
linear mode manifests negatively in two variables, Y6 and Y16; the variables
for which the loadings are strong and positive (i.e., Y1, Y5, Y7, Y8, Y9, Y10, Y12
and Y13) contain significant amounts of the linear mode at roughly the same
level. For the other remaining 6 variables, the indication of Fig 23.4 is that
they do not contain any of the linear mode. The story for PC2 is similar but
complementary: the quadratic mode manifests negatively in two variables, Y2
and Y15, and positively in four variables, Y3, Y4, Y11 and Y14. The quadratic
mode contributes virtually nothing to the other variables.
It is now interesting to return to the original data set in Fig 23.2 to compare
the raw data with the PCA results. It is now obvious that, other than noise,
the data set consists of linear and quadratic trends only, some positive, some
negative. The principal components reflect these precisely. For example, 10 of
the 16 variables show the linear trends; the remaining 6 show the quadratic
trend.
FIGURE 23.4: Plot of the scores and loading for the first principal component. The
distinct trend indicated in the scores should be interpreted along with the loadings by
comparison to the full original data set in Fig 23.2.
FIGURE 23.5: Plot of the scores and loading for the second principal component. The
distinct trend indicated in the scores should be interpreted along with the loadings by
comparison to the full original data set in Fig 23.2.
The first principal component, PC1, reflects the linear trend as the
more dominant mode, and the variables with the linear trends are all identified in
the loadings of PC1; no variable showing a quadratic trend is included
in this set of loadings. Furthermore, among the variables showing the linear
trend, the slope is negative in Y6 and in Y16; it is positive in the others.
This is reflected perfectly in the loadings for PC1, where the values associated
with the variables Y6 and Y16 are negative, but positive for the others. In
the same manner, the component capturing the downward-pointing quadratic
trend, PC2, is associated with two groups of variables: (i) the variables Y2 and
Y15, whose raw observations show an upward-pointing quadratic (hence the
negative values of the loadings associated with PC2); and (ii) the variables
Y3, Y4, Y11 and Y14, which all show downward-pointing quadratic trends; these
all have positive values in the PC2 loadings.
Finally, we show in Fig 23.6 the two-dimensional score and loading plots
for the first component versus the second. Such plots are standard fare in
PCA. They are designed to show any relationship that might exist between
the scores in the new set of coordinates, and also how the loading vectors of
the first two principal components are related. For this specific example, the
2-D score plot indicates a distinct quadratic relationship between $t_1$ and $t_2$.
To appreciate the information encoded in this plot, observe that the scores
associated with PC1 (shown in Fig 23.4) appear linear, in the form $t_1 = a_1x$,
where $x$ represents the independent variable (because the data matrix has
been mean-centered, there is no need for an intercept). Likewise, the second
set of scores appears quadratic, in the form $t_2 = a_2x^2$ (where the exponent is
to be understood as a term-by-term squaring of the elements in the vector $x$),
so that indeed $t_2 = bt_1^2$, where $b = a_2/a_1^2$. This last expression, the relationship
between the two scores, is what the top panel of Fig 23.6 is encoding.
The 2-D loading plot reveals any relationships that might exist between
the new set of basis vectors constituting PC1 and PC2; it invariably leads
to clustering of the original variables according to patterns in the data. In
this particular case, this plot shows several things simultaneously: first, its
North-South/West-East alignment indicates that, in terms of the original data,
these two principal components are "pure" components: the linear component in
PC1 is pure, with no quadratic component; similarly, PC2 contains a purely
quadratic component. The plot also indicates that the variables Y6 and Y16
cluster together, lying on the negative end of PC1; Y1, Y5, Y7, Y8, Y9, Y10, Y12
and Y13 also cluster together at the positive extreme of the pure component
PC1. The reader should now be able to interpret the vertical segregation and
clustering of the variables showing the quadratic trends.
To summarize, PCA has provided the following insight into this data set:

• It contains only two modes: linear (the more dominant) and quadratic;

• The 16 variables each contain these modes in pure form: the ones showing the linear trend do not show the quadratic trend, and vice versa;

• The variables can be grouped into four distinct categories: (i) Negative linear (Y6, Y16); (ii) Positive linear (Y1, Y5, Y7, Y8, Y9, Y10, Y12 and Y13); (iii) Negative quadratic (Y2 and Y15); and (iv) Positive quadratic (Y3, Y4, Y11 and Y14).
FIGURE 23.6: Scores and loading plots for the first two components. Top panel: the score
plot indicates a quadratic relationship between the two scores t1 and t2; Bottom panel:
the loading vector plot indicates that in the new set of coordinates, the original variables
contain mostly pure components PC1 and PC2, indicated by a distinctive North/South
and West/East alignment of the data vectors, with like variables clustered together
according to the nature of the component contributions. Compare to the full original
data set in Fig 23.2.
It is of course rare to find problems for which the principal components
are as "pure" as in this example. Keep in mind, however, that this example is a
deliberate attempt to give the reader an opportunity to see first a transparent
case where the PCA results can be easily understood. Once grasped, such
understanding is then easier to translate to less transparent cases. For example, had one of the variables contained a mixture of the linear and quadratic
trends, the extent of the mixture would have been reflected in the loadings for
each of the scores: the length of the bar in Fig 23.4 would have indicated how
much of the linear trend it contains, with the corresponding bar in Fig 23.5
indicating the corresponding relative amount of the quadratic trend. The 2-D
loading vector for this variable would then lie at an angle (no longer horizontal
or vertical) indicative of the relative contributions of the linear PC1 and of
PC2 to the raw data.

Additional information, especially about implementing PCA in practice,
may be found in Esbensen (2002)⁹, Naes et al. (2002)¹⁰, Brereton (2003)¹¹,
and in Massart et al. (1988)¹².
23.3.4 Application in Process Monitoring

When PCA is used for process monitoring, the model is developed from data
collected while the process is operating normally; any new data set is then
represented in terms of the model's loading vectors according to:

$$\hat{\mathbf{X}} = \mathbf{t}_1\mathbf{p}_1^T + \mathbf{t}_2\mathbf{p}_2^T + \cdots + \mathbf{t}_k\mathbf{p}_k^T + \mathbf{E} \tag{23.57}$$
Two statistics are used to assess the new data set against normal operation. The first,

$$Q_i = \mathbf{e}_i\mathbf{e}_i^T = (\mathbf{x}_i - \hat{\mathbf{x}}_i)(\mathbf{x}_i - \hat{\mathbf{x}}_i)^T \tag{23.58}$$

is the error sum of squares for the $i$th sample (also known as the lack-of-model-fit statistic); it provides a measure of how well the $i$th sample conforms to the
PCA model and represents the distance between the sample and its projection
onto the $k$-dimensional principal components space. A large value implies that
the new data does not fit well with the correlation structure captured by the
PCA model.
The second statistic was actually introduced earlier in this chapter: the
Hotelling $T^2$ statistic,

$$T_i^2 = \mathbf{x}_i^T\mathbf{S}^{-1}\mathbf{x}_i = \mathbf{t}_i^T\mathbf{\Lambda}^{-1}\mathbf{t}_i \tag{23.59}$$

in terms of the original data, and equivalently in terms of the PCA scores
and eigenvalues; it provides a measure of the variation within the new sample
relative to the variation within the model. A large $T_i^2$ value indicates that the
data scores are much larger than those from which the model was developed.
It provides evidence that the new data is located in a region different from the one
captured in the original data set used to build the PCA model. These concepts
are illustrated in Fig 23.7, adapted from Wise and Gallagher, (1996)¹³.
To determine when large values of these statistics are significant, control
limits must be developed for each one, but this requires making some distributional assumptions. Under normality assumptions, confidence limits for the
$T^2$ statistic are obtained from the $F$-distribution, as indicated in Eq (23.16);
for $Q$, the situation is a bit more complicated, but the limits are still easily
computed numerically (see, e.g., Wise and Gallagher (1996)). Points falling
outside of the control limits then indicate an out-of-control multivariate process in precisely the same manner as with the univariate charts of the previous
chapter. These concepts are illustrated in Fig 23.8 for process data represented
with 2 principal components.

The Wise and Gallagher reference contains, among other things, additional
discussions about the application of PCA in process monitoring.
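As a sketch of how the two statistics might be computed for new samples against an existing PCA model (the toy model below is our construction; the control limits themselves, which need the distributional results just mentioned, are omitted):

    import numpy as np

    # Q and T^2 (Eqs (23.58)-(23.59)) for one mean-centered sample x_new,
    # given loadings P_k and the corresponding score variances lam_k.
    def spe_and_t2(x_new, P_k, lam_k):
        t = P_k.T @ x_new                  # scores of the new sample
        x_hat = P_k @ t                    # projection onto the PCA subspace
        Q = float((x_new - x_hat) @ (x_new - x_hat))   # squared residual
        T2 = float(t @ (t / lam_k))        # t' diag(lam)^-1 t
        return Q, T2

    # usage with a toy 2-component model of 4 variables
    rng = np.random.default_rng(8)
    scores = rng.standard_normal((200, 2)) * np.array([3.0, 1.0])
    P_k, _ = np.linalg.qr(rng.standard_normal((4, 2)))  # orthonormal loadings
    X = scores @ P_k.T                                  # "normal operation" data
    lam_k = (X @ P_k).var(axis=0, ddof=1)               # score variances
    print(spe_and_t2(X[0], P_k, lam_k))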
Model Building in Systems Biology

PCA continues to find application in many non-traditional areas, with its
infiltration into biological research receiving increasing attention. For example, the applications of PCA in regression and in building models have been
extended to signal transduction models in Systems Biology. From multivariate …

¹³Wise, B.M. and N. B. Gallagher, (1996). "The process chemometrics approach to process
monitoring and fault detection," J Process Control, 6(6), 329–348.
FIGURE 23.8: Control limits for Q and T² for process data represented with two
principal components.
23.4 Summary and Conclusions
REVIEW QUESTIONS

1. What is a multivariate probability model? How is it related to the single-variable pdf?

2. The role of the Gaussian distribution in univariate analysis is played in multivariate analysis by what multivariate distribution?

3. The Wishart distribution is the multivariate generalization of what univariate distribution?

4. Hotelling's T² distribution is the multivariate generalization of what univariate distribution?

5. What is the multivariate generalization of the F-distribution?

6. The Dirichlet distribution is the multivariate generalization of what univariate distribution?

7. What is Principal Components Analysis (PCA) and what is it useful for?

8. In Principal Components Analysis (PCA), what is a "loading vector" and a "score vector"?
9. How are the loading vectors in PCA related to the data matrix?
10. How are the score vectors in PCA related to the data matrix and the loading
vectors?
11. In PCA, what is a Scree plot?
12. How is PCA used in process monitoring?
PROJECT ASSIGNMENT
Principal Components Analysis of a Gene Expression Data Set
In the Ringner (2008) reference provided earlier, the author used a set of
microarray data on the expression of 27,648 genes in 105 breast tumor samples
to illustrate how PCA can be used to represent samples with a smaller number of variables, visualize samples and genes, and detect dominant patterns
of gene expression. Only a brief summary of the resulting analysis was presented. The data set, collected by the author and his colleagues, and published
in Saal, et al. (2007)¹⁸, is available through the National Center for Biotechnology Information Gene Expression Omnibus database (accession GSE5325)
and from http://icg.cpmc.columbia.edu/faculty parsons.htm.

Consult the original research paper, Saal, et al. (2007) (which includes
the application of other statistical analysis techniques, such as the Mann-Whitney-Wilcoxon test of Chapter 18, but not PCA), in order to understand
the research objectives and the nature of the data set. Then download the
data set and carry out your own PCA on it. Obtain the scores and loading
plots for the first three principal components and generate 2-D plots similar
to those in Fig 23.6 in the text. Interpret the results to the best of your ability.
Write a report on your analysis and results, comparing them where possible
with those in Ringner (2008).
Appendix

This appendix provides, for readers who might still want them, on-line electronic versions of statistical tables; we also
include a few other on-line resources that the reader might find useful.
Many of the standard statistical tables are now freely available electronically on-line. Some are deployed fully in electronic form, with precise probability or variate values computed on request
from pre-programmed probability distribution functions. Others are "electronic" only in the sense that the same numbers that used to be printed on
paper are now made available in an on-line table; they still require interpolation. Either way, if all one wants to do is simply compute tail-area probabilities
for a wide variety of the usual probability distributions employed in statistical
inference, a dedicated software package is not required. Below is a listing of
three electronic statistical tables, their locations, and a brief summary of their
capabilities.
1. http://stattrek.com/Tables/StatTables.aspx
Conceived as a true on-line calculator, this site provides the capability
for computing, among other things, all manner of probabilities for a
wide variety of discrete and continuous distributions. There are clear
instructions and examples to assist the novice.
2. http://stat.utilities.googlepages.com/tables.htm
(SurfStat statistical tables by Keith Dear and Robert Brennan.)
Truly electronic versions of z-, t-, and χ² tables are available, with a
convenient graphical user interface that allows the user either to specify
the variate and compute the desired tail-area probabilities, or to specify
the tail-area probability and compute the corresponding variate. The
F-tables are available only as text.

3. http://www.statsoft.com/textbook/sttable.html
Provides actual tables of values computed using the commercial STATISTICA BASIC software; available tables include z-, t-, and χ²; the only
F-tables available are for F(0.1, 0.05) and F(0.025, 0.01). (The site includes animated Java images of the various probability distributions
showing various computed, and constantly changing, cumulative probabilities.)
(b) The StatSoft Electronic Statistics Textbook:
http://www.statsoft.com/textbook/stathome.html
3. Data sets: The following site contains links to a wide variety of statistical data, categorized by subject area and government agency. In
addition, it provides links to other sites containing statistical data.
http://www.libraryspot.com/statistics.htm
Also, NIST, the National Institute of Standards and Technology, has a
site dedicated to reference data sets with certified computational results.
The original purpose was to enable the objective evaluation of statistical
software. But instructors and students will find the data sets to be a good
source of extra exercises (based on certified data).
http://www.itl.nist.gov/div898/strd/index.html
Index
bivariate Gaussian distribution, 980
bivariate random variable, 139
continuous, 149
definition, 139
discrete, 150
informal, 140
blocking, 805
blood pressure control system, mammalian, 929
Boltzmann, L. E., 353
Borel fields, 68
box plot, 427, 429
Box, G. E. P., 731, 811, 834, 851
Brauer, F., 890
Burman, J. P., 832
C-chart, 957
calculus of variations, 345
calibration curves, 659
Cardano, 198
Castillo-Chavez, C., 890
catalyst, 12
Cauchy distribution, 786
application in crystallography, 789
model for price uctuations, 786
Cauchy random variable, 314
application
high-resolution price fluctuations,
316
mathematical characteristics, 315
probability model, 314
relation to other random variables,
316
cell culture, 173
Central Limit Theorem, 288, 468, 471,
608, 947
chain length distributions, 235
most probable, 235
Chakraborti, S, 778
characteristic function, 115–116, 177
inversion formula, 116
characteristic parameters, 409
Chebyshev's inequality, 121, 229, 492
chemical engineering
illustration, 35
principles, 38
chemical process control, 964–969
chemical reactors, 35
chi-square random variable
application, 272
special case of gamma random variable, 271
Chi-squared goodness-of-fit test, 739–745
relation to z-test, 745
chi-squared test, 601
Chinese Hamster Ovary (CHO) cells,
330, 361, 453
Clarke, R. D., 868
co-polymer composition, 140
coefficient of determination
adjusted, R²adj, 673
coefficient of determination, R², 672, 673
coefficient of variation, 109
commercial coating process, 879
completely randomized design, 798
balanced or unbalanced, 799
complex system
crosslinks, 909
series-parallel configuration, 906
complex variable, 115
computer programs, 426
conditional distribution
general multivariate, 153
conditional expectation, 156
bivariate, 156
conditional mean, 157
conditional probability
empirical, 209
conditional probability distribution
definition, 147
conditional variance, 157
confidence interval, 509
around regression line, 668
relation to hypothesis tests, 575
confidence interval, 95%, 507
mean of normal population, 507
on the standard deviation, 510
confidence intervals
in regression, 661
conjugate prior distribution, 862
consistent estimator, 492
constrained least squares estimate, 696
continuous stirred tank reactor (CSTR),
35, 193, 261, 325, 361
control charts, 946
C-chart, 957
CUSUM charts, 961
EWMA charts, 963
application, 220
model, 220
distribution
conditional, 147
joint, 141
joint-marginal, 156
marginal, 144, 145, 156
multimodal, 117
posterior, 520
prior, 520
symmetric, 109
unimodal, 117
distributions, 95, 107
leptokurtic, 111
moments of, 107
of several random variables, 141
platykurtic, 111
relationship between joint, conditional, marginal, 519
DNA replication origins
distances between, 269
Donev, A., 839
Doyle III, F. J., 976
Draper, N. R., 834
economic considerations, 12
efficiency, 491
efficient estimator, 491
empirical frequencies, 205
engineering process control, 966
feedback error, 966
Proportional-Integral-Derivative (PID)
controllers, 966
engineering statistics, 2
ensemble, 41
entropy, 119, 338–344
of Bernoulli random variable, 340
of continuous random variable, 340
differential entropy, 342
of deterministic variable, 339
of discrete uniform random variable, 339
Erlang distribution, see gamma distribution
Erlang, A. K., 266
estimation, 489
estimator, 489
estimator characteristics
consistent, 492
efficient, 491
unbiased, 490
estimators
criteria for choosing, 490–493
method of moments, 493
sequence of, 492
Euler equations, 345, 346, 348–350
event, 59
certain, 61, 63
complex, compound, 60
impossible, 61, 63, 67
simple, elementary, 60
events
compound, 64
elementary, 64
mutually exclusive, 63, 69, 78
EWMA charts, 961
related to Shewhart and CUSUM
charts, 963
expected value, 102
definition, 105
properties
absolute convergence, 105, 106
absolute integrability, 105, 106
experiment, 58
conceptual, 58, 64
experimental studies
phases of, 794
exponential distribution
memoryless distribution, 263
geometric distribution as discrete analog, 261
exponential pdf
application in failure time modeling, 913
exponential random variable, 260–264
applications, 263
mathematical characteristics, 262
probability model, 261
special case of gamma random variable, 266
extracellular matrix (ECM), 361
F distribution, 311, 474
F random variable, 309–311
application
ratio of variances, 311
definition, 309
mathematical characteristics, 310
probability model, 310
F-test, 604
in regression, 674
sensitivity to normality assumption,
605
factor levels, 795
factorial designs, 2ᵏ, 814
characteristics
balanced, orthogonal, 816
model, 816
sample size considerations, 818
factorial experiments, 814
factors, 795
failure rate, 911
failure times
distribution of, 913
failure-rate
decreasing, model of, 914
increasing, model of, 914
Fermat, 198
first order ODE, 38
first-principles approach, 2
Fisher information matrix, 837, 848
Fisher, R. A., 209, 255, 309, 541, 858
fluidized bed reactor, 325
Fourier transforms, 116
fractional factorial designs
alias structure, 824
defining relation, 824
design resolution, 824
folding, 826
projection, 825
frequency distribution, 18, 20, 424, 426
frequency polygon, 426
frequentist approach, 426
functional genomics, 305
Gallagher, N. B., 1000
gamma distribution, 266
generalization of Erlang distribution, 266
model for distribution of DNA replication origins, 269
gamma pdf
application in failure time modeling, 917
gamma random variable, 180, 181, 264–271, 462
applications, 269
generalization of exponential random variable, 264
mathematical characteristics, 266
probability model, 265
reproductive property of, 181
Garge, S., 853
Gaudet, S., 1002
Gauss, J.C.F., 279
Gaussian distribution, 654
bivariate, 980
Gaussian probability distribution, 288
Gaussian random variable, 279–292
applications, 290
basic characteristics, 288
Herschel/Maxwell model, 285–287
limiting case of binomial random variable, 280
mathematical characteristics, 288
misconception of, 288
probability model, 288
random motion in a line, 282–285
Gelmi, C. A., 306
gene expression data set, 1004
genetics, 199
geometric random variable, 234
application, 235
mathematical characteristics, 234
probability model, 234
geometric space, 44
Gibbons, J. D., 778
Gossett, W. S., 312
Gram polynomials, 706
granulation process, 296, 297
graphical techniques, 442
Graunt, John, 41
gravitational field, 44
Greenwood, M., 252
group classication, 18
hazard function, 123, 272, 911
bathtub curve, 912
equivalently, failure rate, 912
heat transfer coecient, 34
Hendershot, R. J., 837
hereditary factors, 203
genes, 203
heredity
dominance/recessiveness, 203
genotype, 204
phenotype, 204
Heusner, A. A., 729
Hirschfelder-Curtiss-Bird, 638
histogram, 18, 424
for residual errors, 678
Hoerl, A.E., 697
Hotelling T² statistic, 1000
Hotelling's T-squared distribution, 982
Hotelling, H., 982
housekeeping genes, 764
Hunter, J. S., 811
Hunter, W. G., 811
Huygens, 198
hydrocarbons, 455
hypergeometric random variable, 222
application, 224
mathematical characteristics, 224
probability model, 223
hypothesis
alternative, Hₐ, 560
null, H₀, 560
one-sided, 555
two-sided, 554
hypothesis test, 555
p-value, 560
error
Type I, 557
Type II, 558
general procedure, 560
non-Gaussian populations, 613
power and sample size determination, 591–600
power of, 558
risks
α-risk, 558
β-risk, 558
two proportions, 611
using MINITAB, 573
hypothesis test, significance level of, 557
hypothesis testing
application to US census data, 876
in regression, 664
ideal probability models, 4
in-vitro fertilization (IVF), 42, 101, 225,
413
binomial model
sensitivity analysis of, 393
binomial model for, 372, 377
binomial model validation, 375
Canadian guidelines, 370
central characteristics, 371
clinical data, 367
clinical studies, 367
Elsner clinical data, 375
Elsner clinical study, 369
Elsner study
study characteristics, 375
guidelines and policy, 370
implantation potential, 369, 372
mixture distribution model, 382
model-based optimization, 384
multiple births, 365
risk of, 366
oocyte donation, 364
optimization problem, 385
optimum number of embryos, n*, 385, 387
patient categorization, 390
SEPS parameter
non-uniformity, 380
single embryo probability of success
parameter (SEPS), 372
treatment cycle, 372
treatment outcomes
theoretical analysis of, 390
in-vitro fertilization (IVF) treatment
binomial model of, 397
model-based analysis of, 393
inclusions, 16
independence
pairwise, 78
stochastic, 158
information
quantifying content of, 337
relation to uncertainty, 336
information content, 119
interspike intervals
distribution of, 772
interval estimate, 490
interval estimates, 506–518
difference between the two population means, 512
for regression parameters, 661
mean, unknown, 508
non-Gaussian populations, 514
variance of normal population, 510
interval estimation, 489
interval estimator, 490
inverse bivariate transformation, 182
inverse gamma random variable, 325, 550
inverse transformation, 172
Jacobian
of bivariate inverse transformation,
182
of inverse transformation, 175, 176
Janes, K. A., 1002
Johnson, N. L., 979
joint probability distribution
definition, 142
joint probability distribution function,
144
Jovanovic, L., 976
Kalman lter, 700
kamikaze pilots, 209
Kent, J. T., 982
keystone component, 910
Kholodenko, B. N., 839
Kimball, G. E., 210
Kingman, J.F.C., 68
Kleiber's law, 194, 729
Kolmogorov, 67, 98, 198
Kolmogorov-Smirnov (K-S) test, 770
test statistic, 771
Konishi, S., 970
Kotz, S., 979
Kruskal-Wallis test, 805
nonparametric one-way ANOVA,
805
kurtosis, 111
coefficient of, 111
Lagrange multiplier, 345, 346, 353
Lagrangian functional, 345
Laplace, 198, 519
Laplace transforms, 116
Lauffenburger, D. A., 1002
Lauterbach, J., 837
law of large numbers, 229
least squares
estimator
properties, 660
unbiased, 660
least squares
constrained, 696
estimator, 652
method of, 653
ordinary, 654
principles of, 651, 652
recursive, 698, 699
weighted, 694
Lenth, R. V., 829
life tests, 919
accelerated tests, 919
nonreplacement tests, 919
replacement tests, 919
test statistic, 920
truncated tests, 919
life-testing, 275
likelihood function, 497
likelihood ratio tests, 616–623
linear operator, 107
log-likelihood function, 498
logarithmic series distribution, 255
logarithmic series random variable, 248
logistic distribution, 329
(standard), 326
lognormal distribution, 293
multiplicative characteristics, 294
lognormal random variable, 292–297
applications, 296
central location of
median more natural, 296
mathematical characteristics, 293
probability model, 293
relationship to Gaussian random
variable, 293
loss function
quadratic, 971
Lucas, J. M., 482, 750, 963
Macchietto, S., 839
macromolecules, 113
Malaya, butterflies of, 255, 541
Mann-Whitney U test statistic, 767
Mann-Whitney-Wilcoxon (MWW) test, 766–769
nonparametric alternative to 2-sample t-test, 769
manufactured batch, 66
Mardia, K. V., 982
marginal expectation, 156
marginal probability distribution
definition, 145
marginal variance, 156, 157
Markov Chain Monte Carlo (MCMC),
527
Markov's inequality, 121
Marquardt, D.W., 698
material balance, 38
mathematical biology, 209
mathematical expectation, 102, 154
marginal, 156
of jointly distributed random variables, 154
maximum à-posteriori (MAP) estimate, 863
maximum à-posteriori (MAP) estimator, 520
maximum entropy distribution, 346
beta pdf, 351, 354
continuous uniform pdf, 348
discrete uniform pdf, 352
exponential pdf, 349, 352
gamma pdf, 354
Gaussian pdf, 350, 352
geometric pdf, 347, 352
maximum entropy models, 344–354
from general expectations, 351
maximum entropy principle, 344
maximum likelihood, 496–503
maximum likelihood estimate, 498
characteristics, 501
Gaussian population parameters,
500
in regression, 657
maximum likelihood principle, 616
mean, 437
limiting distribution of, 467
sample, 438
sampling distribution of, 465
normal approximation, 468
standard error of, 467
mean absolute deviation from the median (MADM), 440
mean-time-between-failures (MTBF), 913
median, 117, 118, 437
sample, 439
melt index, 140
Mendel's experiments
multiple traits
pairwise, 205
pea plants
characteristics, 199
genetic traits, 200
results, 207
single traits, 201
Mendel, Gregor, 199
method of moments, 493–496
method of moments estimators
properties
consistent, 496
not unique, 496
microarray
reference spot, 194
test spot, 193
microarray data, 306
fold change ratio, 194
mixture distributions
Beta-Binomial, 328, 382
application to Elsner data, 400
Poisson-Gamma, 278
mode, 117, 438
sample, 439
molecular weight
z average, Mz, 113
distributions (MWD), 113
non-uniform, 113
number average, Mn, 113
weight average, Mw, 113
molecular weight distribution, 140
moment generating function, 113
independent sums, 115
linear transformations, 115
marginal, 156
uniqueness, 114
moments, 107
kth ordinary, 107
central, 108
first, ordinary, 107
second central, 108
monomer molecules, 41
Mooney viscosity, 952
Morse, P. M., 210
multinomial random variable, 231
multivariate normal distribution, 980
multivariate probability model, 978
multivariate process monitoring, 999
multivariate random variable
definition, 141
multivariate transformations, 184
non-square, 185
overdefined, 185
underdefined, 185
square, 184
Mylar, 192
negative binomial distribution, 233
as the Poisson-Gamma mixture distribution, 278
negative binomial random variable, 232–234
mathematical characteristics, 233
probability model, 232
alternative form, 233
Nelson, W. B., 919
nonparametric methods, 759
robust, 759
normal approximation, 468
normal distribution, see Gaussian probability distribution
normal equations, 655
matrix form of, 689
normal population, 471
normal probability plot, 735
application in factorial designs, 829
normality test
for residual errors, 678
null hypothesis, 554
Ogunnaike, B. A., 618, 721, 728, 837, 839, 929, 952, 966, 1002
one-sample sign test, 760
nonparametric alternative to one-sample t-test, 763
nonparametric alternative to one-sample z-test, 763
test statistic, 761
sampling distribution, 761
operating characteristic curve, 939
relation to power and sample size, 942
opinion pollster, 488
optimal experimental designs, 837
A-, D-, E-, G-, and V-, 838
Ospina, P., 976
outcome, 58
outcomes
equiprobable, 76
outlier, 428
overdispersion, 242
Owens, C., 976
P-chart, 954
p-value, 560
borderline, 623
observed significance level, 560
Pólya, George, 233
paper helicopter, 891
Pareto chart, 421
Pareto random variable, 328
Pareto, V. F., 421
particle size distribution, 113, 296
particulate products, 113
parts, defective, 66
Pascal distribution, 233
Pascal, Blaise, 198, 233
pdf, 100
Pearson, 18
Pearson, K., 985
percentiles, 119
phenomenological mechanisms, 3
Philadelphia Eagles, 785
point differential, 2008/2009 season, 785
points scored by, 2008/2009 season, 785
pie chart, 421
Plackett, R.L., 832
plug ow reactor (PFR), 35
point estimate, 489
point estimates
precision of, 503–506
binomial proportion, 505
mean, known, 504
mean, unknown, 505
variance, 505
point estimation, 489
Poisson distribution, 496
Poisson events, 260
Poisson model, 859
Poisson random variable, 173, 174, 236–243, 463
applications, 240
limiting form of binomial random
variable, 236
mathematical characteristics, 239
overdispersed, 242
negative binomial model, 242
probability model, 237, 239
probability model
from first principles, 237
Poisson-Gamma mixture distribution, 276–278
Pólya distribution, see negative binomial distribution
polydispersity index (PDI), 113
polymer reactor, 143
polymer resin, 63
polymeric material, 113
polymerization
free radical, 235
polynomial regression, 701
orthogonal, 704
population, 411, 488
dichotomous, 222
posterior distribution, 520, 522, 863
Pottmann, M., 618, 929
power and sample size
computing with MINITAB, 599
pre-image, 91
prediction error, 668
prediction intervals, 668
principal component
loading vectors, 986
score vectors, 987
principal components, 985
principal components analysis (PCA),
985
application in systems biology, 1000
scale dependent, 986
Scree plot, 990
prior distribution, 520
uniform, 523
probabilistic framework, 3, 47
probability, 43, 69
à-posteriori, 77
à-priori, 77
application by Mendel, 204
bounds, 119
general lemma, 120
calculus of, 68, 69
classical à-priori, 45
conditional, 72–74, 147
equiprobable assignment of, 70
mathematical theory of, 67
relative frequency à-posteriori, 46
set function, 67, 90
bivariate, 139
induced, 90, 95
subjective, 46
theory, 414
total, 74
Theorem of, 76
probability density function
definition, 98
joint, 142
probability distribution function, 23, 43,
47, 78, 96, 100
conditional, 147
convolution of, 179
definition, 98
joint, 142
marginal, 145
probability distributions
chart of connections, 319
probability model validation, 732
probability paper, 734
probability plots, 733–739
modern, 736
probability theory, 58
application to in-vitro fertilization,
395
probability
total
Theorem of, 910
process
chemical, 12
chemical manufacturing, 2
manufacturing, 16, 43, 71
yield, 12
process control, 410, 944
process dynamics, 275, 410
process identication, 410
product law of reliabilities, 903
product law of unreliabilities, 904
product quality, 16
propagation-of-errors
application, 194
Q statistic, 1000
quality assurance, 16
quantiles, 119
quantization error, 342
quartiles, 118
random components, 11
random fluctuations, 205
random mass phenomena, 41
random phenomena, 3
random sample, 460
random variability, 14
random variable, 90, 103, 412
n-dimensional, 141, 172
bivariate, 139
continuous, 146
continuous, 94
definition, 90
discrete, 94
entropy
definition, 119
entropy of, 338
informal, 94
kurtosis, 111
mechanistic underpinnings of, 218
moments
practical applications, 113
multivariate, 164, 978
ordinary moment of, 107
realizations, 409
skewness, 109
two-dimensional, 95, 143
random variable families
Gamma family, 259
generalized model, 276
Gaussian family, 278
generalized model, 300
Ratio family, 300
random variable space, 96, 103
bivariate, 139
random variable sums
pdfs of, 177
cdf approach, 177
characteristic function approach,
177, 180
random variable transformations
bivariate, 182
continuous case, 175
discrete case, 173
general continuous case, 176
non-monotone, 188
single variable, 172
random variables
mutually stochastically independent,
161
continuous, 98
discrete, 95
inter-related, 139
negatively correlated, 158
positively correlated, 158
randomized complete block design, 805
Ray, W. H., 728, 952, 966
Rayleigh distribution, 298
relationship to Weibull distribution,
299
Rayleigh random variable, 297
application, 300
probability model, 298
regression
multiple linear, 686
matrix methods, 688
regression line, 663
regression model
mean-centered, 677
one-parameter, 653
two-parameter, 653
regression parameters, 661
rejection region, 556
relative frequency, 18, 424
relative sensitivity function, 394
reliability, 225, 900
component, 901
definition, 900
system, 901
reliability function, 911
residence time, 35
residence time distribution, 193
exponential, 38
residual
standardized, 679
residual analysis, 678–682
residual error, 658, 678
residual sum-of-squares, 662
response, 795
response surface designs, 834
application to process optimization,
879
Box-Behnken, 835
face centered cube, 835
ridge regression, 697
Riemann integral, 98
Ringner, M., 1002
Rogers, W. B., 837
Ross, P.J., 970
Ryan, T.P., 837
S-chart, 948
sample, 411
sample range, 440
sample space, 59, 61, 68, 103, 412
discrete, 59, 69
sample variance, 441
sampling distribution, 462
of single variance, 473
of two variances, 474
scatter plot, 431, 648
Schwaber, J. S., 618, 929
screening designs
fractional factorial, 822
Plackett-Burman, 833
sensitivity function, 684
set
empty, null, 61
set function
additive, 66, 67
probability, 67
set functions, 65
set operations, 61
sets, 61
algebra, 61
complement, 61
disjoint, 63, 66
intersection, 61
partitioned, 75
union, 61
Sherwood, T.K., 724
Shewhart, W. A., 947
sickle-cell anemia, 252
signal-to-noise ratio, 597
signed ranks, 763
simple system, 901
parallel configuration, 904
series configuration, 903
single factor experiments, 797–811
completely randomized design, 798
data layout, 798
model and hypothesis, 798
two-way classification, 805
randomized complete block design, 805
single proportions, exact test, 610
skewness, 109
coefficient of, 109
Snively, C. M., 837
Sohoni, M. A., 1002
Sorger, P. K., 1002
specification limits, 946
different from control limits, 946
standard Cauchy distribution, 315
standard deviation, 109
pooled, 579
sample, 441
standard normal distribution, 290, 468,
471
standard normal random variable, 290
mathematical characteristics, 292
relationship to the Chi-square random variable, 292
standard uniform random variable, 308
statistic, 461
statistical hypothesis, 554
statistical inference, 412, 470
in life testing, 919
statistical process control, 944
statistics, 415
descriptive, 415
graphical, 416
numerical, 416
inductive, 415
inferential, 415
Stirzaker, D., 130
stochastic independence, 162
definition, 160
mutual, 163
pairwise, 163
Student's t random variable, 311–314
application, 314
definition, 312
mathematical characteristics, 312
probability model, 312
Student's t-distribution, 471
sum-of-squares function, 653
survival function, 122, 911
system reliability function, 918
t-distribution, 312, 314, 471, 508
t-test, 570
one-sample, 571
paired, 586
two-sample, 579, 581
Taguchi, G., 969
Taylor series approximation, 195
Taylor series expansion, 114
Taylor, S.J., 68
temperature control system
reliability, 143
Tendulkar, A. V., 1002
test statistic, 556
theoretical distribution, 22
thermal conductivity, 455
thermocouple calibration, 194
Thermodynamics, 44
time-to-failure, 269, 275
Tobias, R., 839
total quality management, 935
transformations
monotonic, 175
non-monotonic, 176
nonlinear, 174
single variable, 172
treatment, 795
trial, 59
trinomial random variable, 230
probability model, 230
Tukey, J., 427
two-factor experiments, 811
model and hypothesis, 812
randomized complete block design, two-way crossed, 812
two-sample tests, 576
unbiased estimator, 490
uncertainty, 11
uniform random variable, 176
uniform random variable (continuous)
application
random number generation, 309
mathematical characteristics, 308
probability model, 308
relation to other random variables,
309
special case of beta random variable, 308
universal set, 62
US legal system
like hypothesis test, 556
US population, 435
age distribution, 456
US Population data, 870
variability, 2
common cause, special cause, 945
variable
dependent, 650
discrete, 17
qualitative, 417
quantitative, 417
variable transformation
in regression, 685
variables
dependent, 795
independent, 795
nuisance, 811
variance
definition, 108
sample, 441
sampling distribution of, 473
Venn diagram, 66
von Bortkiewicz, L., 857
Wall Street Journal (WSJ), 364
Wangikar, P. P., 1002
Weibull pdf
application in failure time modeling, 914
Weibull random variable, 272–275
applications, 275
mathematical characteristics, 274
probability model, 273
relation to exponential random variable, 272
Weibull, Waloddi, 273
weighted average, 75
Welf, E. S., 361
Westphal, S. P., 364
Wilcoxon signed rank test, 763
normal approximation not recommended, 764
restricted to symmetric distributions, 763
Wilks' Lambda distribution, 982
Wilks, S. S., 982
Wise, B.M., 1000
Wishart distribution, 981
multivariate generalization of χ² distribution, 981
Wishart, J., 981
World War II, 209
Xbar-R chart, 951
Yaffe, M. B., 1002
yield improvement, 12
Yule, G. U., 252
z-score, 290, 564
z-shift, 592
z-test, 563
one-sample, 566
single proportion, large sample, 608
two-sample, 577
Zisser, H., 976
Zitarelli, D. E., 210