1 Exploratory Data Analysis
This chapter presents the assumptions, principles, and techniques necessary to gain insight into
data via EDA--exploratory data analysis.
1. EDA Introduction [1.1.]
1. What is EDA? [1.1.1.]
2. How Does Exploratory Data Analysis differ from Classical Data Analysis? [1.1.2.]
1. Model [1.1.2.1.]
2. Focus [1.1.2.2.]
3. Techniques [1.1.2.3.]
4. Rigor [1.1.2.4.]
5. Data Treatment [1.1.2.5.]
6. Assumptions [1.1.2.6.]
3. How Does Exploratory Data Analysis Differ from Summary Analysis? [1.1.3.]
4. What are the EDA Goals? [1.1.4.]
5. The Role of Graphics [1.1.5.]
6. An EDA/Graphics Example [1.1.6.]
7. General Problem Categories [1.1.7.]
Philosophy EDA is not identical to statistical graphics although the two terms are
used almost interchangeably. Statistical graphics is a collection of
techniques--all graphically based and all focusing on one data
characterization aspect. EDA encompasses a larger venue; EDA is an
approach to data analysis that postpones the usual assumptions about what kind of model the data follow in favor of the more direct approach of allowing the data themselves to reveal their underlying structure and model.
EDA is not a mere collection of techniques; EDA is a philosophy as to
how we dissect a data set; what we look for; how we look; and how we
interpret. It is true that EDA heavily uses the collection of techniques
that we call "statistical graphics", but it is not identical to statistical
graphics per se.
History The seminal work in EDA is Exploratory Data Analysis, Tukey, (1977).
Over the years it has benefitted from other noteworthy publications such
as Data Analysis and Regression, Mosteller and Tukey (1977),
Interactive Data Analysis, Hoaglin (1977), The ABC's of EDA,
Velleman and Hoaglin (1981) and has gained a large following as "the"
way to analyze a data set.
Techniques Most EDA techniques are graphical in nature with a few quantitative
techniques. The reason for the heavy reliance on graphics is that by its
very nature the main role of EDA is to explore open-mindedly, and
graphics gives analysts unparalleled power to do so, enticing the data to
reveal their structural secrets and keeping the analyst ready to gain some
new, often unsuspected, insight into the data. In combination with the
natural pattern-recognition capabilities that we all possess, graphics
provides unparalleled power to carry this out.
The particular graphical techniques employed in EDA are often quite
simple, consisting of various techniques of:
1. Plotting the raw data (such as data traces, histograms, bihistograms,
probability plots, lag plots, block plots, and Youden plots).
2. Plotting simple statistics such as mean plots, standard deviation
plots, box plots, and main effects plots of the raw data.
3. Positioning such plots so as to maximize our natural
pattern-recognition abilities, such as using multiple plots per
page.
Paradigms for Analysis Techniques: These three approaches are similar in that they all
start with a general science/engineering problem and all yield science/engineering
conclusions. The difference is in the sequence and focus of the intermediate steps.
For classical analysis, the sequence is
Problem => Data => Model => Analysis => Conclusions
For EDA, the sequence is
Problem => Data => Analysis => Model => Conclusions
For Bayesian, the sequence is
Problem => Data => Model => Prior Distribution => Analysis =>
Conclusions
Method of Dealing with Underlying Model for the Data Distinguishes the 3 Approaches:
Thus for classical analysis, the data collection is followed by the imposition of a model
(normality, linearity, etc.) and the analysis, estimation, and testing that follows are
focused on the parameters of that model. For EDA, the data collection is not followed by a
model imposition; rather it is followed immediately by analysis with a goal of inferring
what model would be appropriate. Finally, for a Bayesian analysis, the analyst attempts to
incorporate scientific/engineering knowledge/expertise into the analysis by imposing a
data-independent distribution on the parameters of the selected model; the analysis thus
consists of formally combining both the prior distribution on the parameters and the
collected data to jointly make inferences and/or test assumptions about the model
parameters.
In the real world, data analysts freely mix elements of all of the above
three approaches (and other approaches). The above distinctions were
made to emphasize the major differences among the three approaches.
1.1.2.1. Model
Classical The classical approach imposes models (both deterministic and
probabilistic) on the data. Deterministic models include, for example,
regression models and analysis of variance (ANOVA) models. The most
common probabilistic model assumes that the errors about the
deterministic model are normally distributed--this assumption affects the
validity of the ANOVA F tests.
Exploratory The Exploratory Data Analysis approach does not impose deterministic
or probabilistic models on the data. On the contrary, the EDA approach
allows the data to suggest admissible models that best fit the data.
1.1.2.2. Focus
Classical The two approaches differ substantially in focus. For classical analysis,
the focus is on the model--estimating parameters of the model and
generating predicted values from the model.
Exploratory For exploratory data analysis, the focus is on the data--its structure,
outliers, and models suggested by the data.
1.1.2.3. Techniques
Classical Classical techniques are generally quantitative in nature. They include
ANOVA, t tests, chi-squared tests, and F tests.
Exploratory EDA techniques are generally graphical. They include scatter plots,
character plots, box plots, histograms, bihistograms, probability plots,
residual plots, and mean plots.
1.1.2.4. Rigor
Classical Classical techniques serve as the probabilistic foundation of science and
engineering; the most important characteristic of classical techniques is
that they are rigorous, formal, and "objective".
Exploratory EDA techniques do not share in that rigor or formality. EDA techniques
make up for that lack of rigor by being very suggestive, indicative, and
insightful about what the appropriate model should be.
EDA techniques are subjective and depend on interpretation which may
differ from analyst to analyst, although experienced analysts commonly
arrive at identical conclusions.
1.1.2.5. Data Treatment
Exploratory The EDA approach, on the other hand, often makes use of (and shows)
all of the available data. In this sense there is no corresponding loss of
information.
1.1.2.6. Assumptions
Classical The "good news" of the classical approach is that tests based on
classical techniques are usually very sensitive--that is, if a true shift in
location, say, has occurred, such tests frequently have the power to
detect such a shift and to conclude that such a shift is "statistically
significant". The "bad news" is that classical tests depend on underlying
assumptions (e.g., normality), and hence the validity of the test
conclusions becomes dependent on the validity of the underlying
assumptions. Worse yet, the exact underlying assumptions may be
unknown to the analyst, or if known, untested. Thus the validity of the
scientific conclusions becomes intrinsically linked to the validity of the
underlying assumptions. In practice, if such assumptions are unknown
or untested, the validity of the scientific conclusions becomes suspect.
Exploratory In contrast, EDA has as its broadest goal the desire to gain insight into
the engineering/scientific process behind the data. Whereas summary
statistics are passive and historical, EDA is active and futuristic. In an
attempt to "understand" the process and improve it in the future, EDA
uses the data as a "window" to peer into the heart of the process that
generated the data. There is an archival role in the research and
manufacturing world for summary statistics, but there is an enormously
larger role for the EDA approach.
Insight into the Data: Insight implies detecting and uncovering underlying structure in the
data. Such underlying structure may not be encapsulated in the list of
items above; such items serve as the specific targets of an analysis, but
the real insight and "feel" for a data set comes as the analyst judiciously
probes and explores the various subtleties of the data. The "feel" for the
data comes almost exclusively from the application of various graphical
techniques, the collection of which serves as the window into the
essence of the data. Graphics are irreplaceable--there are no quantitative
analogues that will give the same insight as well-chosen graphics.
To get a "feel" for the data, it is not enough for the analyst to know what
is in the data; the analyst also must know what is not in the data, and the
only way to do that is to draw on our own human pattern-recognition
and comparative abilities in the context of a series of judicious graphical
techniques applied to the data.
Quantitative Quantitative techniques are the set of statistical procedures that yield
numeric or tabular output. Examples of quantitative techniques include:
● hypothesis testing
● analysis of variance
● point estimates and confidence intervals
● least squares regression
These and similar techniques are all valuable and are mainstream in
terms of classical analysis.
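As an illustration of the quantitative side, the following minimal sketch (Python with numpy and scipy, using made-up measurements; an illustrative aside, not Handbook output) computes a t-based confidence interval for the mean and an ordinary least-squares line fit.

# Minimal sketch (hypothetical data) of two quantitative techniques:
# a t-based confidence interval for the mean and a least-squares line fit.
import numpy as np
from scipy import stats

y = np.array([9.8, 10.2, 10.1, 9.9, 10.4, 10.0, 9.7, 10.3])   # hypothetical measurements
x = np.arange(len(y))                                          # hypothetical predictor

# Point estimate and 95% confidence interval for the mean
n = len(y)
ybar, s = y.mean(), y.std(ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (ybar - t_crit * s / np.sqrt(n), ybar + t_crit * s / np.sqrt(n))

# Ordinary least-squares regression of y on x
slope, intercept, r, p, stderr = stats.linregress(x, y)
print(ci, slope, intercept)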
Graphical On the other hand, there is a large collection of statistical tools that we
generally refer to as graphical techniques. These include:
● scatter plots
● histograms
● probability plots
● residual plots
● box plots
● block plots
EDA Approach Relies Heavily on Graphical Techniques: The EDA approach relies heavily on
these and similar graphical techniques. Graphical procedures are not just tools that we
could use in an EDA context, they are tools that we must use. Such graphical tools are the
shortest path to gaining insight into a data set in terms of
● testing assumptions
● model selection
● model validation
● estimator selection
● relationship identification
● outlier detection
If one is not using statistical graphics, then one is forfeiting insight into
one or more aspects of the underlying structure of the data.
Data
X Y
10.00 8.04
8.00 6.95
13.00 7.58
9.00 8.81
11.00 8.33
14.00 9.96
6.00 7.24
4.00 4.26
12.00 10.84
7.00 4.82
5.00 5.68
Scatter Plot: In contrast, consider the following simple scatter plot of the data.
Three Additional Data Sets: This kind of characterization for the data serves as the core
for getting insight/feel for the data. Such insight/feel does not come from the
quantitative statistics; on the contrary, calculations of quantitative
statistics such as intercept and slope should be subsequent to the
characterization and will make sense only if the characterization is
true. To illustrate the loss of information that results when the graphics
insight step is skipped, consider the following three data sets
[Anscombe data sets 2, 3, and 4]:
X2 Y2 X3 Y3 X4 Y4
10.00 9.14 10.00 7.46 8.00 6.58
8.00 8.14 8.00 6.77 8.00 5.76
13.00 8.74 13.00 12.74 8.00 7.71
Scatter Plots
Importance of Exploratory Analysis: These points are exactly the substance that provide
and define "insight" and "feel" for a data set. They are the goals and the fruits of an
open exploratory data analysis (EDA) approach to the data. Quantitative statistics are not
wrong per se, but they are incomplete. They are
incomplete because they are numeric summaries which in the
summarization operation do a good job of focusing on a particular
aspect of the data (e.g., location, intercept, slope, degree of relatedness,
etc.) by judiciously reducing the data to a few numbers. Doing so also
filters the data, necessarily omitting and screening out other sometimes
crucial information in the focusing operation. Quantitative statistics
focus but also filter; and filtering is exactly what makes the
quantitative approach incomplete at best and misleading at worst.
The estimated intercepts (= 3) and slopes (= 0.5) for data sets 2, 3, and
4 are misleading because the estimation is done in the context of an
assumed linear model and that linearity assumption is the fatal flaw in
this analysis.
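The point can be checked numerically. The sketch below (Python with numpy and matplotlib; an illustrative aside, not Handbook output) fits a straight line to data set 1 tabulated earlier; repeating the computation on sets 2, 3, and 4 returns essentially the same intercept (about 3) and slope (about 0.5), while only the scatter plots reveal how differently the four sets behave.

# Fit a straight line to the first data set tabulated above. Data sets 2-4
# yield essentially the same intercept (~3) and slope (~0.5) despite having
# very different structure, which only a scatter plot reveals.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

slope, intercept = np.polyfit(x, y, 1)        # quantitative summary: ~0.5, ~3.0
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")

plt.scatter(x, y)                              # the graphical step that should come first
plt.plot(x, intercept + slope * x)
plt.xlabel("X"); plt.ylabel("Y")
plt.show()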
Univariate and Control

UNIVARIATE
Data: A single column of numbers, Y.
Model: y = constant + error
Output:
1. A number (the estimated constant in the model).
2. An estimate of uncertainty for the constant.
3. An estimate of the distribution for the error.
Techniques:
● 4-Plot
● Probability Plot
● PPCC Plot

CONTROL
Data: A single column of numbers, Y.
Model: y = constant + error
Output: A "yes" or "no" to the question "Is the system out of control?".
Techniques:
● Control Charts
Comparative and Screening

COMPARATIVE
Data: A single response variable and k independent variables (Y, X1, X2, ..., Xk); primary
focus is on one (the primary factor) of these independent variables.
Model: y = f(x1, x2, ..., xk) + error
Output: A "yes" or "no" to the question "Is the primary factor significant?".
Techniques:
● Block Plot
● Scatter Plot
● Box Plot

SCREENING
Data: A single response variable and k independent variables (Y, X1, X2, ..., Xk).
Model: y = f(x1, x2, ..., xk) + error
Output:
1. A ranked list (from most important to least important) of factors.
2. Best settings for the factors.
3. A good model/prediction equation relating Y to the factors.
Techniques:
● Block Plot
● Probability Plot
● Bihistogram
Optimization and Regression

OPTIMIZATION
Data: A single response variable and k independent variables (Y, X1, X2, ..., Xk).
Model: y = f(x1, x2, ..., xk) + error
Output: Best settings for the factor variables.
Techniques:
● Block Plot

REGRESSION
Data: A single response variable and k independent variables (Y, X1, X2, ..., Xk). The
independent variables can be continuous.
Model: y = f(x1, x2, ..., xk) + error
Output: A good model/prediction equation relating Y to the factors.
Techniques:
● Scatter Plot
● 6-Plot
Time Series and Multivariate

TIME SERIES
Data: A column of time dependent numbers, Y. In addition, time is an independent variable.
The time variable can be either explicit or implied. If the data are not equi-spaced, the
time variable should be explicitly provided.
Model: yt = f(t) + error. The model can be either time domain based or frequency domain
based.
Output: A good model/prediction equation relating Y to previous values of Y.
Techniques:
● Autocorrelation Plot
● Spectrum
● Complex Demodulation Amplitude Plot
● Complex Demodulation Phase Plot
● ARIMA Models

MULTIVARIATE
Data: k factor variables (X1, X2, ..., Xk).
Model: The model is not explicit.
Output: Identify underlying correlation structure in the data.
Techniques:
● Star Plot
● Scatter Plot Matrix
● Conditioning Plot
● Profile Plot
● Principal Components
● Clustering
● Discrimination/Classification

Note that multivariate analysis is only covered lightly in this Handbook.
Univariate or Single Response Variable: The "fixed location" referred to in item 3 above
differs for different problem types. The simplest problem type is univariate; that is, a
single variable. For the univariate problem, the general model
response = deterministic component + random component
becomes
response = constant + error
Assumptions for Univariate Model: For this case, the "fixed location" is simply the
unknown constant. We can thus imagine the process at hand to be operating under constant
conditions that produce a single column of data with the properties that
● the data are uncorrelated with one another;
Extrapolation to a Function of Many Variables: The universal power and importance of the
univariate model is that it can easily be extended to the more general case where the
deterministic component is not just a constant, but is in fact a function of many
variables, and the engineering objective is to characterize and model the function.
Residuals Will Behave According to Univariate Assumptions: The key point is that
regardless of how many factors there are, and regardless of how complicated the function
is, if the engineer succeeds in choosing a good model, then the differences (residuals)
between the raw response data and the predicted values from the fitted model should
themselves behave like a univariate process. Furthermore, the residuals from this
univariate process fit will behave like:
● random drawings;
Validation of Model: Thus if the residuals from the fitted model do in fact behave like
the ideal, then testing of underlying assumptions becomes a tool for the validation and
quality of fit of the chosen model. On the other hand, if the residuals from the chosen
fitted model violate one or more of the above univariate assumptions, then the chosen
fitted model is inadequate and an opportunity exists for arriving at an improved model.
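A minimal sketch of this residual check, assuming Python with numpy and matplotlib and simulated data (illustrative only): fit a candidate model, then examine whether the residuals look like a "constant + error" univariate process.

# Sketch (hypothetical data) of the residual check described above: fit a
# model, then examine whether the residuals behave like a univariate
# "constant + error" process (no drift, roughly constant spread).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)   # hypothetical response

coef = np.polyfit(x, y, 1)                 # chosen model: straight line
residuals = y - np.polyval(coef, x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(residuals, marker=".")            # run sequence of residuals: should be flat
ax1.set_title("Residual run sequence")
ax2.hist(residuals, bins=15)               # distribution of residuals
ax2.set_title("Residual histogram")
plt.show()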
1.2.2. Importance
Predictability and Statistical Control: Predictability is an all-important goal in science
and engineering. If the four underlying assumptions hold, then we have achieved
probabilistic predictability--the ability to make probability statements not only about
the process in the past, but also about the process in the future. In short, such
processes are said to be "in statistical control".
Validity of Engineering Conclusions: Moreover, if the four assumptions are valid, then the
process is amenable to the generation of valid scientific and engineering conclusions. If
the four assumptions are not valid, then the process is drifting (with respect to
location, variation, or distribution), unpredictable, and out of control. A simple
characterization of such processes by a location estimate, a variation estimate, or a
distribution "estimate" inevitably leads to engineering conclusions that are not valid,
are not supportable (scientifically or legally), and which are not repeatable in the
laboratory.
Four Techniques to Test Underlying Assumptions: The following EDA techniques are simple,
efficient, and powerful for the routine testing of underlying assumptions:
1. run sequence plot (Yi versus i)
2. lag plot (Yi versus Yi-1)
3. histogram (counts versus subgroups of Y)
4. normal probability plot (ordered Y versus theoretical ordered Y)
Plot on a Single Page for a Quick Characterization of the Data: The four EDA plots can be
juxtaposed for a quick look at the characteristics of the data. The plots below are
ordered as follows:
1. Run sequence plot - upper left
2. Lag plot - upper right
3. Histogram - lower left
4. Normal probability plot - lower right
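A minimal 4-plot sketch, assuming Python with matplotlib and scipy and simulated data (the Handbook's own plots were produced with Dataplot), laid out in the order just described:

# Minimal 4-plot sketch: run sequence (upper left), lag plot (upper right),
# histogram (lower left), normal probability plot (lower right).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(loc=10.0, scale=1.0, size=200)   # hypothetical measurement data

fig, ax = plt.subplots(2, 2, figsize=(8, 6))
ax[0, 0].plot(y, marker=".", linestyle="-")      # run sequence plot: Yi versus i
ax[0, 0].set_title("Run sequence plot")
ax[0, 1].scatter(y[:-1], y[1:], s=10)            # lag plot: Yi versus Yi-1
ax[0, 1].set_title("Lag plot")
ax[1, 0].hist(y, bins=20)                        # histogram
ax[1, 0].set_title("Histogram")
stats.probplot(y, dist="norm", plot=ax[1, 1])    # normal probability plot
ax[1, 1].set_title("Normal probability plot")
plt.tight_layout()
plt.show()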
Sample Plot: Assumptions Hold
This 4-plot reveals a process that has fixed location, fixed variation, is random,
apparently has a fixed approximately normal distribution, and has no outliers.
Sample Plot: Assumptions Do Not Hold
If one or more of the four underlying assumptions do not hold, then it will show up in the
various plots as demonstrated in the following example.
This 4-plot reveals a process that has fixed location, fixed variation,
is non-random (oscillatory), has a non-normal, U-shaped
distribution, and has several outliers.
Plots Utilized to Test the Assumptions: Conversely, the underlying assumptions are tested
using the EDA plots:
● Run Sequence Plot:
If the run sequence plot is flat and non-drifting, the
fixed-location assumption holds. If the run sequence plot has a
vertical spread that is about the same over the entire plot, then
the fixed-variation assumption holds.
● Lag Plot:
If the lag plot is structureless, then the randomness assumption
holds.
● Histogram:
If the histogram is bell-shaped, the underlying distribution is
symmetric and perhaps approximately normal.
● Normal Probability Plot:
If the normal probability plot is approximately linear, the underlying distribution is
approximately normal.
1.2.5. Consequences
What If Assumptions Do Not Hold? If some of the underlying assumptions do not hold, what
can be done about it? What corrective actions can be taken? The positive way of
approaching this is to view the testing of underlying assumptions as a
framework for learning about the process. Assumption-testing
promotes insight into important aspects of the process that may not
have surfaced otherwise.
Primary Goal is Correct and Valid Scientific Conclusions: The primary goal is to have
correct, validated, and complete scientific/engineering conclusions flowing from the
analysis. This usually includes intermediate goals such as the derivation of a
good-fitting model and the computation of realistic parameter estimates. It should always
include the ultimate goal of an
understanding and a "feel" for "what makes the process tick". There is
no more powerful catalyst for discovery than the bringing together of
an experienced/expert scientist/engineer and a data set ripe with
intriguing "anomalies" and characteristics.
Consequences of Non-Fixed Location: If the run sequence plot does not support the
assumption of fixed location, then
1. The location may be drifting.
2. The single location estimate may be meaningless (if the process is drifting).
3. The choice of location estimator (e.g., the sample mean) may be sub-optimal.
4. The usual formula for the uncertainty of the mean (s/√N) may be invalid.
Consequences of Non-Fixed Variation: If the run sequence plot does not support the
assumption of fixed variation, then
1. The variation may be drifting.
2. The single variation estimate may be meaningless (if the process
variation is drifting).
3. The variation estimate may be poor.
4. The variation estimate may be biased.
Case Studies The airplane glass failure case study gives an example of determining
an appropriate distribution and estimating the parameters of that
distribution. The uniform random numbers case study gives an
example of determining a more appropriate centrality parameter for a
non-normal distribution.
Table of Contents for Section 1.3:
1. Introduction
2. Analysis Questions
3. Graphical Techniques: Alphabetical
4. Graphical Techniques: By Problem Category
5. Quantitative Techniques: Alphabetical
6. Probability Distributions
1.3.1. Introduction
Graphical and Quantitative Techniques: This section describes many techniques that are
commonly used in exploratory and classical data analysis. This list is by no means meant
to be exhaustive. Additional techniques (both graphical and quantitative) are discussed in
the other chapters. Specifically, the product comparisons chapter has a much more detailed
description of many classical statistical techniques.
EDA emphasizes graphical techniques while classical techniques
emphasize quantitative techniques. In practice, an analyst typically
uses a mixture of graphical and quantitative techniques. In this section,
we have divided the descriptions into graphical and quantitative
techniques. This is for organizational clarity and is not meant to
discourage the use of both graphical and quantitative techniques when
analyzing data.
Use of Techniques Shown in Case Studies: This section emphasizes the techniques
themselves: how the graph or test is defined, published references, and sample output. The
use of the techniques to answer engineering questions is demonstrated in the case studies
section. The case studies do not demonstrate all of the techniques.
Availability in Software: The sample plots and output in this section were generated with
the Dataplot software program. Other general purpose statistical data analysis programs
can generate most of the plots, intervals, and tests discussed here, or macros can be
written to achieve the same result.
Analyst Should Identify Relevant Questions for His Engineering Problem: A critical early
step in any analysis is to identify (for the engineering problem at hand) which of the
above questions are relevant. That is, we need to identify which questions we want
answered and which questions have no bearing on the problem at hand. After collecting such
a set of questions, an equally important step, which is invaluable for maintaining focus,
is to prioritize those questions in decreasing order of importance.
EDA techniques are tied in with each of the questions. There are some EDA techniques
(e.g., the scatter plot) that are broad-brushed and apply almost universally. On the other
hand, there are a large number of EDA techniques that are specific and whose specificity
is tied in with one of the above questions. Clearly if one chooses not to explicitly
identify relevant questions, then one cannot take advantage of these question-specific EDA
techniques.
Linear Intercept Plot: 1.3.3.17
Linear Slope Plot: 1.3.3.18
Linear Residual Standard Deviation Plot: 1.3.3.19
Mean Plot: 1.3.3.20
6-Plot: 1.3.3.33
Sample Plot: Autocorrelations should be near-zero for randomness. Such is not the case in
this example, and thus the randomness assumption fails.
This sample autocorrelation plot shows that the time series is not
random, but rather has a high degree of autocorrelation between
adjacent and near-adjacent observations.
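A sketch of how such a plot can be computed directly, assuming Python with numpy and matplotlib and a simulated series (illustrative only; the Handbook's plots use Dataplot):

# Sketch of an autocorrelation plot: sample autocorrelation coefficients
# plotted against lag. Data here are simulated.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
y = rng.normal(size=200)                          # hypothetical series
y = np.convolve(y, np.ones(5) / 5, mode="valid")  # induce some autocorrelation

yc = y - y.mean()
n = len(yc)
acf = np.array([np.dot(yc[:n - k], yc[k:]) / np.dot(yc, yc) for k in range(40)])

plt.stem(range(40), acf)
plt.axhline(0, color="gray")
plt.xlabel("lag"); plt.ylabel("autocorrelation")
plt.show()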
Importance: Ensure Validity of Engineering Conclusions: Randomness (along with fixed
model, fixed variation, and fixed distribution) is one of the four assumptions that
typically underlie all measurement processes. The randomness assumption is critically
important for the following three reasons:
1. Most standard statistical tests depend on randomness. The
validity of the test conclusions is directly linked to the
validity of the randomness assumption.
2. Many commonly used statistical formulae depend on the randomness
assumption, the most common being the formula for the standard deviation
of the sample mean, s/√N, where s is the standard deviation of the data
and N is the sample size (a simulation sketch of what happens when
randomness fails is given below).
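A small simulation sketch (Python with numpy; the AR(1) coefficient and sample sizes are arbitrary illustrations) of why the s/√N formula requires randomness: for positively autocorrelated data it understates the true uncertainty of the sample mean.

# The usual formula s/sqrt(N) for the standard deviation of the sample mean
# assumes random (uncorrelated) data. For positively autocorrelated data it
# is optimistically small, as this simulation shows.
import numpy as np

rng = np.random.default_rng(2)
n, reps, phi = 100, 2000, 0.8        # hypothetical AR(1) coefficient

means = []
for _ in range(reps):
    y = np.empty(n)
    y[0] = rng.normal()
    for i in range(1, n):
        y[i] = phi * y[i - 1] + rng.normal()
    means.append(y.mean())

formula_sd = np.std(y, ddof=1) / np.sqrt(n)   # s/sqrt(N) from the last series
actual_sd = np.std(means, ddof=1)             # empirical sd of the sample mean
print(formula_sd, actual_sd)                  # actual_sd is much larger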
Case Study The autocorrelation plot is demonstrated in the beam deflection data
case study.
Recommended Next Step: The next step would be to estimate the parameters for the
autoregressive model:
Conclusions We can make the following conclusions from the above plot.
1. The data come from an underlying autoregressive model with
strong positive autocorrelation.
Discussion The plot starts with a high autocorrelation at lag 1 (only slightly less
than 1) that slowly declines. It continues decreasing until it becomes
negative and starts showing an increasing negative autocorrelation.
The decreasing autocorrelation is generally linear with little noise.
Such a pattern is the autocorrelation plot signature of "strong
autocorrelation", which in turn provides high predictability if
modeled properly.
Recommended Next Step: The next step would be to estimate the parameters for the
autoregressive model:
Conclusions We can make the following conclusions from the above plot.
1. The data come from an underlying sinusoidal model.
1.3.3.2. Bihistogram
Purpose: Check for a Change in Location, Variation, or Distribution: The bihistogram is an
EDA tool for assessing whether a before-versus-after engineering modification has caused a
change in
● location;
● variation; or
● distribution.
Sample Plot: This bihistogram reveals that there is a significant difference in ceramic
breaking strength between batch 1 (above) and batch 2 (below).
The batch factor has a significant effect on the location (typical value) for strength
and hence batch is said to be "significant" or to "have an effect". We
thus see graphically and convincingly what a t-test or analysis of
variance would indicate quantitatively.
Case Study The bihistogram is demonstrated in the ceramic strength data case
study.
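A construction sketch for a bihistogram, assuming Python with numpy and matplotlib and simulated "before" and "after" batches (illustrative only):

# The "before" histogram is drawn upward and the "after" histogram downward
# from a common axis, so shifts in location, variation, or distribution are
# visible at a glance.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
before = rng.normal(loc=10.0, scale=1.0, size=200)   # hypothetical batch 1
after = rng.normal(loc=11.0, scale=1.5, size=200)    # hypothetical batch 2

bins = np.linspace(min(before.min(), after.min()), max(before.max(), after.max()), 30)
counts_before, _ = np.histogram(before, bins=bins)
counts_after, _ = np.histogram(after, bins=bins)
centers = 0.5 * (bins[:-1] + bins[1:])
width = bins[1] - bins[0]

plt.bar(centers, counts_before, width=width)          # upper histogram
plt.bar(centers, -counts_after, width=width)          # lower (mirrored) histogram
plt.axhline(0, color="black")
plt.xlabel("response"); plt.ylabel("count (before up / after down)")
plt.show()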
Sample Plot: Weld method 2 is lower (better) than weld method 1 in 10 of 12 cases.
This block plot reveals that in 10 of the 12 cases (bars), weld method 2
is lower (better) than weld method 1. From a binomial point of view,
weld method is statistically significant.
Discussion: Primary Factor Is Denoted by Within-Bar Plot Character: Average number of
defective lead wires per hour from a study with four factors,
1. weld method (2 levels)
2. plant (2 levels)
3. speed (2 levels)
4. shift (3 levels)
are shown in the plot above. Weld method is the primary factor and the
other three factors are nuisance factors. The 12 distinct positions along
the horizontal axis correspond to all possible combinations of the three
nuisance factors, i.e., 12 = 2 plants x 2 speeds x 3 shifts. These 12
conditions provide the framework for assessing whether any conclusions
about the 2 levels of the primary factor (weld method) can truly be
called "general conclusions". If we find that one weld method setting
does better (smaller average defects per hour) than the other weld
method setting for all or most of these 12 nuisance factor combinations,
then the conclusion is in fact general and robust.
Ordering Along the Horizontal Axis: In the above chart, the ordering along the horizontal
axis is as follows:
● The left 6 bars are from plant 1 and the right 6 bars are from plant 2.
● The first 3 bars are from speed 1, the next 3 bars are from speed
2, the next 3 bars are from speed 1, and the last 3 bars are from
speed 2.
● Bars 1, 4, 7, and 10 are from the first shift, bars 2, 5, 8, and 11 are
from the second shift, and bars 3, 6, 9, and 12 are from the third
shift.
Setting 2 Is Better Than Setting 1 in 10 out of 12 Cases: In the block plot for the first
bar (plant 1, speed 1, shift 1), weld method 1 yields about 28 defects per hour while weld
method 2 yields about 22 defects per hour--hence the difference for this combination is
about 6 defects per hour and weld method 2 is seen to be better (smaller number of defects
per hour).
Is "weld method 2 is better than weld method 1" a general conclusion?
For the second bar (plant 1, speed 1, shift 2), weld method 1 is about 37
while weld method 2 is only about 18. Thus weld method 2 is again seen
to be better than weld method 1. Similarly for bar 3 (plant 1, speed 1,
shift 3), we see weld method 2 is smaller than weld method 1. Scanning
over all of the 12 bars, we see that weld method 2 is smaller than weld
method 1 in 10 of the 12 cases, which is highly suggestive of a robust
weld method effect.
Questions The block plot can provide answers to the following questions:
1. Is the factor of interest significant?
2. Does the factor of interest have an effect?
3. Does the location change between levels of the primary factor?
4. Has the process improved?
5. What is the best setting (= level) of the primary factor?
6. How much of an average improvement can we expect with this
best setting of the primary factor?
7. Is there an interaction between the primary factor and one or more
nuisance factors?
8. Does the effect of the primary factor change depending on the
setting of some nuisance factor?
Case Study The block plot is demonstrated in the ceramic strength data case study.
Software Block plots can be generated with the Dataplot software program. They
are not currently available in other statistical software programs.
Sample
Plot:
This bootstrap plot was generated from 500 uniform random numbers.
Bootstrap plots and corresponding histograms were generated for the
mean, median, and mid-range. The histograms for the corresponding
statistics clearly show that for uniform random numbers the mid-range
has the smallest variance and is, therefore, a superior location estimator
to the mean or the median.
The bootstrap plot is simply the computed value of the statistic versus
the subsample number. That is, the bootstrap plot generates the values
for the desired statistic. This is usually immediately followed by a
histogram or some other distributional plot to show the location and
variation of the sampling distribution of the statistic.
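A sketch of the procedure just described, assuming Python with numpy and matplotlib; the subsample count and the mid-range statistic mirror the example above, but the data are freshly simulated uniform random numbers:

# Resample the data with replacement, recompute the statistic for each
# subsample, plot the statistic versus subsample number, then histogram the
# values to show the sampling distribution.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
y = rng.uniform(size=500)
n_boot = 500

midranges = []
for _ in range(n_boot):
    sample = rng.choice(y, size=y.size, replace=True)
    midranges.append(0.5 * (sample.min() + sample.max()))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(midranges, marker=".", linestyle="none")   # bootstrap plot
ax1.set_xlabel("subsample number"); ax1.set_ylabel("mid-range")
ax2.hist(midranges, bins=30)                        # sampling distribution
plt.show()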
Caution on Use of the Bootstrap: The bootstrap is not appropriate for all distributions
and statistics (Efron and Tibshirani). For example, because of the shape of the uniform
distribution, the bootstrap is not appropriate for estimating the
distribution of statistics that are heavily dependent on the tails, such as
the range.
Related Techniques:
Histogram
Jackknife
The jackknife is a technique that is closely related to the bootstrap. The
jackknife is beyond the scope of this handbook. See the Efron and Gong
article for a discussion of the jackknife.
Case Study The bootstrap plot is demonstrated in the uniform random numbers case
study.
Sample Plot
The plot of the original data with the predicted values from a linear fit
indicates that a quadratic fit might be preferable. The Box-Cox
linearity plot shows a value of λ = 2.0. The plot of the transformed
data with the predicted values from a linear fit with the transformed
data shows a better fit (verified by the significant reduction in the
residual standard deviation).
Questions The Box-Cox linearity plot can provide answers to the following
questions:
1. Would a suitable transformation improve my fit?
2. What is the optimal value of the transformation parameter?
Case Study The Box-Cox linearity plot is demonstrated in the Alaska pipeline
data case study.
Software Box-Cox linearity plots are not a standard part of most general
purpose statistical software programs. However, the underlying
technique is based on a transformation and computing a correlation
coefficient. So if a statistical program supports these capabilities,
writing a macro for a Box-Cox linearity plot should be feasible.
Dataplot supports a Box-Cox linearity plot directly.
Sample Plot
The histogram in the upper left-hand corner shows a data set that has
significant right skewness (and so does not follow a normal
distribution). The Box-Cox normality plot shows that the maximum
value of the correlation coefficient is at λ = -0.3. The histogram of the
data after applying the Box-Cox transformation with λ = -0.3 shows a
data set for which the normality assumption is reasonable. This is
verified with a normal probability plot of the transformed data.
Questions The Box-Cox normality plot can provide answers to the following
questions:
1. Is there a transformation that will normalize my data?
2. What is the optimal value of the transformation parameter?
Importance: Normalization Improves Validity of Tests: Normality assumptions are critical
for many univariate intervals and hypothesis tests. It is important to test the normality
assumption. If the data are in fact clearly not normal, the Box-Cox normality plot can
often be used to find a transformation that will approximately normalize the data.
Software Box-Cox normality plots are not a standard part of most general
purpose statistical software programs. However, the underlying
technique is based on a normal probability plot and computing a
correlation coefficient. So if a statistical program supports these
capabilities, writing a macro for a Box-Cox normality plot should be
feasible. Dataplot supports a Box-Cox normality plot directly.
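A sketch of such a macro, assuming Python with numpy, scipy, and matplotlib (illustrative only): sweep a grid of λ values, apply the Box-Cox transformation, and record the correlation coefficient of the normal probability plot at each λ; the peak marks the normalizing transformation.

# Box-Cox normality plot sketch: correlation of the normal probability plot
# versus the transformation parameter lambda. Data are simulated and skewed.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(6)
y = rng.lognormal(mean=0.0, sigma=0.7, size=200)   # hypothetical skewed data (must be > 0)

lambdas = np.linspace(-2, 2, 81)
ppcc = []
for lam in lambdas:
    t = np.log(y) if abs(lam) < 1e-8 else (y**lam - 1) / lam   # Box-Cox transform
    (osm, osr), (slope, intercept, r) = stats.probplot(t, dist="norm")
    ppcc.append(r)

plt.plot(lambdas, ppcc)
plt.xlabel("lambda"); plt.ylabel("normal probability plot correlation")
plt.show()
print("optimal lambda ~", lambdas[int(np.argmax(ppcc))])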
Sample Plot: This box plot reveals that machine has a significant effect on energy with
respect to location and possibly variation.
This box plot, comparing four machines for energy output, shows that
machine has a significant effect on energy with respect to both location
and variation. Machine 3 has the highest energy response (about 72.5);
machine 4 has the least variable energy response with about 50% of its
readings being within 1 energy unit.
Single or Multiple Box Plots Can Be Drawn: A single box plot can be drawn for one batch of
data with no distinct groups. Alternatively, multiple box plots can be drawn together to
compare multiple data sets or to compare groups in a single data set. For a single box
plot, the width of the box is arbitrary. For multiple box
plots, the width of the box plot can be set proportional to the number of
points in the given group or sample (some software implementations of
the box plot simply set all the boxes to the same width).
Box Plots with Fences: There is a useful variation of the box plot that more specifically
identifies outliers. To create this variation:
1. Calculate the median and the lower and upper quartiles.
2. Plot a symbol at the median and draw a box between the lower
and upper quartiles.
3. Calculate the interquartile range (the difference between the upper
and lower quartile) and call it IQ.
4. Calculate the following points:
L1 = lower quartile - 1.5*IQ
L2 = lower quartile - 3.0*IQ
U1 = upper quartile + 1.5*IQ
U2 = upper quartile + 3.0*IQ
5. The line from the lower quartile to the minimum is now drawn
from the lower quartile to the smallest point that is greater than
L1. Likewise, the line from the upper quartile to the maximum is
now drawn to the largest point smaller than U1.
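A sketch of the fence calculation listed above, assuming Python with numpy and made-up data:

# Compute quartiles, the interquartile range, the inner (1.5*IQ) and outer
# (3.0*IQ) fences, and flag points outside the inner fences as potential
# outliers.
import numpy as np

rng = np.random.default_rng(7)
y = np.concatenate([rng.normal(size=100), [6.0, -5.5]])   # hypothetical data with outliers

q1, median, q3 = np.percentile(y, [25, 50, 75])
iq = q3 - q1
l1, u1 = q1 - 1.5 * iq, q3 + 1.5 * iq      # inner fences
l2, u2 = q1 - 3.0 * iq, q3 + 3.0 * iq      # outer fences

whisker_low = y[y >= l1].min()             # smallest point greater than L1
whisker_high = y[y <= u1].max()            # largest point smaller than U1
outliers = y[(y < l1) | (y > u1)]
print(q1, median, q3, whisker_low, whisker_high, outliers)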
Questions The box plot can provide answers to the following questions:
1. Is a factor significant?
2. Does the location differ between subgroups?
3. Does the variation differ between subgroups?
4. Are there any outliers?
Importance: Check the Significance of a Factor: The box plot is an important EDA tool for
determining if a factor has a significant effect on the response with respect to either
location or variation.
The box plot is also an effective tool for summarizing large quantities of
information.
Case Study The box plot is demonstrated in the ceramic strength data case study.
Software Box plots are available in most general purpose statistical software
programs, including Dataplot.
Sample
Plot:
Case Study The complex demodulation amplitude plot is demonstrated in the beam
deflection data case study.
Software Complex demodulation amplitude plots are available in some, but not
most, general purpose statistical software programs. Dataplot supports
complex demodulation amplitude plots.
Sample
Plot:
The mathematical computations for the phase plot are beyond the scope
of the Handbook. Consult Granger (Granger, 1964) for details.
Questions The complex demodulation phase plot answers the following question:
Is the specified demodulation frequency correct?
Case Study The complex demodulation amplitude plot is demonstrated in the beam
deflection data case study.
Software Complex demodulation phase plots are available in some, but not most,
general purpose statistical software programs. Dataplot supports
complex demodulation phase plots.
Sample Plot:
This contour plot shows that the surface is symmetric and peaks in the
center.
Importance: Visualizing 3-Dimensional Data: For univariate data, a run sequence plot and a
histogram are considered necessary first steps in understanding the data. For
2-dimensional data, a scatter plot is a necessary first step in understanding the data.
In a similar manner, 3-dimensional data should be plotted. Small data
sets, such as result from designed experiments, can typically be
represented by block plots, dex mean plots, and the like (here, "DEX"
stands for "Design of Experiments"). For large data sets, a contour plot
or a 3-D surface plot should be considered a necessary first step in
understanding the data.
DEX Contour The dex contour plot is a specialized contour plot used in the design of
Plot experiments. In particular, it is useful for full and fractional designs.
Software Contour plots are available in most general purpose statistical software
programs. They are also available in many general purpose graphics
and mathematics programs. These programs vary widely in the
capabilities for the contour plots they generate. Many provide just a
basic contour plot over a rectangular grid while others permit color
filled or shaded contours. Dataplot supports a fairly basic contour plot.
Construction of DEX Contour Plot: The following are the primary steps in the construction
of the dex contour plot.
1. The x and y axes of the plot represent the values of the first and
second factor (independent) variables.
2. The four vertex points are drawn. The vertex points are (-1,-1),
(-1,1), (1,1), (1,-1). At each vertex point, the average of all the
response values at that vertex point is printed.
3. Similarly, if there are center points, a point is drawn at (0,0) and the
average of the response values at the center points is printed.
4. The linear dex contour plot assumes the model:
The user specifies the target values for which contour lines will be
generated.
The above algorithm assumes a linear model for the design. Dex contour
plots can also be generated for the case in which we assume a quadratic
model for the design. The algebra for solving for U2 in terms of U1
becomes more complicated, but the fundamental idea is the same.
Quadratic models are needed for the case when the average for the center
points does not fall in the range defined by the vertex point (i.e., there is
curvature).
Sample DEX Contour Plot: The following is a dex contour plot for the data used in the Eddy
current case study. The analysis in that case study demonstrated that X1 and X2 were the
most important factors.
Interpretation of the Sample DEX Contour Plot: From the above dex contour plot we can
derive the following information.
1. Interaction significance;
2. Best (data) setting for these 2 dominant factors.
Interaction Significance: Note the appearance of the contour plot. If the contour curves
are linear, then that implies that the interaction term is not significant; if the contour
curves have considerable curvature, then that implies that the interaction
term is large and important. In our case, the contour curves do not have
considerable curvature, and so we conclude that the X1*X2 term is not
significant.
Best Settings To determine the best factor settings for the already-run experiment, we
first must define what "best" means. For the Eddy current data set used to
generate this dex contour plot, "best" means to maximize (rather than
minimize or hit a target) the response. Hence from the contour plot we
determine the best settings for the two dominant factors by simply
scanning the four vertices and choosing the vertex with the largest value
(= average response). In this case, it is (X1 = +1, X2 = +1).
As for factor X3, the contour plot provides no best setting information, and
so we would resort to other tools: the main effects plot, the interaction
effects matrix, or the ordered data to determine optimal X3 settings.
Case Study The Eddy current case study demonstrates the use of the dex contour plot
in the context of the analysis of a full factorial design.
Software DEX contour plots are available in many statistical software programs that
analyze data from designed experiments. Dataplot supports a linear dex
contour plot and it provides a macro for generating a quadratic dex contour
plot.
Sample Plot: Factors 4, 2, 3, and 7 are the Important Factors.
Description of the Plot: For this sample plot, there are seven factors and each factor has
two levels. For each factor, we define a distinct x coordinate for each level
of the factor. For example, for factor 1, level 1 is coded as 0.8 and level
2 is coded as 1.2. The y coordinate is simply the value of the response
variable. The solid horizontal line is drawn at the overall mean of the
response variable. The vertical dotted lines are added for clarity.
Although the plot can be drawn with an arbitrary number of levels for a
factor, it is really only useful when there are two or three levels for a
factor.
Questions The dex scatter plot can be used to answer the following questions:
1. Which factors are important with respect to location and scale?
2. Are there outliers?
Extension for Interaction Effects: Using the concept of the scatterplot matrix, the dex
scatter plot can be extended to display first order interaction effects.
Specifically, if there are k factors, we create a matrix of plots with k
rows and k columns. On the diagonal, the plot is simply a dex scatter
plot with a single factor. For the off-diagonal plots, we multiply the
values of Xi and Xj. For the common 2-level designs (i.e., each factor
has two levels) the values are typically coded as -1 and 1, so the
multiplied values are also -1 and 1. We then generate a dex scatter plot
for this interaction variable. This plot is called a dex interaction effects
plot and an example is shown below.
Interpretation of the Dex Interaction Effects Plot: We can first examine the diagonal
elements for the main effects. These diagonal plots show a great deal of overlap between
the levels for all three factors. This indicates that location and scale effects will be
relatively small.
We can then examine the off-diagonal plots for the first order
interaction effects. For example, the plot in the first row and second
column is the interaction between factors X1 and X2. As with the main
effect plots, no clear patterns are evident.
Case Study The dex scatter plot is demonstrated in the ceramic strength data case
study.
Software Dex scatter plots are available in some general purpose statistical
software programs, although the format may vary somewhat between
these programs. They are essentially just scatter plots with the X
variable defined in a particular way, so it should be feasible to write
macros for dex scatter plots in most statistical software programs.
Dataplot supports a dex scatter plot.
Sample Plot: Factors 4, 2, and 1 are the Most Important Factors.
Questions The dex mean plot can be used to answer the following questions:
1. Which factors are important? The dex mean plot does not provide
a definitive answer to this question, but it does help categorize
factors as "clearly important", "clearly not important", and
"borderline importance".
2. What is the ranking list of the important factors?
Extension for Interaction Effects: Using the concept of the scatter plot matrix, the dex
mean plot can be extended to display first-order interaction effects.
Specifically, if there are k factors, we create a matrix of plots with k
rows and k columns. On the diagonal, the plot is simply a dex mean plot
with a single factor. For the off-diagonal plots, measurements at each
level of the interaction are plotted versus level, where level is Xi times
Xj and Xi is the code for the ith main effect level and Xj is the code for
the jth main effect. For the common 2-level designs (i.e., each factor has
two levels) the values are typically coded as -1 and 1, so the multiplied
values are also -1 and 1. We then generate a dex mean plot for this
interaction variable. This plot is called a dex interaction effects plot and
an example is shown below.
DEX Interaction Effects Plot
This plot shows that the most significant factor is X1 and the most
significant interaction is between X1 and X3.
Case Study The dex mean plot and the dex interaction effects plot are demonstrated
in the ceramic strength data case study.
Software Dex mean plots are available in some general purpose statistical
software programs, although the format may vary somewhat between
these programs. It may be feasible to write macros for dex mean plots in
some statistical software programs that do not support this plot directly.
Dataplot supports both a dex mean plot and a dex interaction effects
plot.
Sample Plot
Questions The dex standard deviation plot can be used to answer the following
questions:
1. How do the standard deviations vary across factors?
2. How do the standard deviations vary within a factor?
3. Which are the most important factors with respect to scale?
4. What is the ranked list of the important factors with respect to
scale?
Importance: Assess Variability: The goal with many designed experiments is to determine
which factors are significant. This is usually determined from the means of the factor
levels (which can be conveniently shown with a dex mean plot). A
secondary goal is to assess the variability of the responses both within a
factor and between factors. The dex standard deviation plot is a
convenient way to do this.
Case Study The dex standard deviation plot is demonstrated in the ceramic strength
data case study.
Software Dex standard deviation plots are not available in most general purpose
statistical software programs. It may be feasible to write macros for dex
standard deviation plots in some statistical software programs that do
not support them directly. Dataplot supports a dex standard deviation
plot.
1.3.3.14. Histogram
Purpose: Summarize a Univariate Data Set: The purpose of a histogram (Chambers) is to
graphically summarize the distribution of a univariate data set.
The histogram graphically shows the following:
1. center (i.e., the location) of the data;
2. spread (i.e., the scale) of the data;
3. skewness of the data;
4. presence of outliers; and
5. presence of multiple modes in the data.
These features provide strong indications of the proper distributional
model for the data. The probability plot or a goodness-of-fit test can be
used to verify the distributional model.
The examples section shows the appearance of a number of common
features revealed by histograms.
Sample Plot
Definition The most common form of the histogram is obtained by splitting the
range of the data into equal-sized bins (called classes). Then for each
bin, the number of points from the data set that fall into that bin is
counted. That is,
● Vertical axis: Frequency (i.e., counts for each bin)
● Horizontal axis: Response variable
The classes can either be defined arbitrarily by the user or via some
systematic rule. A number of theoretically derived rules have been
proposed by Scott (Scott 1992).
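One commonly cited rule due to Scott sets the bin width to roughly 3.49·s·n^(-1/3), where s is the sample standard deviation and n the sample size; the sketch below (Python with numpy and matplotlib, simulated data) applies it.

# Histogram with a Scott-style bin width of 3.49 * s * n**(-1/3).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
y = rng.normal(size=500)                     # hypothetical data

n, s = y.size, y.std(ddof=1)
width = 3.49 * s * n ** (-1.0 / 3.0)
bins = np.arange(y.min(), y.max() + width, width)

plt.hist(y, bins=bins)
plt.xlabel("response"); plt.ylabel("frequency")
plt.show()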
Examples 1. Normal
2. Symmetric, Non-Normal, Short-Tailed
3. Symmetric, Non-Normal, Long-Tailed
4. Symmetric and Bimodal
5. Bimodal Mixture of 2 Normals
6. Skewed (Non-Symmetric) Right
7. Skewed (Non-Symmetric) Left
8. Symmetric with Outlier
Related Techniques:
Frequency Plot
Stem and Leaf Plot
Density Trace
Case Study The histogram is demonstrated in the heat flow meter data case study.
Discussion of Unimodal and Bimodal: The histogram shown above illustrates data from a
bimodal (2 peak) distribution.
In contrast to the previous example, this example illustrates bimodality
due not to an underlying deterministic model, but bimodality due to a
mixture of probability models. In this case, each of the modes appears
to have a rough bell-shaped component. One could easily imagine the
above histogram being generated by a process consisting of two
normal distributions with the same standard deviation but with two
different locations (one centered at approximately 9.17 and the other
centered at approximately 9.26). If this is the case, then the research
challenge is to determine physically why there are two similar but
separate sub-processes.
Recommended Next Steps: If the histogram indicates that the data might be appropriately
fit with a mixture of two normal distributions, the recommended next step is:
Fit the normal mixture model using either least squares or maximum likelihood. The general
normal mixing model is a weighted sum of normal densities; for two components the density
is p·N(x; μ1, σ1) + (1 - p)·N(x; μ2, σ2), where p is the mixing proportion (0 < p < 1).
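A sketch of a maximum-likelihood fit of a two-component normal mixture, assuming Python with scikit-learn's GaussianMixture and simulated bimodal data (any maximum-likelihood or EM fitter would do; scikit-learn is not part of the Handbook):

# Two-component normal mixture fit by maximum likelihood (EM).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(9)
y = np.concatenate([rng.normal(9.17, 0.02, 300),      # hypothetical mode 1
                    rng.normal(9.26, 0.02, 300)])     # hypothetical mode 2

gm = GaussianMixture(n_components=2).fit(y.reshape(-1, 1))
print("means:", gm.means_.ravel())
print("std devs:", np.sqrt(gm.covariances_.ravel()))
print("mixing proportions:", gm.weights_)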
Some Causes for Skewed Data: Skewed data often occur due to lower or upper bounds on the
data. That is, data that have a lower bound are often skewed right while data that have an
upper bound are often skewed left. Skewness can also
result from start-up effects. For example, in reliability applications
some processes may have a large number of initial failures that could
cause left skewness. On the other hand, a reliability process could
have a long start-up period where failures are rare resulting in
right-skewed data.
Data collected in scientific and engineering applications often have a
lower bound of zero. For example, failure data must be non-negative.
Many measurement processes generate only positive data. Time to
occurrence and size are common measurements that cannot be less than
zero.
❍ Gamma family
❍ Chi-square family
❍ Lognormal family
❍ Power lognormal family
3. Consider a normalizing transformation such as the Box-Cox
transformation.
The issues for skewed left data are similar to those for skewed right
data.
6. warm-up effects
to more subtle causes such as
1. A change in settings of factors that (knowingly or unknowingly)
affect the response.
2. Nature is trying to tell us something.
Recommended Next Steps: If the histogram shows the presence of outliers, the recommended
next steps are:
1. Graphically check for outliers (in the commonly encountered
normal case) by generating a box plot. In general, box plots are
a much better graphical tool for detecting outliers than are
histograms.
2. Quantitatively check for outliers (in the commonly encountered
normal case) by carrying out Grubbs test which indicates how
many sample standard deviations away from the sample mean
are the data in question. Large values indicate outliers.
Sample Plot
This sample lag plot exhibits a linear pattern. This shows that the data
are strongly non-random and further suggests that an autoregressive
model might be appropriate.
Definition A lag is a fixed time displacement. For example, given a data set Y1, Y2
..., Yn, Y2 and Y7 have lag 5 since 7 - 2 = 5. Lag plots can be generated
for any arbitrary lag, although the most commonly used lag is 1.
A plot of lag 1 is a plot of the values of Yi versus Yi-1
● Vertical axis: Yi for all i
● Horizontal axis: Yi-1 for all i
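A construction sketch for a lag-1 plot, assuming Python with numpy and matplotlib and a simulated series (illustrative only):

# Lag-1 plot: Yi on the vertical axis against Yi-1 on the horizontal axis.
# Structure in this plot indicates non-randomness.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(10)
y = np.sin(np.linspace(0, 20, 200)) + rng.normal(scale=0.2, size=200)  # hypothetical series

plt.scatter(y[:-1], y[1:], s=10)   # horizontal: Yi-1, vertical: Yi
plt.xlabel("Y(i-1)"); plt.ylabel("Y(i)")
plt.show()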
Case Study The lag plot is demonstrated in the beam deflection data case study.
Software Lag plots are not directly available in most general purpose statistical
software programs. Since the lag plot is essentially a scatter plot with
the 2 variables properly lagged, it should be feasible to write a macro for
the lag plot in most statistical programs. Dataplot supports a lag plot.
Conclusions We can make the following conclusions based on the above plot.
1. The data are random.
2. The data exhibit no autocorrelation.
3. The data contain no outliers.
Discussion The lag plot shown above is for lag = 1. Note the absence of structure.
One cannot infer, from a current value Yi-1, the next value Yi. Thus for a
known value Yi-1 on the horizontal axis (say, Yi-1 = +0.5), the corresponding
Yi value could be virtually anything (from Yi = -2.5 to Yi = +1.5). Such
non-association is the essence of randomness.
Discussion In the plot above for lag = 1, note how the points tend to cluster (albeit
noisily) along the diagonal. Such clustering is the lag plot signature of
moderate autocorrelation.
If the process were completely random, knowledge of a current
observation (say Yi-1 = 0) would yield virtually no knowledge about
the next observation Yi. If the process has moderate autocorrelation, as
above, and if Yi-1 = 0, then the range of possible values for Yi is seen
to be restricted to a smaller range (-.01 to +.01). This suggests
prediction is possible using an autoregressive model.
Conclusions We can make the following conclusions based on the above plot.
1. The data come from an underlying autoregressive model with
strong positive autocorrelation
2. The data contain no outliers.
Discussion Note the tight clustering of points along the diagonal. This is the lag
plot signature of a process with strong positive autocorrelation. Such
processes are highly non-random--there is strong association between
an observation and a succeeding observation. In short, if you know
Yi-1 you can make a strong guess as to what Yi will be.
If the above process were completely random, the plot would have a
shotgun pattern, and knowledge of a current observation (say Yi-1 = 3)
would yield virtually no knowledge about the next observation Yi (it
could here be anywhere from -2 to +8). On the other hand, if the
process had strong autocorrelation, as seen above, and if Yi-1 = 3, then
the range of possible values for Yi is seen to be restricted to a smaller
range (2 to 4)--still wide, but an improvement nonetheless (relative to
-2 to +8) in predictive power.
Recommended Next Step: When the lag plot shows a strongly autoregressive pattern and only
successive observations appear to be correlated, the next steps are to:
1. Estimate the parameters for the autoregressive model:
Since Yi and Yi-1 are precisely the axes of the lag plot, such
estimation is a linear regression straight from the lag plot (see the sketch below).
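A sketch of that regression step, assuming Python with numpy, a simulated AR(1) series, and the lag-1 autoregressive form Yi = A0 + A1*Yi-1 + Ei (this specific model form is assumed here for illustration):

# Regress Yi on Yi-1 to estimate the autoregressive parameters.
import numpy as np

rng = np.random.default_rng(11)
y = np.empty(300)
y[0] = 0.0
for i in range(1, y.size):                  # hypothetical AR(1) series
    y[i] = 1.0 + 0.7 * y[i - 1] + rng.normal(scale=0.5)

a1, a0 = np.polyfit(y[:-1], y[1:], 1)       # slope = A1, intercept = A0
print("A0 =", a0, "A1 =", a1)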
Conclusions We can make the following conclusions based on the above plot.
1. The data come from an underlying single-cycle sinusoidal
model.
2. The data contain three outliers.
Discussion In the plot above for lag = 1, note the tight elliptical clustering of
points. Processes with a single-cycle sinusoidal model will have such
elliptical lag plots.
Consequences of Ignoring Cyclical Pattern: If one were to naively assume that the above
process came from the null model, response = constant + error,
and then estimate the constant by the sample mean, then the analysis
would suffer because
1. the sample mean would be biased and meaningless;
2. the confidence limits would be meaningless and optimistically
small.
The proper model is a sinusoidal one, as discussed in the recommended next steps below.
Unexpected Value of EDA: Frequently a technique (e.g., the lag plot) is constructed to
check one aspect (e.g., randomness), which it does well. Along the way, the
technique also highlights some other anomaly of the data (namely, that
there are 3 outliers). Such outlier identification and removal is
extremely important for detecting irregularities in the data collection
system, and also for arriving at a "purified" data set for modeling. The
lag plot plays an important role in such outlier identification.
Recommended Next Step: When the lag plot indicates a sinusoidal model with possible
outliers, the recommended next steps are:
1. Do a spectral plot to obtain an initial estimate of the frequency
of the underlying cycle. This will be helpful as a starting value
for the subsequent non-linear fitting.
2. Omit the outliers.
3. Carry out a non-linear fit of the model to the 197 points.
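A minimal Python sketch of these next steps, using SciPy in place of Dataplot: the series below is a simulated single-cycle sinusoid (the synthetic data have no outliers, so the outlier-omission step is not shown), the periodogram supplies a starting frequency, and curve_fit performs the non-linear fit.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.signal import periodogram

# Hypothetical single-cycle sinusoidal series with noise.
rng = np.random.default_rng(1)
t = np.arange(200.0)
y = 2.0 + 1.5 * np.sin(2 * np.pi * 0.3 * t + 0.7) + rng.normal(scale=0.3, size=t.size)

# Step 1: spectral estimate of the dominant frequency as a starting value.
freqs, power = periodogram(y)
f0 = freqs[np.argmax(power[1:]) + 1]          # skip the zero-frequency term

# Step 3: non-linear fit of Y = C + alpha*sin(2*pi*omega*t + phi).
def sinusoid(t, c, alpha, omega, phi):
    return c + alpha * np.sin(2 * np.pi * omega * t + phi)

params, _ = curve_fit(sinusoid, t, y, p0=[y.mean(), y.std(), f0, 0.0])
print("C, alpha, omega, phi =", params)
```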
Sample Plot
This linear correlation plot shows that the correlations are high for all
groups. This implies that linear fits could provide a good model for
each of these groups.
Questions The linear correlation plot can be used to answer the following
questions.
1. Are there linear relationships across groups?
2. Is the strength of the linear relationships relatively constant
across the groups?
Importance: For grouped data, it may be important to know whether the different
Checking groups are homogeneous (i.e., similar) or heterogeneous (i.e., different).
Group Linear correlation plots help answer this question in the context of
Homogeneity linear fitting.
Case Study The linear correlation plot is demonstrated in the Alaska pipeline data
case study.
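A minimal Python sketch of a linear correlation plot, assuming the grouped data are held in a dictionary keyed by group id (the data here are simulated stand-ins, not the Alaska pipeline data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical grouped (x, y) data: the same x values measured in six groups.
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
groups = {g: 0.2 * x + rng.normal(scale=0.1, size=x.size) for g in range(1, 7)}

# Correlation between x and y within each group, plotted against the group id.
ids = sorted(groups)
corrs = [np.corrcoef(x, groups[g])[0, 1] for g in ids]
plt.plot(ids, corrs, "o")
plt.ylim(0, 1)
plt.xlabel("group")
plt.ylabel("correlation coefficient")
plt.title("Linear correlation plot")
plt.show()
```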
In some cases you might not have groups. Instead, you have different
data sets and you want to know if the same fit can be adequately applied
to each of the data sets. In this case, simply think of each distinct data
set as a group and apply the linear intercept plot as for groups.
Sample Plot
This linear intercept plot shows that the intercepts for the first few groups are
slightly lower than those for the other groups. Note that these are small differences in the intercepts.
Questions The linear intercept plot can be used to answer the following questions.
1. Is the intercept from linear fits relatively constant across groups?
2. If the intercepts vary across groups, is there a discernible pattern?
Importance: For grouped data, it may be important to know whether the different
Checking groups are homogeneous (i.e., similar) or heterogeneous (i.e., different).
Group Linear intercept plots help answer this question in the context of linear
Homogeneity fitting.
Case Study The linear intercept plot is demonstrated in the Alaska pipeline data
case study.
In some cases you might not have groups. Instead, you have different
data sets and you want to know if the same fit can be adequately applied
to each of the data sets. In this case, simply think of each distinct data
set as a group and apply the linear slope plot as for groups.
Sample Plot
This linear slope plot shows that the slopes are about 0.174 (plus or
minus 0.002) for all groups. There does not appear to be a pattern in the
variation of the slopes. This implies that a single fit may be adequate.
Questions The linear slope plot can be used to answer the following questions.
1. Do you get the same slope across groups for linear fits?
2. If the slopes differ, is there a discernible pattern in the slopes?
Importance: For grouped data, it may be important to know whether the different
Checking groups are homogeneous (i.e., similar) or heterogeneous (i.e., different).
Group Linear slope plots help answer this question in the context of linear
Homogeneity fitting.
Case Study The linear slope plot is demonstrated in the Alaska pipeline data case
study.
Sample Plot
This linear RESSD plot shows that the residual standard deviations
from a linear fit are about 0.0025 for all the groups.
Questions The linear RESSD plot can be used to answer the following questions.
1. Is the residual standard deviation from a linear fit constant across
groups?
2. If the residual standard deviations vary, is there a discernible
pattern across the groups?
Importance: For grouped data, it may be important to know whether the different
Checking groups are homogeneous (i.e., similar) or heterogeneous (i.e., different).
Group Linear RESSD plots help answer this question in the context of linear
Homogeneity fitting.
Case Study The linear residual standard deviation plot is demonstrated in the
Alaska pipeline data case study.
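The linear intercept, slope, and RESSD plots all come from the same group-by-group straight-line fits. A single Python sketch (with simulated stand-in data rather than the Alaska pipeline measurements) illustrating all three:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical grouped (x, y) data; each group gets its own straight-line fit.
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 30)
groups = {g: 1.0 + 0.17 * x + rng.normal(scale=0.05, size=x.size) for g in range(1, 7)}

ids = sorted(groups)
intercepts, slopes, ressd = [], [], []
for g in ids:
    y = groups[g]
    b, a = np.polyfit(x, y, 1)              # slope, intercept
    resid = y - (a + b * x)
    intercepts.append(a)
    slopes.append(b)
    ressd.append(resid.std(ddof=2))         # residual SD, two parameters estimated

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, vals, name in zip(axes, (intercepts, slopes, ressd),
                          ("intercept", "slope", "RESSD")):
    ax.plot(ids, vals, "o")
    ax.set_xlabel("group")
    ax.set_title(f"Linear {name} plot")
plt.tight_layout()
plt.show()
```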
Sample Plot
This sample mean plot shows a shift of location after the 6th month.
Questions The mean plot can be used to answer the following questions.
1. Are there any shifts in location?
2. What is the magnitude of the shifts in location?
3. Is there a distinct pattern in the shifts in location?
Sample Plot
The points on this plot form a nearly linear pattern, which indicates
that the normal distribution is a good model for this data set.
Questions The normal probability plot is used to answer the following questions.
1. Are the data normally distributed?
2. What is the nature of the departure from normality (data
skewed, shorter than expected tails, longer than expected tails)?
Importance: The underlying assumptions for a measurement process are that the
Check data should behave like:
Normality 1. random drawings;
Assumption
2. from a fixed distribution;
3. with fixed location;
4. with fixed scale.
Probability plots are used to assess the assumption of a fixed
distribution. In particular, most statistical models are of the form:
response = deterministic + random
where the deterministic part is the fit and the random part is error. This
error component in most common statistical models is specifically
assumed to be normally distributed with fixed location and scale. This
is the most frequent application of normal probability plots. That is, a
model is fit and a normal probability plot is generated for the residuals
from the fitted model. If the residuals from the fitted model are not
normally distributed, then one of the major assumptions of the model
has been violated.
Related Histogram
Techniques Probability plots for other distributions (e.g., Weibull)
Probability plot correlation coefficient plot (PPCC plot)
Anderson-Darling Goodness-of-Fit Test
Chi-Square Goodness-of-Fit Test
Kolmogorov-Smirnov Goodness-of-Fit Test
Case Study The normal probability plot is demonstrated in the heat flow meter
data case study.
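As an illustrative alternative to a Dataplot macro, SciPy's probplot produces a normal probability plot and returns the fitted line, whose slope and intercept serve as scale and location estimates. The data array below is a simulated stand-in for the heat flow meter measurements.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical measurements; replace with the data being checked for normality.
rng = np.random.default_rng(4)
data = rng.normal(loc=9.26, scale=0.023, size=195)

# probplot plots ordered data against normal order statistic medians and fits a line;
# the line's intercept and slope estimate the location and scale parameters.
(osm, osr), (slope, intercept, r) = stats.probplot(data, dist="norm", plot=plt)
print(f"location ~ {intercept:.3f}, scale ~ {slope:.3f}, correlation = {r:.4f}")
plt.show()
```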
Conclusions We can make the following conclusions from the above plot.
1. The normal probability plot shows a strongly linear pattern. There
are only minor deviations from the line fit to the points on the
probability plot.
2. The normal distribution appears to be a good model for these
data.
Discussion Visually, the probability plot shows a strongly linear pattern. This is
verified by the correlation coefficient of 0.9989 of the line fit to the
probability plot. The fact that the points in the lower and upper extremes
of the plot do not deviate significantly from the straight-line pattern
indicates that there are not any significant outliers (relative to a normal
distribution).
In this case, we can quite reasonably conclude that the normal
distribution provides an excellent model for the data. The intercept and
slope of the fitted line give estimates of 9.26 and 0.023 for the location
and scale parameters of the fitted normal distribution.
Conclusions We can make the following conclusions from the above plot.
1. The normal probability plot shows a non-linear pattern.
2. The normal distribution is not a good model for these data.
Discussion For data with short tails relative to the normal distribution, the
non-linearity of the normal probability plot shows up in two ways. First,
the middle of the data shows an S-like pattern. This is common for both
short and long tails. Second, the first few and the last few points show a
marked departure from the reference fitted line. In comparing this plot
to the long tail example in the next section, the important difference is
the direction of the departure from the fitted line for the first few and
last few points. For short tails, the first few points show increasing
departure from the fitted line above the line and last few points show
increasing departure from the fitted line below the line. For long tails,
this pattern is reversed.
In this case, we can reasonably conclude that the normal distribution
does not provide an adequate fit for this data set. For probability plots
that indicate short-tailed distributions, the next step might be to generate
a Tukey Lambda PPCC plot. The Tukey Lambda PPCC plot can often
be helpful in identifying an appropriate distributional family.
Conclusions We can make the following conclusions from the above plot.
1. The normal probability plot shows a reasonably linear pattern in
the center of the data. However, the tails, particularly the lower
tail, show departures from the fitted line.
2. A distribution other than the normal distribution would be a good
model for these data.
Discussion For data with long tails relative to the normal distribution, the
non-linearity of the normal probability plot can show up in two ways.
First, the middle of the data may show an S-like pattern. This is
common for both short and long tails. In this particular case, the S
pattern in the middle is fairly mild. Second, the first few and the last few
points show marked departure from the reference fitted line. In the plot
above, this is most noticeable for the first few data points. In comparing
this plot to the short-tail example in the previous section, the important
difference is the direction of the departure from the fitted line for the
first few and the last few points. For long tails, the first few points show
increasing departure from the fitted line below the line and last few
points show increasing departure from the fitted line above the line. For
short tails, this pattern is reversed.
In this case we can reasonably conclude that the normal distribution can
be improved upon as a model for these data. For probability plots that
indicate long-tailed distributions, the next step might be to generate a
Tukey Lambda PPCC plot. The Tukey Lambda PPCC plot can often be
helpful in identifying an appropriate distributional family.
Conclusions We can make the following conclusions from the above plot.
1. The normal probability plot shows a strongly non-linear pattern.
Specifically, it shows a quadratic pattern in which all the points
are below a reference line drawn between the first and last points.
2. The normal distribution is not a good model for these data.
Discussion This quadratic pattern in the normal probability plot is the signature of a
significantly right-skewed data set. Similarly, if all the points on the
normal probability plot fell above the reference line connecting the first
and last points, that would be the signature pattern for a significantly
left-skewed data set.
In this case we can quite reasonably conclude that we need to model
these data with a right skewed distribution such as the Weibull or
lognormal.
Sample Plot
● What are good estimates for the location and scale parameters of
the chosen distribution?
Importance: The discussion for the normal probability plot covers the use of
Check probability plots for checking the fixed distribution assumption.
distributional
assumption Some statistical models assume data have come from a population with
a specific type of distribution. For example, in reliability applications,
the Weibull, lognormal, and exponential are commonly used
distributional models. Probability plots can be useful for checking this
distributional assumption.
Related Histogram
Techniques Probability Plot Correlation Coefficient (PPCC) Plot
Hazard Plot
Quantile-Quantile Plot
Anderson-Darling Goodness of Fit
Chi-Square Goodness of Fit
Kolmogorov-Smirnov Goodness of Fit
Case Study The probability plot is demonstrated in the airplane glass failure time
data case study.
Use When comparing distributional models, do not simply choose the one
Judgement with the maximum PPCC value. In many cases, several distributional
When fits provide comparable PPCC values. For example, a lognormal and
Selecting An Weibull may both fit a given set of reliability data quite well.
Appropriate Typically, we would consider the complexity of the distribution. That
Distributional is, a simpler distribution with a marginally smaller PPCC value may
Family be preferred over a more complex distribution. Likewise, there may be
theoretical justification in terms of the underlying scientific model for
preferring a distribution with a marginally smaller PPCC value in
some cases. In other cases, we may not need to know if the
distributional model is optimal, only that it is adequate for our
purposes. That is, we may be able to use techniques designed for
normally distributed data even if other distributions fit the data
somewhat better.
Sample Plot The following is a PPCC plot of 100 normal random numbers. The
maximum value of the correlation coefficient is 0.997 at $\lambda$ = 0.099.
Case Study The PPCC plot is demonstrated in the airplane glass failure data case
study.
Software PPCC plots are currently not available in most common general
purpose statistical software programs. However, the underlying
technique is based on probability plots and correlation coefficients, so
it should be possible to write macros for PPCC plots in statistical
programs that support these capabilities. Dataplot supports PPCC
plots.
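As one concrete possibility, a sketch using SciPy (assuming a SciPy version that provides stats.ppcc_plot and stats.ppcc_max) for a Tukey-Lambda PPCC plot of 100 normal random numbers; the data and seed are illustrative, so the numbers will differ from the sample plot above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# 100 normal random numbers (values differ from the Handbook's sample plot).
rng = np.random.default_rng(5)
x = rng.normal(size=100)

# PPCC plot over a range of Tukey-Lambda shape values; a maximum near 0.14
# is consistent with normality.
fig, ax = plt.subplots()
svals, ppcc = stats.ppcc_plot(x, -2, 2, dist="tukeylambda", plot=ax, N=80)
best = stats.ppcc_max(x, brack=(-2, 2), dist="tukeylambda")
print(f"maximum PPCC at lambda = {best:.3f}")
plt.show()
```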
Sample Plot
Importance: When there are two data samples, it is often desirable to know if the
Check for assumption of a common distribution is justified. If so, then location and
Common scale estimators can pool both data sets to obtain estimates of the
Distribution common location and scale. If two samples do differ, it is also useful to
gain some understanding of the differences. The q-q plot can provide
more insight into the nature of the difference than analytical methods
such as the chi-square and Kolmogorov-Smirnov 2-sample tests.
Related Bihistogram
Techniques T Test
F Test
2-Sample Chi-Square Test
2-Sample Kolmogorov-Smirnov Test
Case Study The quantile-quantile plot is demonstrated in the ceramic strength data
case study.
Software Q-Q plots are available in some general purpose statistical software
programs, including Dataplot. If the number of data points in the two
samples is equal, it should be relatively easy to write a macro in
statistical programs that do not support the q-q plot. If the numbers of
points are not equal, writing a macro for a q-q plot may be difficult; one
approach is sketched below.
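A minimal Python sketch of a two-sample q-q plot for unequal sample sizes, evaluating both samples at a common set of probability points (the two samples here are simulated stand-ins):

```python
import numpy as np
import matplotlib.pyplot as plt

# Two hypothetical samples of unequal size.
rng = np.random.default_rng(6)
a = rng.normal(loc=0.0, scale=1.0, size=249)
b = rng.normal(loc=0.3, scale=1.2, size=79)

# Use the probability points of the smaller sample and interpolate the
# quantiles of both samples at those points.
n = min(a.size, b.size)
probs = (np.arange(1, n + 1) - 0.5) / n
qa = np.quantile(a, probs)
qb = np.quantile(b, probs)

plt.plot(qa, qb, "o")
lims = [min(qa.min(), qb.min()), max(qa.max(), qb.max())]
plt.plot(lims, lims, color="gray")            # 45-degree reference line
plt.xlabel("sample 1 quantiles")
plt.ylabel("sample 2 quantiles")
plt.title("Quantile-quantile plot")
plt.show()
```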
Sample
Plot:
Last Third
of Data
Shows a
Shift of
Location
This sample run sequence plot shows that the location shifts up for the
last third of the data.
Questions The run sequence plot can be used to answer the following questions
1. Are there any shifts in location?
2. Are there any shifts in variation?
3. Are there any outliers?
The run sequence plot can also give the analyst an excellent feel for the
data.
Case Study The run sequence plot is demonstrated in the Filter transmittance data
case study.
Software Run sequence plots are available in most general purpose statistical
software programs, including Dataplot.
Sample
Plot:
Linear
Relationship
Between
Variables Y
and X
This sample plot reveals a linear relationship between the two variables
indicating that a linear regression model might be appropriate.
Examples 1. No relationship
2. Strong linear (positive correlation)
3. Strong linear (negative correlation)
4. Exact linear (positive correlation)
5. Quadratic relationship
6. Exponential relationship
7. Sinusoidal relationship (damped)
8. Variation of Y doesn't depend on X (homoscedastic)
9. Variation of Y does depend on X (heteroscedastic)
10. Outlier
Combining Scatter plots can also be combined in multiple plots per page to help
Scatter Plots understand higher-level structure in data sets with more than two
variables.
The scatterplot matrix generates all pairwise scatter plots on a single
page. The conditioning plot, also called a co-plot or subset plot,
generates scatter plots of Y versus X dependent on the value of a third
variable.
Case Study The scatter plot is demonstrated in the load cell calibration data case
study.
Software Scatter plots are a fundamental technique that should be available in any
general purpose statistical software program, including Dataplot. Scatter
plots are also available in most graphics and spreadsheet programs.
Discussion Note in the plot above how for a given value of X (say X = 0.5), the
corresponding values of Y range all over the place from Y = -2 to Y = +2.
The same is true for other values of X. This lack of predictability in
determining Y from a given value of X, and the associated amorphous,
non-structured appearance of the scatter plot leads to the summary
conclusion: no relationship.
Discussion Note in the plot above how a straight line comfortably fits through the
data; hence a linear relationship exists. The scatter about the line is quite
small, so there is a strong linear relationship. The slope of the line is
positive (small values of X correspond to small values of Y; large values
of X correspond to large values of Y), so there is a positive co-relation
(that is, a positive correlation) between X and Y.
Discussion Note in the plot above how a straight line comfortably fits through the
data; hence there is a linear relationship. The scatter about the line is
quite small, so there is a strong linear relationship. The slope of the line
is negative (small values of X correspond to large values of Y; large
values of X correspond to small values of Y), so there is a negative
co-relation (that is, a negative correlation) between X and Y.
Discussion Note in the plot above how a straight line comfortably fits through the
data; hence there is a linear relationship. The scatter about the line is
zero--there is perfect predictability between X and Y--so there is an
exact linear relationship. The slope of the line is positive (small values
of X correspond to small values of Y; large values of X correspond to
large values of Y), so there is a positive co-relation (that is, a positive
correlation) between X and Y.
Discussion Note in the plot above how no imaginable simple straight line could
ever adequately describe the relationship between X and Y--a curved (or
curvilinear, or non-linear) function is needed. The simplest such
curvilinear function is a quadratic model, $Y = A + BX + CX^2$,
for some A, B, and C. Many other curvilinear functions are possible, but
the data analysis principle of parsimony suggests that we try fitting a
quadratic function first.
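A minimal Python sketch of trying the quadratic fit first (the data are simulated for illustration):

```python
import numpy as np

# Hypothetical curved (x, y) data; by parsimony, fit Y = A + B*X + C*X^2 first.
rng = np.random.default_rng(17)
x = np.linspace(-3, 3, 50)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(scale=0.4, size=x.size)

c, b, a = np.polyfit(x, y, 2)     # polyfit returns coefficients from highest degree down
print(f"A = {a:.2f}, B = {b:.2f}, C = {c:.2f}")
```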
Discussion Note that a simple straight line is grossly inadequate in describing the
relationship between X and Y. A quadratic model would prove lacking,
especially for large values of X. In this example, the large values of X
correspond to nearly constant values of Y, and so a non-linear function
beyond the quadratic is needed. Among the many other non-linear
functions available, one of the simpler ones is the exponential model (for example, $Y = A + B e^{CX}$).
Closer inspection of the scatter plot reveals that the amount of swing
(the amplitude $\alpha$ in the model) does not appear to be constant but rather
is decreasing (damping) as X gets large. We thus would be led to the
conclusion: damped sinusoidal relationship, with the simplest
corresponding model being a sinusoid whose amplitude decays as X increases.
Discussion This scatter plot reveals a linear relationship between X and Y: for a
given value of X, the predicted value of Y will fall on a line. The plot
further reveals that the variation in Y about the predicted value is
about the same (±10 units), regardless of the value of X.
Statistically, this is referred to as homoscedasticity. Such
homoscedasticity is very important as it is an underlying assumption
for regression, and its violation leads to parameter estimates with
inflated variances. If the data are homoscedastic, then the usual
regression estimates can be used. If the data are not homoscedastic,
then the estimates can be improved using weighting procedures as
shown in the next example.
5. Some analysts prefer to connect the scatter plots. Others prefer to
leave a little gap between each plot.
6. Although this plot type is most commonly used for scatter plots,
the basic concept is both simple and powerful and extends easily
to other plot formats that involve pairwise plots such as the
quantile-quantile plot and the bihistogram.
Sample Plot
This sample plot was generated from pollution data collected by NIST
chemist Lloyd Currie.
There are a number of ways to view this plot. If we are primarily
interested in a particular variable, we can scan the row and column for
that variable. If we are interested in finding the strongest relationship,
we can scan all the plots and then determine which variables are
related.
Definition Given k variables, scatter plot matrices are formed by creating k rows
and k columns. Each row and column defines a single scatter plot.
The individual plot for row i and column j is defined as
● Vertical axis: Variable Xi
● Horizontal axis: Variable Xj
Questions The scatterplot matrix can provide answers to the following questions:
1. Are there pairwise relationships between the variables?
2. If there are relationships, what is the nature of these
relationships?
3. Are there outliers in the data?
4. Is there clustering by groups in the data?
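A minimal Python sketch of a scatterplot matrix, using pandas' scatter_matrix on a small simulated data frame (variable names and values are hypothetical):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Hypothetical data frame with k = 4 related variables.
rng = np.random.default_rng(7)
x1 = rng.normal(size=100)
df = pd.DataFrame({
    "X1": x1,
    "X2": 0.8 * x1 + rng.normal(scale=0.5, size=100),
    "X3": rng.normal(size=100),
    "X4": x1 ** 2 + rng.normal(scale=0.5, size=100),
})

# All pairwise scatter plots on one page; the diagonal shows each variable's histogram.
scatter_matrix(df, figsize=(6, 6), diagonal="hist")
plt.show()
```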
Linking and The scatterplot matrix serves as the foundation for the concepts of
Brushing linking and brushing.
By linking, we mean showing how a point, or set of points, behaves in
each of the plots. This is accomplished by highlighting these points in
some fashion. For example, the highlighted points could be drawn as a
filled circle while the remaining points could be drawn as unfilled
circles. A typical application of this would be to show how an outlier
shows up in each of the individual pairwise plots. Brushing extends this
concept a bit further. In brushing, the points to be highlighted are
interactively selected by a mouse and the scatterplot matrix is
dynamically updated (ideally in real time). That is, we can select a
rectangular region of points in one plot and see how those points are
reflected in the other plots. Brushing is discussed in detail by Becker,
Cleveland, and Wilks in the paper "Dynamic Graphics for Data
Analysis" (Cleveland and McGill, 1988).
4. Although this plot type is most commonly used for scatter plots,
the basic concept is both simple and powerful and extends easily
to other plot formats.
Sample Plot
In this case, temperature has six distinct values. We plot torque versus
time for each of these temperatures. This example is discussed in more
detail in the process modeling chapter.
Definition Each panel of the conditioning plot is a scatter plot of Y versus X in which
only the points in the group corresponding to the ith row and jth
column are used.
Questions The conditioning plot can provide answers to the following questions:
1. Is there a relationship between two variables?
2. If there is a relationship, does the nature of the relationship
depend on the value of a third variable?
3. Are groups in the data similar?
4. Are there outliers in the data?
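A minimal Python sketch of a conditioning plot, with one panel per level of the conditioning variable (the torque-versus-time values and temperature levels below are simulated stand-ins):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical torque-versus-time data at six distinct temperatures.
rng = np.random.default_rng(8)
temps = [25, 50, 75, 100, 125, 150]
time = np.linspace(0, 10, 40)

fig, axes = plt.subplots(2, 3, figsize=(9, 6), sharex=True, sharey=True)
for ax, temp in zip(axes.ravel(), temps):
    torque = 5 + 0.1 * temp - 0.3 * time + rng.normal(scale=0.5, size=time.size)
    ax.plot(time, torque, "o", markersize=3)   # one panel per conditioning level
    ax.set_title(f"temperature = {temp}")
plt.tight_layout()
plt.show()
```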
Sample Plot
Questions The spectral plot can be used to answer the following questions:
1. How many cyclic components are there?
2. Is there a dominant cyclic frequency?
3. If there is a dominant cyclic frequency, what is it?
Importance The spectral plot is the primary technique for assessing the cyclic nature
Check of univariate time series in the frequency domain. It is almost always the
Cyclic second plot (after a run sequence plot) generated in a frequency domain
Behavior of analysis of a time series.
Time Series
Case Study The spectral plot is demonstrated in the beam deflection data case study.
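As an illustrative alternative to Dataplot's spectral plot, a raw periodogram from SciPy shows the same dominant-frequency information (details of smoothing differ by package); the series below is simulated.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import periodogram

# Hypothetical time series; replace with the univariate series under study.
rng = np.random.default_rng(9)
t = np.arange(200)
y = np.sin(2 * np.pi * 0.3 * t) + rng.normal(scale=0.5, size=t.size)

# Power against frequency between 0 and 0.5 cycles per observation.
freqs, power = periodogram(y)
plt.plot(freqs, power)
plt.xlabel("frequency (cycles per observation)")
plt.ylabel("power")
plt.title("Spectral plot")
plt.show()
```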
Conclusions We can make the following conclusions from the above plot.
1. There are no dominant peaks.
2. There is no identifiable pattern in the spectrum.
3. The data are random.
Discussion For random data, the spectral plot should show no dominant peaks or
distinct pattern in the spectrum. For the sample plot above, there are no
clearly dominant peaks and the peaks seem to fluctuate at random. This
type of appearance of the spectral plot indicates that there are no
significant cyclic patterns in the data.
Conclusions We can make the following conclusions from the above plot.
1. Strong dominant peak near zero.
2. Peak decays rapidly towards zero.
3. An autoregressive model is an appropriate model.
Discussion This spectral plot starts with a dominant peak near zero and rapidly
decays to zero. This is the spectral plot signature of a process with
strong positive autocorrelation. Such processes are highly non-random
in that there is high association between an observation and a
succeeding observation. In short, if you know Yi you can make a
strong guess as to what Yi+1 will be.
Recommended The next step would be to determine the parameters for the
Next Step autoregressive model $Y_i = A_0 + A_1 Y_{i-1} + E_i$, which can be estimated by a
linear regression of $Y_i$ on $Y_{i-1}$.
Conclusions We can make the following conclusions from the above plot.
1. There is a single dominant peak at approximately 0.3.
2. There is an underlying single-cycle sinusoidal model.
Discussion This spectral plot shows a single dominant frequency. This indicates
that a single-cycle sinusoidal model might be appropriate.
If one were to naively assume that the data represented by the graph
could be fit by the constant-plus-error model
$Y_i = A_0 + E_i$
and then estimate the constant by the sample mean, the analysis would
be incorrect because
● the sample mean is biased;
● the confidence interval for the mean, which is valid only for
random data, is meaningless and too small.
On the other hand, the choice of the proper model, the single-cycle sinusoid
$Y_i = C + \alpha \sin(2\pi\omega t_i + \phi) + E_i ,$
avoids these problems and yields meaningful estimates of the amplitude, frequency, and phase.
Sample Plot
Questions The standard deviation plot can be used to answer the following
questions.
1. Are there any shifts in variation?
2. What is the magnitude of the shifts in variation?
3. Is there a distinct pattern in the shifts in variation?
Sample Plot The plot below contains the star plots of 16 cars. The data file actually
contains 74 cars, but we restrict the plot to what can reasonably be
shown on one page. The variable list for the sample star plot is
1 Price
2 Mileage (MPG)
3 1978 Repair Record (1 = Worst, 5 = Best)
4 1977 Repair Record (1 = Worst, 5 = Best)
5 Headroom
6 Rear Seat Room
7 Trunk Space
8 Weight
9 Length
Definition The star plot consists of a sequence of equi-angular spokes, called radii,
with each spoke representing one of the variables. The data length of a
spoke is proportional to the magnitude of the variable for the data point
relative to the maximum magnitude of the variable across all data
points. A line is drawn connecting the data values for each spoke. This
gives the plot a star-like appearance and the origin of the name of this
plot.
Questions The star plot can be used to answer the following questions:
1. What variables are dominant for a given observation?
2. Which observations are most similar, i.e., are there clusters of
observations?
3. Are there outliers?
Weakness in Star plots are helpful for small-to-moderate-sized multivariate data sets.
Technique Their primary weakness is that their effectiveness is limited to data sets
with fewer than a few hundred points. After that, they tend to be
overwhelming.
Graphical techniques suited for large data sets are discussed by Scott.
Software Star plots are available in some general purpose statistical software
programs, including Dataplot.
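A minimal Python sketch of a star plot drawn on polar axes; the car names, variable labels, and values below are hypothetical and are assumed to be pre-scaled to the unit interval relative to the maximum magnitude of each variable.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical scaled values for two cars on the 9 variables listed above.
labels = ["Price", "MPG", "Repair 78", "Repair 77", "Headroom",
          "Rear Seat", "Trunk", "Weight", "Length"]
cars = {"Car A": [0.6, 0.8, 0.9, 0.7, 0.5, 0.6, 0.7, 0.4, 0.5],
        "Car B": [0.9, 0.3, 0.4, 0.5, 0.8, 0.9, 0.8, 0.9, 0.9]}

angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)
fig, axes = plt.subplots(1, 2, subplot_kw={"projection": "polar"}, figsize=(8, 4))
for ax, (name, vals) in zip(axes, cars.items()):
    v = np.append(vals, vals[0])          # close the polygon
    a = np.append(angles, angles[0])
    ax.plot(a, v)
    ax.fill(a, v, alpha=0.2)
    ax.set_xticks(angles)
    ax.set_xticklabels(labels, fontsize=6)
    ax.set_title(name)
plt.show()
```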
Sample Plot
Questions The Weibull plot can be used to answer the following questions:
1. Do the data follow a 2-parameter Weibull distribution?
2. What is the best estimate of the shape parameter for the
2-parameter Weibull distribution?
3. What is the best estimate of the scale (= variation) parameter for
the 2-parameter Weibull distribution?
The Weibull probability plot (in conjunction with the Weibull PPCC
plot), the Weibull hazard plot, and the Weibull plot are all similar
techniques that can be used for assessing the adequacy of the Weibull
distribution as a model for the data, and additionally providing
estimation for the shape, scale, or location parameters.
The Weibull hazard plot and Weibull plot are designed to handle
censored data (which the Weibull probability plot does not).
Case Study The Weibull plot is demonstrated in the airplane glass failure data case
study.
Sample Plot
Questions The Youden plot can be used to answer the following questions:
1. Are all labs equivalent?
2. What labs have between-lab problems (reproducibility)?
3. What labs have within-lab problems (repeatability)?
4. What labs are outliers?
Importance In interlaboratory studies or in comparing two runs from the same lab, it
is useful to know if consistent results are generated. Youden plots
should be a routine plot for analyzing this type of data.
DEX Youden The dex Youden plot is a specialized Youden plot used in the design of
Plot experiments. In particular, it is useful for full and fractional designs.
Construction The following are the primary steps in the construction of the dex
of DEX Youden plot.
Youden Plot
1. For a given factor or interaction term, compute the mean of the
response variable for the low level of the factor and for the high
level of the factor. Any center points are omitted from the
computation.
2. Plot the point where the y-coordinate is the mean for the high
level of the factor and the x-coordinate is the mean for the low
level of the factor. The character used for the plot point should
identify the factor or interaction term (e.g., "1" for factor 1, "13"
for the interaction between factors 1 and 3).
3. Repeat steps 1 and 2 for each factor and interaction term of the
data.
The high and low values of the interaction terms are obtained by
multiplying the corresponding values of the main level factors. For
example, the interaction term X13 is obtained by multiplying the values
for X1 with the corresponding values of X3. Since the values for X1 and
X3 are either "-1" or "+1", the resulting values for X13 are also either
"-1" or "+1".
In summary, the dex Youden plot is a plot of the mean of the response
variable for the high level of a factor or interaction term against the
mean of the response variable for the low level of that factor or
interaction term.
For unimportant factors and interaction terms, these mean values
should be nearly the same. For important factors and interaction terms,
these mean values should be quite different. So the interpretation of the
plot is that unimportant factors should be clustered together near the
grand mean. Points that stand apart from this cluster identify important
factors that should be included in the model.
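A minimal Python sketch of the dex Youden plot construction described above, using a small hypothetical 2^3 full-factorial design in coded units (the response values are invented so that factors 1 and 2 dominate):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical 2^3 full-factorial design matrix (coded -1/+1) and response.
X = np.array([[-1, -1, -1], [+1, -1, -1], [-1, +1, -1], [+1, +1, -1],
              [-1, -1, +1], [+1, -1, +1], [-1, +1, +1], [+1, +1, +1]])
y = np.array([3.1, 7.9, 6.2, 11.1, 3.0, 8.2, 6.1, 11.0])

# Columns for main effects and two-factor interactions (interaction = product of columns).
terms = {"1": X[:, 0], "2": X[:, 1], "3": X[:, 2],
         "12": X[:, 0] * X[:, 1], "13": X[:, 0] * X[:, 2], "23": X[:, 1] * X[:, 2]}

fig, ax = plt.subplots()
for name, col in terms.items():
    low_mean = y[col == -1].mean()        # mean response at the low level
    high_mean = y[col == +1].mean()       # mean response at the high level
    ax.text(low_mean, high_mean, name)    # the plot character identifies the term
ax.plot([y.min(), y.max()], [y.min(), y.max()], color="gray", linewidth=0.5)
ax.set_xlim(y.min(), y.max())
ax.set_ylim(y.min(), y.max())
ax.set_xlabel("mean response, low level")
ax.set_ylabel("mean response, high level")
ax.set_title("DEX Youden plot")
plt.show()
```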
Sample DEX The following is a dex Youden plot for the data used in the Eddy
Youden Plot current case study. The analysis in that case study demonstrated that
X1 and X2 were the most important factors.
Interpretation From the above dex Youden plot, we see that factors 1 and 2 stand out
of the Sample from the others. That is, the mean response values for the low and high
DEX Youden levels of factor 1 and factor 2 are quite different. For factor 3 and the 2
Plot and 3-term interactions, the mean response values for the low and high
levels are similar.
We would conclude from this plot that factors 1 and 2 are important
and should be included in our final model while the remaining factors
and interactions should be omitted from the final model.
Case Study The Eddy current case study demonstrates the use of the dex Youden
plot in the context of the analysis of a full factorial design.
Software DEX Youden plots are not typically available as built-in plots in
statistical software programs. However, it should be relatively
straightforward to write a macro to generate this plot in most general
purpose statistical software programs.
1.3.3.32. 4-Plot
Purpose: The 4-plot is a collection of 4 specific EDA graphical techniques
Check whose purpose is to test the assumptions that underlie most
Underlying measurement processes. A 4-plot consists of a
Statistical 1. run sequence plot;
Assumptions
2. lag plot;
3. histogram;
4. normal probability plot.
If the 4 underlying assumptions of a typical measurement process
hold, then the above 4 plots will have a characteristic appearance (see
the normal random numbers case study below); if any of the
underlying assumptions fail to hold, then it will be revealed by an
anomalous appearance in one or more of the plots. Several commonly
encountered situations are demonstrated in the case studies below.
Although the 4-plot has an obvious use for univariate and time series
data, its usefulness extends far beyond that. Many statistical models of
the form $Y = f(X_1, \ldots, X_k) + E$
have the same underlying assumptions for the error term. That is, no
matter how complicated the functional fit, the assumptions on the
underlying error term are still the same. The 4-plot can and should be
routinely applied to the residuals when fitting models regardless of
whether the model is simple or complicated.
Sample Plot:
Process Has
Fixed
Location,
Fixed
Variation,
Non-Random
(Oscillatory),
Non-Normal
U-Shaped
Distribution,
and Has 3
Outliers.
Related Autocorrelation Plot
Techniques Spectral Plot
PPCC Plot
Case Studies The 4-plot is used in most of the case studies in this chapter:
1. Normal random numbers (the ideal)
2. Uniform random numbers
3. Random walk
4. Josephson junction cryothermometry
5. Beam deflections
6. Filter transmittance
7. Standard resistor
8. Heat flow meter 1
Software It should be feasible to write a macro for the 4-plot in any general
purpose statistical software program that supports the capability for
multiple plots per page and supports the underlying plot techniques.
Dataplot supports the 4-plot.
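A minimal Python sketch of such a macro, assembling the four plots with Matplotlib and SciPy (the series y is simulated and stands in for the measurements or residuals being checked):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical measurement series (or model residuals) to be checked.
rng = np.random.default_rng(10)
y = rng.normal(size=200)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].plot(y)                                   # 1. run sequence plot
axes[0, 0].set_title("Run sequence plot")
axes[0, 1].plot(y[:-1], y[1:], "o", markersize=3)    # 2. lag plot (Y_i vs Y_(i-1))
axes[0, 1].set_title("Lag plot")
axes[1, 0].hist(y, bins=20)                          # 3. histogram
axes[1, 0].set_title("Histogram")
stats.probplot(y, dist="norm", plot=axes[1, 1])      # 4. normal probability plot
axes[1, 1].set_title("Normal probability plot")
plt.tight_layout()
plt.show()
```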
1.3.3.33. 6-Plot
Purpose: The 6-plot is a collection of 6 specific graphical techniques whose
Graphical purpose is to assess the validity of a Y versus X fit. The fit can be a
Model linear fit, a non-linear fit, a LOWESS (locally weighted least squares)
Validation fit, a spline fit, or any other fit utilizing a single independent variable.
The 6 plots are:
1. Scatter plot of the response and predicted values versus the
independent variable;
2. Scatter plot of the residuals versus the independent variable;
3. Scatter plot of the residuals versus the predicted values;
4. Lag plot of the residuals;
5. Histogram of the residuals;
6. Normal probability plot of the residuals.
Sample Plot
This 6-plot, which followed a linear fit, shows that the linear model is
not adequate. It suggests that a quadratic model would be a better
model.
Case Study The 6-plot is used in the Alaska pipeline data case study.
Software It should be feasible to write a macro for the 6-plot in any general
purpose statistical software program that supports the capability for
multiple plots per page and supports the underlying plot techniques.
Dataplot supports the 6-plot.
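A minimal Python sketch of such a macro for the 6-plot, following a straight-line fit to simulated Y-versus-X data (here the true relationship is quadratic, so the residual plots would reveal the inadequacy):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical Y-versus-X data with a straight-line fit to be validated.
rng = np.random.default_rng(11)
x = np.linspace(0, 10, 100)
y = 1.0 + 0.5 * x + 0.08 * x**2 + rng.normal(scale=0.3, size=x.size)
b, a = np.polyfit(x, y, 1)
pred = a + b * x
resid = y - pred

fig, axes = plt.subplots(2, 3, figsize=(10, 6))
axes[0, 0].plot(x, y, "o", x, pred, "-")             # 1. response and predicted vs X
axes[0, 1].plot(x, resid, "o")                       # 2. residuals vs X
axes[0, 2].plot(pred, resid, "o")                    # 3. residuals vs predicted values
axes[1, 0].plot(resid[:-1], resid[1:], "o")          # 4. lag plot of residuals
axes[1, 1].hist(resid, bins=15)                      # 5. histogram of residuals
stats.probplot(resid, dist="norm", plot=axes[1, 2])  # 6. normal probability plot of residuals
plt.tight_layout()
plt.show()
```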
1.3.4. Graphical Techniques: By Problem Category
Univariate: y = c + e
Time Series: y = f(t) + e (includes the Complex Demodulation Amplitude Plot: 1.3.3.8 and the Complex Demodulation Phase Plot: 1.3.3.9)
1 Factor: y = f(x) + e
Multi-Factor/Comparative: y = f(xp, x1, x2, ..., xk) + e (includes the Block Plot: 1.3.3.3)
Multi-Factor/Screening: y = f(x1, x2, x3, ..., xk) + e (includes the Contour Plot: 1.3.3.10)
Regression: y = f(x1, x2, x3, ..., xk) + e
Interlab: (y1, y2) = f(x) + e (includes the Youden Plot: 1.3.3.31)
Multivariate: (y1, y2, ..., yp) (includes the Star Plot: 1.3.3.29)
Hypothesis Hypothesis tests also address the uncertainty of the sample estimate.
Tests However, instead of providing an interval, a hypothesis test attempts to
refute a specific claim about a population parameter based on the
sample data. For example, the hypothesis might be one of the
following:
● the population mean is equal to 10
Table of Some of the more common classical quantitative techniques are listed
Contents below. This list of quantitative techniques is by no means meant to be
exhaustive. Additional discussions of classical statistical techniques are
contained in the product comparisons chapter.
● Location
1. Measures of Location
2. Confidence Limits for the Mean and One Sample t-Test
3. Two Sample t-Test for Equal Means
4. One Factor Analysis of Variance
5. Multi-Factor Analysis of Variance
● Scale (or variability or spread)
1. Measures of Scale
2. Bartlett's Test
3. Chi-Square Test
4. F-Test
5. Levene Test
● Skewness and Kurtosis
1. Measures of Skewness and Kurtosis
● Randomness
1. Autocorrelation
2. Runs Test
● Distributional Measures
1. Anderson-Darling Test
2. Chi-Square Goodness-of-Fit Test
3. Kolmogorov-Smirnov Test
● Outliers
1. Grubbs Test
● 2-Level Factorial Designs
1. Yates Analysis
Definition of The first step is to define what we mean by a typical value. For
Location univariate data, there are three common definitions:
1. mean - the mean is the sum of the data points divided by the
number of data points. That is,
$\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i .$
2. median - the median is the value of the point which has half of the
data smaller than it and half of the data larger than it.
3. mode - the mode is the value of the random sample that occurs
with the greatest frequency. It is not necessarily unique. The
mode is typically used in a qualitative fashion. For example, there
may be a single dominant hump in the data or perhaps two or more
smaller humps in the data. This is usually evident from a
histogram of the data.
When taking samples from continuous populations, we need to be
somewhat careful in how we define the mode. That is, any
specific value may not occur more than once if the data are
continuous. What may be a more meaningful, if less exact, measure is
the midpoint of the class interval of the histogram with the highest
peak.
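A minimal Python sketch of these three location measures, taking the mode as the midpoint of the tallest histogram bin as described above (the sample is simulated):

```python
import numpy as np

# Hypothetical sample; the "mode" is the midpoint of the tallest histogram bin.
rng = np.random.default_rng(18)
y = rng.normal(size=10000)

counts, edges = np.histogram(y, bins=50)
peak = np.argmax(counts)
mode_mid = 0.5 * (edges[peak] + edges[peak + 1])
print(f"mean = {y.mean():.3f}, median = {np.median(y):.3f}, "
      f"mode (bin midpoint) = {mode_mid:.3f}")
```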
Why A natural question is why we have more than one measure of the typical
Different value. The following example helps to explain why these alternative
Measures definitions are useful and necessary.
This plot shows histograms for 10,000 random numbers generated from
a normal, an exponential, a Cauchy, and a lognormal distribution.
Normal The first histogram is a sample from a normal distribution. The mean is
Distribution 0.005, the median is -0.010, and the mode is -0.144 (the mode is
computed as the midpoint of the histogram interval with the highest
peak).
The normal distribution is a symmetric distribution with well-behaved
tails and a single peak at the center of the distribution. By symmetric,
we mean that the distribution can be folded about an axis so that the 2
sides coincide. That is, it behaves the same to the left and right of some
center point. For a normal distribution, the mean, median, and mode are
actually equivalent. The histogram above generates similar estimates for
the mean, median, and mode. Therefore, if a histogram or normal
probability plot indicates that your data are approximated well by a
normal distribution, then it is reasonable to use the mean as the location
estimator.
Cauchy The third histogram is a sample from a Cauchy distribution. The mean is
Distribution 3.70, the median is -0.016, and the mode is -0.362 (the mode is
computed as the midpoint of the histogram interval with the highest
peak).
For better visual comparison with the other data sets, we restricted the
histogram of the Cauchy distribution to values between -10 and 10. The
full Cauchy data set in fact has a minimum of approximately -29,000
and a maximum of approximately 89,000.
The Cauchy distribution is a symmetric distribution with heavy tails and
a single peak at the center of the distribution. The Cauchy distribution
has the interesting property that collecting more data does not provide a
more accurate estimate of the mean. That is, the sampling distribution of
the mean is equivalent to the sampling distribution of the original data.
This means that for the Cauchy distribution the mean is useless as a
measure of the typical value. For this histogram, the mean of 3.7 is well
above the vast majority of the data. This is caused by a few very
extreme values in the tail. However, the median does provide a useful
measure for the typical value.
Although the Cauchy distribution is an extreme case, it does illustrate
the importance of heavy tails in measuring the mean. Extreme values in
the tails distort the mean. However, these extreme values do not distort
the median since the median is based on ranks. In general, for data with
extreme values in the tails, the median provides a better estimate of
location than does the mean.
Robustness There are various alternatives to the mean and median for measuring
location. These alternatives were developed to address non-normal data
since the mean is an optimal estimator if in fact your data are normal.
Tukey and Mosteller defined two types of robustness where robustness
is a lack of susceptibility to the effects of nonnormality.
1. Robustness of validity means that the confidence intervals for the
population location have a 95% chance of covering the population
location regardless of what the underlying distribution is.
2. Robustness of efficiency refers to high effectiveness in the face of
non-normal tails. That is, confidence intervals for the population
location tend to be almost as narrow as the best that could be done
if we knew the true shape of the distribution.
The mean is an example of an estimator that is the best we can do if the
underlying distribution is normal. However, it lacks robustness of
validity. That is, confidence intervals based on the mean tend not to be
precise if the underlying distribution is in fact not normal.
The median is an example of an estimator that tends to have
robustness of validity but not robustness of efficiency.
The alternative measures of location try to balance these two concepts of
robustness. That is, the confidence intervals for the case when the data
are normal should be almost as narrow as the confidence intervals based
on the mean. However, they should maintain their validity even if the
underlying data are not normal. In particular, these alternatives address
the problem of heavy-tailed distributions.
Case Study The uniform random numbers case study compares the performance of
several different location estimators for a particular non-normal
distribution.
This simply means that noisy data, i.e., data with a large standard deviation, are
going to generate wider intervals than data with a smaller standard deviation.
Definition: To test whether the population mean has a specific value, $\mu_0$, against the two-sided
Hypothesis alternative that it does not have the value $\mu_0$, the confidence interval is converted to
Test hypothesis-test form. The test is a one-sample t-test, and it is defined as:
H0: $\mu = \mu_0$
Ha: $\mu \ne \mu_0$
Test Statistic: $T = \dfrac{\bar{Y} - \mu_0}{s/\sqrt{N}}$
where $\bar{Y}$, N, and $s$ are defined as above.
Significance Level: $\alpha$. The most commonly used value for $\alpha$ is 0.05.
Critical Region: Reject the null hypothesis that the mean is a specified value, $\mu_0$,
if $T < -t_{\alpha/2,\,N-1}$
or $T > t_{\alpha/2,\,N-1}$.
Sample Dataplot generated the following output for a confidence interval from the
Output for ZARR13.DAT data set:
Confidence
Interval
CONFIDENCE LIMITS FOR MEAN
(2-SIDED)
Interpretation The first few lines print the sample statistics used in calculating the confidence
of the Sample interval. The table shows the confidence interval for several different significance
Output levels. The first column lists the confidence level (which is $1-\alpha$ expressed as a
percent), the second column lists the t-value (i.e., $t_{\alpha/2,\,N-1}$), the third column lists
the t-value times the standard error (the standard error is $s/\sqrt{N}$), the fourth column
lists the lower confidence limit, and the fifth column lists the upper confidence limit.
For example, for a 95% confidence interval, we go to the row identified by 95.000 in
the first column and extract an interval of (9.25824, 9.26468) from the last two
columns.
Output from other statistical software may look somewhat different from the above
output.
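As an illustrative alternative to the Dataplot output, the confidence interval and one-sample t-test can be computed with SciPy; the sample below is a simulated stand-in for ZARR13.DAT (N = 195, mean near 9.26).

```python
import numpy as np
from scipy import stats

# Hypothetical measurements standing in for ZARR13.DAT.
rng = np.random.default_rng(12)
y = rng.normal(loc=9.2615, scale=0.0228, size=195)

n, ybar, s = y.size, y.mean(), y.std(ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)                 # two-sided 95 % interval
half_width = t_crit * s / np.sqrt(n)
print(f"95% CI for the mean: ({ybar - half_width:.5f}, {ybar + half_width:.5f})")

# One-sample t-test of H0: mu = 5 against the two-sided alternative.
t_stat, p_value = stats.ttest_1samp(y, popmean=5.0)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```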
Sample Dataplot generated the following output for a one-sample t-test from the
Output for t ZARR13.DAT data set:
Test
T TEST
(1-SAMPLE)
MU0 = 5.000000
NULL HYPOTHESIS UNDER TEST--MEAN MU = 5.000000
SAMPLE:
NUMBER OF OBSERVATIONS = 195
MEAN = 9.261460
STANDARD DEVIATION = 0.2278881E-01
STANDARD DEVIATION OF MEAN = 0.1631940E-02
TEST:
MEAN-MU0 = 4.261460
T TEST STATISTIC VALUE = 2611.284
DEGREES OF FREEDOM = 194.0000
T TEST STATISTIC CDF VALUE = 1.000000
ALTERNATIVE HYPOTHESIS        ACCEPTANCE INTERVAL        CONCLUSION
MU <> 5.000000 (0,0.025) (0.975,1) ACCEPT
MU < 5.000000 (0,0.05) REJECT
MU > 5.000000 (0.95,1) ACCEPT
Interpretation We are testing the hypothesis that the population mean is 5. The output is divided into
of Sample three sections.
Output 1. The first section prints the sample statistics used in the computation of the t-test.
2. The second section prints the t-test statistic value, the degrees of freedom, and
the cumulative distribution function (cdf) value of the t-test statistic. The t-test
statistic cdf value is an alternative way of expressing the critical value. This cdf
value is compared to the acceptance intervals printed in section three. For an
upper one-tailed test, the alternative hypothesis acceptance interval is $(1-\alpha,\,1)$,
the alternative hypothesis acceptance interval for a lower one-tailed test is $(0,\,\alpha)$,
and the alternative hypothesis acceptance interval for a two-tailed test is
$(1-\alpha/2,\,1)$ or $(0,\,\alpha/2)$. Note that accepting the alternative hypothesis is equivalent to
rejecting the null hypothesis.
3. The third section prints the conclusions for a 95% test since this is the most
common case. Results are given in terms of the alternative hypothesis for the
two-tailed test and for the one-tailed test in both directions. The alternative
hypothesis acceptance interval column is stated in terms of the cdf value printed
in section two. The last column specifies whether the alternative hypothesis is
accepted or rejected. For a different significance level, the appropriate
conclusion can be drawn from the t-test statistic cdf value printed in section
two. For example, for a significance level of 0.10, the corresponding alternative
hypothesis acceptance intervals are (0,0.05) and (0.95,1), (0, 0.10), and (0.90,1).
Output from other statistical software may look somewhat different from the above
output.
Questions Confidence limits for the mean can be used to answer the following questions:
1. What is a reasonable estimate for the mean?
2. How much variability is there in the estimate of the mean?
3. Does a given target value fall within the confidence limits?
Software Confidence limits for the mean and one-sample t-tests are available in just about all
general purpose statistical software programs, including Dataplot.
Definition The two-sample t-test for unpaired data is defined as:
H0: $\mu_1 = \mu_2$
Ha: $\mu_1 \ne \mu_2$
Test Statistic: $T = \dfrac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{s_1^2/N_1 + s_2^2/N_2}}$
where $N_1$ and $N_2$ are the sample sizes, $\bar{Y}_1$ and $\bar{Y}_2$ are the sample
means, and $s_1^2$ and $s_2^2$ are the sample variances.
If equal variances are assumed, the test statistic uses the pooled standard deviation,
$T = \dfrac{\bar{Y}_1 - \bar{Y}_2}{s_p\sqrt{1/N_1 + 1/N_2}}$, where $s_p^2 = \dfrac{(N_1-1)s_1^2 + (N_2-1)s_2^2}{N_1+N_2-2}$.
Significance Level: $\alpha$.
Critical Region: Reject the null hypothesis that the two means are equal if $T < -t_{\alpha/2,\,\nu}$
or $T > t_{\alpha/2,\,\nu}$, where $\nu$ denotes the degrees of freedom ($N_1 + N_2 - 2$ when equal
variances are assumed).
Sample Dataplot generated the following output for the t test from the AUTO83B.DAT
Output data set:
T TEST
(2-SAMPLE)
NULL HYPOTHESIS UNDER TEST--POPULATION MEANS MU1 = MU2
SAMPLE 1:
NUMBER OF OBSERVATIONS = 249
MEAN = 20.14458
STANDARD DEVIATION = 6.414700
STANDARD DEVIATION OF MEAN = 0.4065151
SAMPLE 2:
NUMBER OF OBSERVATIONS = 79
MEAN = 30.48101
STANDARD DEVIATION = 6.107710
STANDARD DEVIATION OF MEAN = 0.6871710
ALTERNATIVE HYPOTHESIS        ACCEPTANCE INTERVAL        CONCLUSION
MU1 <> MU2 (0,0.025) (0.975,1) ACCEPT
MU1 < MU2 (0,0.05) ACCEPT
MU1 > MU2 (0.95,1) REJECT
Interpretation We are testing the hypothesis that the population mean is equal for the two
of Sample samples. The output is divided into five sections.
Output 1. The first section prints the sample statistics for sample one used in the
computation of the t-test.
2. The second section prints the sample statistics for sample two used in the
computation of the t-test.
3. The third section prints the pooled standard deviation, the difference in the
means, the t-test statistic value, the degrees of freedom, and the cumulative
distribution function (cdf) value of the t-test statistic under the assumption
that the standard deviations are equal. The t-test statistic cdf value is an
alternative way of expressing the critical value. This cdf value is compared
to the acceptance intervals printed in section five. For an upper one-tailed
test, the acceptance interval is $(0,\,1-\alpha)$, the acceptance interval for a
two-tailed test is $(\alpha/2,\,1-\alpha/2)$, and the acceptance interval for a lower
one-tailed test is $(\alpha,\,1)$.
4. The fourth section prints the pooled standard deviation, the difference in
the means, the t-test statistic value, the degrees of freedom, and the
cumulative distribution function (cdf) value of the t-test statistic under the
assumption that the standard deviations are not equal. The t-test statistic cdf
value is an alternative way of expressing the critical value. This cdf value is
compared to the acceptance intervals printed in section five. For an upper
one-tailed test, the alternative hypothesis acceptance interval is $(1-\alpha,\,1)$,
the alternative hypothesis acceptance interval for a lower one-tailed test is
$(0,\,\alpha)$, and the alternative hypothesis acceptance interval for a two-tailed
test is $(1-\alpha/2,\,1)$ or $(0,\,\alpha/2)$. Note that accepting the alternative hypothesis
is equivalent to rejecting the null hypothesis.
5. The fifth section prints the conclusions for a 95% test under the assumption
that the standard deviations are not equal since a 95% test is the most
common case. Results are given in terms of the alternative hypothesis for
the two-tailed test and for the one-tailed test in both directions. The
alternative hypothesis acceptance interval column is stated in terms of the
cdf value printed in section four. The last column specifies whether the
alternative hypothesis is accepted or rejected. For a different significance
level, the appropriate conclusion can be drawn from the t-test statistic cdf
value printed in section four. For example, for a significance level of 0.10,
the corresponding alternative hypothesis acceptance intervals are (0,0.05)
and (0.95,1), (0, 0.10), and (0.90,1).
Output from other statistical software may look somewhat different from the
above output.
Software Two-sample t-tests are available in just about all general purpose statistical
software programs, including Dataplot.
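As an illustrative alternative to the Dataplot output, SciPy's ttest_ind performs the two-sample t-test; the samples below are simulated stand-ins for the two AUTO83B.DAT columns, in which the -999 entries denote missing values and would be dropped before testing.

```python
import numpy as np
from scipy import stats

# Hypothetical mileage samples standing in for the two AUTO83B.DAT columns.
rng = np.random.default_rng(13)
us_mpg = rng.normal(loc=20.1, scale=6.4, size=249)
japan_mpg = rng.normal(loc=30.5, scale=6.1, size=79)

# Welch's version (equal_var=False) does not assume equal standard deviations.
t_stat, p_value = stats.ttest_ind(us_mpg, japan_mpg, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```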
18 19
14 32
14 34
14 26
14 30
12 22
13 22
13 33
18 39
22 36
19 28
18 27
23 21
26 24
25 30
20 34
21 32
13 38
14 37
15 30
14 31
17 37
11 32
13 47
12 41
13 45
15 34
13 33
13 24
14 32
22 39
28 35
13 32
14 37
13 38
14 34
15 34
12 32
13 33
13 32
14 25
13 24
12 37
13 31
18 36
16 36
18 34
18 38
23 32
11 38
12 32
13 -999
12 -999
18 -999
21 -999
19 -999
21 -999
15 -999
16 -999
15 -999
11 -999
20 -999
21 -999
19 -999
15 -999
26 -999
25 -999
16 -999
16 -999
18 -999
16 -999
13 -999
14 -999
14 -999
14 -999
28 -999
19 -999
18 -999
15 -999
15 -999
16 -999
15 -999
16 -999
14 -999
17 -999
16 -999
15 -999
18 -999
21 -999
20 -999
13 -999
23 -999
20 -999
23 -999
18 -999
19 -999
25 -999
26 -999
18 -999
16 -999
16 -999
15 -999
22 -999
22 -999
24 -999
23 -999
29 -999
25 -999
20 -999
18 -999
19 -999
18 -999
27 -999
13 -999
17 -999
13 -999
13 -999
13 -999
30 -999
26 -999
18 -999
17 -999
16 -999
15 -999
18 -999
21 -999
19 -999
19 -999
16 -999
16 -999
16 -999
16 -999
25 -999
26 -999
31 -999
34 -999
36 -999
20 -999
19 -999
20 -999
19 -999
21 -999
20 -999
25 -999
21 -999
19 -999
21 -999
21 -999
19 -999
18 -999
19 -999
18 -999
18 -999
18 -999
30 -999
31 -999
23 -999
24 -999
22 -999
20 -999
22 -999
20 -999
21 -999
17 -999
18 -999
17 -999
18 -999
17 -999
16 -999
19 -999
19 -999
36 -999
27 -999
23 -999
24 -999
34 -999
35 -999
28 -999
29 -999
27 -999
34 -999
32 -999
28 -999
26 -999
24 -999
19 -999
28 -999
24 -999
27 -999
27 -999
26 -999
24 -999
30 -999
39 -999
35 -999
34 -999
30 -999
22 -999
27 -999
20 -999
18 -999
28 -999
27 -999
34 -999
31 -999
29 -999
27 -999
24 -999
23 -999
38 -999
36 -999
25 -999
38 -999
26 -999
22 -999
36 -999
27 -999
27 -999
32 -999
28 -999
31 -999
This model, $Y_{ij} = \mu_i + E_{ij}$, decomposes the response into a mean for each cell and an
error term. The analysis of variance provides estimates for each cell
mean. These estimated cell means are the predicted values of the model
and the differences between the response variable and the estimated cell
means are the residuals. That is,
$\hat{Y}_{ij} = \bar{Y}_i , \qquad R_{ij} = Y_{ij} - \bar{Y}_i .$
The second model, $Y_{ij} = \mu + A_i + E_{ij}$, decomposes the response into an overall (grand) mean, the
effect of the ith factor level, and an error term. The analysis of variance
provides estimates of the grand mean and the effect of the ith factor
level. The predicted values and the residuals of the model are
$\hat{Y}_{ij} = \bar{Y} + \hat{A}_i , \qquad R_{ij} = Y_{ij} - \bar{Y} - \hat{A}_i .$
The distinction between these models is that the second model divides
the cell mean into an overall mean and factor effects. This second model
makes the factor effect more explicit, so we will emphasize this
approach.
Model Note that the ANOVA model assumes that the error term, Eij, should
Validation follow the assumptions for a univariate measurement process. That is,
after performing an analysis of variance, the model should be validated
by analyzing the residuals.
Sample Dataplot generated the following output for the one-way analysis of variance from the GEAR.DAT data set.
Output
*****************
* ANOVA TABLE *
*****************
****************
* ESTIMATION *
****************
For these data, including the factor effect reduces the residual
standard deviation from 0.00623 to 0.0059. That is, although the
factor is statistically significant, it has minimal improvement
over a simple constant model. This is because the factor is just
barely significant.
Output from other statistical software may look somewhat different
from the above output.
In addition to the quantitative ANOVA output, it is recommended that
any analysis of variance be complemented with model validation. At a
minimum, this should include
1. A run sequence plot of the residuals.
2. A normal probability plot of the residuals.
3. A scatter plot of the predicted values against the residuals.
Question The analysis of variance can be used to answer the following question
● Are means the same across groups in the data?
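As an illustrative sketch (not the Handbook's Dataplot macro), SciPy's f_oneway performs the one-factor analysis of variance; the batch data below are simulated stand-ins for GEAR.DAT, and the residuals are formed for the model-validation plots listed above.

```python
import numpy as np
from scipy import stats

# Hypothetical gear-diameter measurements for ten batches of ten parts.
rng = np.random.default_rng(14)
batches = [rng.normal(loc=0.998 + 0.001 * (i % 3), scale=0.006, size=10)
           for i in range(10)]

# One-factor (one-way) analysis of variance: are the batch means equal?
f_stat, p_value = stats.f_oneway(*batches)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")

# Residuals for model validation: observation minus its estimated cell (batch) mean.
residuals = np.concatenate([b - b.mean() for b in batches])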
This model decomposes the response into a mean for each cell and an
error term. The analysis of variance provides estimates for each cell
mean. These cell means are the predicted values of the model and the
differences between the response variable and the estimated cell means
are the residuals. That is
The distinction between these models is that the second model divides
the cell mean into an overall mean and factor effects. This second model
makes the factor effect more explicit, so we will emphasize this
approach.
Model Note that the ANOVA model assumes that the error term, Eijk, should
Validation follow the assumptions for a univariate measurement process. That is,
after performing an analysis of variance, the model should be validated
by analyzing the residuals.
Sample Dataplot generated the following ANOVA output for the JAHANMI2.DAT data set:
Output
**********************************
**********************************
** 4-WAY ANALYSIS OF VARIANCE **
**********************************
**********************************
*****************
* ANOVA TABLE *
*****************
****************
* ESTIMATION *
****************
a model with each factor individually, and the model with all
four factors included.
For these data, we see that including factor 4 has a significant
impact on the residual standard deviation (63.73 when only the
factor 4 effect is included compared to 63.058 when all four
factors are included).
Output from other statistical software may look somewhat different
from the above output.
In addition to the quantitative ANOVA output, it is recommended that
any analysis of variance be complemented with model validation. At a
minimum, this should include
1. A run sequence plot of the residuals.
2. A normal probability plot of the residuals.
3. A scatter plot of the predicted values against the residuals.
Case Study The quantitative ANOVA approach can be contrasted with the more
graphical EDA approach in the ceramic strength case study.
Definitions of For univariate data, there are several common numerical measures of
Variability the spread:
1. variance - the variance is defined as
$s^2 = \dfrac{\sum_{i=1}^{N}(Y_i - \bar{Y})^2}{N-1} ,$
where $\bar{Y}$ is the mean of the data and N is the number of data points.
2. standard deviation - the standard deviation is the square root of
the variance.
3. range - the range is the largest value minus the smallest value.
4. average absolute deviation - the average absolute deviation
(AAD) is defined as
$\mathrm{AAD} = \dfrac{\sum_{i=1}^{N}|Y_i - \bar{Y}|}{N} ,$
where $\bar{Y}$ is the mean of the data and |Y| is the absolute value of
Y. This measure does not square the distance from the mean, so
it is less affected by extreme observations than are the variance
and standard deviation.
5. median absolute deviation - the median absolute deviation
(MAD) is defined as
$\mathrm{MAD} = \mathrm{median}\left(|Y_i - \tilde{Y}|\right) ,$
where $\tilde{Y}$ is the median of the data and |Y| is the absolute value
of Y. This is a variation of the average absolute deviation that is
even less affected by extremes in the tail because the data in the
tails have less influence on the calculation of the median than
they do on the mean.
6. interquartile range - this is the value of the 75th percentile
minus the value of the 25th percentile. This measure of scale
attempts to measure the variability of points near the center.
In summary, the variance, standard deviation, average absolute
deviation, and median absolute deviation measure both aspects of the
variability; that is, the variability near the center and the variability in
the tails. They differ in that the average absolute deviation and median
absolute deviation do not give undue weight to the tail behavior. On
the other hand, the range only uses the two most extreme points and
the interquartile range only uses the middle portion of the data.
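A minimal Python sketch computing the scale measures discussed above on a single simulated heavy-tailed sample:

```python
import numpy as np

# Hypothetical heavy-tailed sample.
rng = np.random.default_rng(19)
y = rng.standard_t(df=3, size=1000)

variance = y.var(ddof=1)
std_dev = y.std(ddof=1)
data_range = y.max() - y.min()
aad = np.mean(np.abs(y - y.mean()))              # average absolute deviation
mad = np.median(np.abs(y - np.median(y)))        # median absolute deviation
iqr = np.percentile(y, 75) - np.percentile(y, 25)
print(variance, std_dev, data_range, aad, mad, iqr)
```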
Why Different The following example helps to clarify why these alternative
Measures? definitions of spread are useful and necessary.
This plot shows histograms for 10,000 random numbers generated
from a normal, a double exponential, a Cauchy, and a Tukey-Lambda
distribution.
In the above, $s_i^2$ is the variance of the ith group, N is the total sample size,
$N_i$ is the sample size of the ith group, k is the number of groups, and $s_p^2$ is
the pooled variance. The pooled variance is a weighted average of the
group variances and is defined as:
$s_p^2 = \dfrac{\sum_{i=1}^{k}(N_i - 1)\, s_i^2}{N - k} .$
Significance
Level: α
Sample Dataplot generated the following output for Bartlett's test using the GEAR.DAT
Output data set:
BARTLETT TEST
(STANDARD DEFINITION)
NULL HYPOTHESIS UNDER TEST--ALL SIGMA(I) ARE EQUAL
TEST:
DEGREES OF FREEDOM = 9.000000
Interpretation We are testing the hypothesis that the group variances are all equal.
of Sample The output is divided into two sections.
Output 1. The first section prints the value of the Bartlett test statistic, the
degrees of freedom (k-1), the upper critical value of the
chi-square distribution corresponding to significance levels of
0.05 (the 95 % point) and 0.01 (the 99 % point).
We reject the null hypothesis at that significance level if the
value of the Bartlett test statistic is greater than the
corresponding critical value.
2. The second section prints the conclusion for a 95% test.
Output from other statistical software may look somewhat different
from the above output.
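A minimal sketch of Bartlett's test in Python using scipy.stats.bartlett (not Dataplot); the
ten groups below are synthetic placeholders for the GEAR.DAT batches.

# Minimal sketch: Bartlett's test for equal group variances.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
groups = [rng.normal(loc=1.0, scale=0.005, size=10) for _ in range(10)]  # placeholder data

statistic, p_value = stats.bartlett(*groups)
print(f"Bartlett test statistic = {statistic:.4f}")
print(f"p-value                 = {p_value:.4f}")
# Reject the null hypothesis of equal variances when the p-value is below
# the chosen significance level (e.g., 0.05).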
Test Statistic: The chi-square hypothesis test for a single standard deviation is based on the test statistic
T = (N - 1)(s/σ₀)²
where N is the sample size and s is the sample standard deviation. The key
element of this formula is the ratio s/σ₀, which compares the sample standard
deviation to the target standard deviation σ₀. The more this ratio deviates from 1, the more
likely we are to reject the null hypothesis.
Significance Level: α.
Critical Region: Reject the null hypothesis that the standard deviation is a specified value,
σ₀, if
T > χ²_{(1-α, N-1)} for an upper one-tailed alternative,
T < χ²_{(α, N-1)} for a lower one-tailed alternative, or
T < χ²_{(α/2, N-1)} or T > χ²_{(1-α/2, N-1)} for a two-tailed test,
where χ²_{(p, N-1)} denotes the critical value of the chi-square distribution with N - 1 degrees of freedom.
In the above formulas for the critical regions, the Handbook follows the
convention that χ²_{1-α} is the upper critical value from the chi-square
distribution and χ²_{α} is the lower critical value from the chi-square
distribution. Note that this is the opposite of some texts and software
programs. In particular, Dataplot uses the opposite convention.
The formula for the hypothesis test can easily be converted to form an interval estimate for the
standard deviation:
sqrt((N - 1) s² / χ²_{(1-α/2, N-1)}) ≤ σ ≤ sqrt((N - 1) s² / χ²_{(α/2, N-1)})
Sample Dataplot generated the following output for a chi-square test from the GEAR.DAT data set:
Output
CHI-SQUARED TEST
SIGMA0 = 0.1000000
NULL HYPOTHESIS UNDER TEST--STANDARD DEVIATION SIGMA = .1000000
SAMPLE:
NUMBER OF OBSERVATIONS = 100
MEAN = 0.9976400
STANDARD DEVIATION S = 0.6278908E-02
TEST:
S/SIGMA0 = 0.6278908E-01
CHI-SQUARED STATISTIC = 0.3903044
DEGREES OF FREEDOM = 99.00000
CHI-SQUARED CDF VALUE = 0.000000
ALTERNATIVE- ALTERNATIVE-
ALTERNATIVE- HYPOTHESIS HYPOTHESIS
HYPOTHESIS ACCEPTANCE INTERVAL CONCLUSION
SIGMA <> .1000000 (0,0.025), (0.975,1) ACCEPT
SIGMA < .1000000 (0,0.05) ACCEPT
SIGMA > .1000000 (0.95,1) REJECT
Interpretation We are testing the hypothesis that the population standard deviation is 0.1. The output is
of Sample divided into three sections.
Output 1. The first section prints the sample statistics used in the computation of the chi-square
test.
2. The second section prints the chi-square test statistic value, the degrees of freedom, and
the cumulative distribution function (cdf) value of the chi-square test statistic. The
chi-square test statistic cdf value is an alternative way of expressing the critical value.
This cdf value is compared to the acceptance intervals printed in section three. For an
upper one-tailed test, the alternative hypothesis acceptance interval is (1 - α, 1), the
alternative hypothesis acceptance interval for a lower one-tailed test is (0, α), and the
alternative hypothesis acceptance interval for a two-tailed test is (1 - α/2, 1) or (0, α/2).
Note that accepting the alternative hypothesis is equivalent to rejecting the null
hypothesis.
3. The third section prints the conclusions for a 95% test since this is the most common
case. Results are given in terms of the alternative hypothesis for the two-tailed test and
for the one-tailed test in both directions. The alternative hypothesis acceptance interval
column is stated in terms of the cdf value printed in section two. The last column
specifies whether the alternative hypothesis is accepted or rejected. For a different
significance level, the appropriate conclusion can be drawn from the chi-square test
statistic cdf value printed in section two. For example, for a significance level of 0.10,
the corresponding alternative hypothesis acceptance intervals are (0, 0.05) and (0.95, 1)
for the two-tailed test, (0, 0.10) for the lower one-tailed test, and (0.90, 1) for the
upper one-tailed test.
Output from other statistical software may look somewhat different from the above output.
Questions The chi-square test can be used to answer the following questions:
1. Is the standard deviation equal to some pre-determined threshold value?
2. Is the standard deviation greater than some pre-determined threshold value?
3. Is the standard deviation less than some pre-determined threshold value?
Related F Test
Techniques Bartlett Test
Levene Test
Software The chi-square test for the standard deviation is available in many general purpose statistical
software programs, including Dataplot.
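A minimal sketch of the one-sample chi-square test for a standard deviation in Python with
numpy and scipy; it follows the test statistic T = (N - 1)(s/σ₀)² described above, and the
data are synthetic placeholders.

# Minimal sketch: chi-square test that the population standard deviation equals sigma0.
import numpy as np
from scipy import stats

y = np.random.default_rng(1).normal(loc=1.0, scale=0.006, size=100)  # placeholder data
sigma0 = 0.1

n = len(y)
s = np.std(y, ddof=1)                    # sample standard deviation
t = (n - 1) * (s / sigma0) ** 2          # chi-square test statistic
cdf = stats.chi2.cdf(t, df=n - 1)        # cdf value, as in the output above

alpha = 0.05
lower = stats.chi2.ppf(alpha / 2, df=n - 1)
upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)
print(f"T = {t:.4f}, cdf = {cdf:.4f}")
print("reject H0 (two-sided)" if (t < lower or t > upper) else "fail to reject H0")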
0.999 3.000
0.996 3.000
0.996 3.000
1.005 4.000
1.002 4.000
0.994 4.000
1.000 4.000
0.995 4.000
0.994 4.000
0.998 4.000
0.996 4.000
1.002 4.000
0.996 4.000
0.998 5.000
0.998 5.000
0.982 5.000
0.990 5.000
1.002 5.000
0.984 5.000
0.996 5.000
0.993 5.000
0.980 5.000
0.996 5.000
1.009 6.000
1.013 6.000
1.009 6.000
0.997 6.000
0.988 6.000
1.002 6.000
0.995 6.000
0.998 6.000
0.981 6.000
0.996 6.000
0.990 7.000
1.004 7.000
0.996 7.000
1.001 7.000
0.998 7.000
1.000 7.000
1.018 7.000
1.010 7.000
0.996 7.000
1.002 7.000
0.998 8.000
1.000 8.000
1.006 8.000
1.000 8.000
1.002 8.000
0.996 8.000
0.998 8.000
0.996 8.000
1.002 8.000
1.006 8.000
1.002 9.000
0.998 9.000
0.996 9.000
0.995 9.000
0.996 9.000
1.004 9.000
1.004 9.000
0.998 9.000
0.999 9.000
0.991 9.000
0.991 10.000
0.995 10.000
0.984 10.000
0.994 10.000
0.997 10.000
0.997 10.000
0.991 10.000
0.998 10.000
1.004 10.000
0.997 10.000
Critical The hypothesis that the two standard deviations are equal is rejected if
Region:
F > F_{(1-α, N1-1, N2-1)} for an upper one-tailed test,
or
F < F_{(α, N1-1, N2-1)} for a lower one-tailed test,
or
F < F_{(α/2, N1-1, N2-1)} or F > F_{(1-α/2, N1-1, N2-1)} for a two-tailed test,
where F_{(p, N1-1, N2-1)} denotes the critical value of the F distribution with N1 - 1 and
N2 - 1 degrees of freedom.
Sample Dataplot generated the following output for an F-test from the JAHANMI2.DAT data
Output set:
F TEST
NULL HYPOTHESIS UNDER TEST--SIGMA1 = SIGMA2
ALTERNATIVE HYPOTHESIS UNDER TEST--SIGMA1 NOT EQUAL SIGMA2
SAMPLE 1:
NUMBER OF OBSERVATIONS = 240
MEAN = 688.9987
STANDARD DEVIATION = 65.54909
SAMPLE 2:
NUMBER OF OBSERVATIONS = 240
MEAN = 611.1559
STANDARD DEVIATION = 61.85425
TEST:
STANDARD DEV. (NUMERATOR) = 65.54909
STANDARD DEV. (DENOMINATOR) = 61.85425
F TEST STATISTIC VALUE = 1.123037
DEG. OF FREEDOM (NUMER.) = 239.0000
DEG. OF FREEDOM (DENOM.) = 239.0000
F TEST STATISTIC CDF VALUE = 0.814808
Interpretation We are testing the hypothesis that the standard deviations for sample one and sample
of Sample two are equal. The output is divided into four sections.
Output 1. The first section prints the sample statistics for sample one used in the
computation of the F-test.
2. The second section prints the sample statistics for sample two used in the
computation of the F-test.
3. The third section prints the numerator and denominator standard deviations, the
F-test statistic value, the degrees of freedom, and the cumulative distribution
function (cdf) value of the F-test statistic. The F-test statistic cdf value is an
alternative way of expressing the critical value. This cdf value is compared to the
acceptance interval printed in section four. The acceptance interval for a
two-tailed test is (0, 1 - α).
4. The fourth section prints the conclusions for a 95% test since this is the most
common case. Results are printed for an upper one-tailed test. The acceptance
interval column is stated in terms of the cdf value printed in section three. The
last column specifies whether the null hypothesis is accepted or rejected. For a
different significance level, the appropriate conclusion can be drawn from the
F-test statistic cdf value printed in section three. For example, for a significance
level of 0.10, the corresponding acceptance interval becomes (0.000, 0.900).
Output from other statistical software may look somewhat different from the above
output.
Software The F-test for equality of two standard deviations is available in many general purpose
statistical software programs, including Dataplot.
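A minimal sketch of the F-test for equality of two standard deviations in Python with numpy
and scipy; the two samples are synthetic placeholders for the JAHANMI2.DAT strength
measurements.

# Minimal sketch: F-test for the equality of two standard deviations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y1 = rng.normal(689.0, 65.5, size=240)   # placeholder for sample 1
y2 = rng.normal(611.2, 61.9, size=240)   # placeholder for sample 2

s1, s2 = np.std(y1, ddof=1), np.std(y2, ddof=1)
f = (s1 / s2) ** 2                       # F test statistic
df1, df2 = len(y1) - 1, len(y2) - 1
cdf = stats.f.cdf(f, df1, df2)           # cdf value of the statistic

alpha = 0.05
# two-tailed test: reject if the statistic falls in either tail
reject = f < stats.f.ppf(alpha / 2, df1, df2) or f > stats.f.ppf(1 - alpha / 2, df1, df2)
print(f"F = {f:.4f}, cdf = {cdf:.4f}, reject H0: {reject}")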
Critical The Levene test rejects the hypothesis that the variances are
Region: equal if
W > F_{(α, k-1, N-k)}
where F_{(α, k-1, N-k)} is the upper critical value of the F distribution with k - 1 and
N - k degrees of freedom at a significance level of α.
Sample Dataplot generated the following output for Levene's test using the
Output GEAR.DAT data set:
1. STATISTICS
NUMBER OF OBSERVATIONS = 100
NUMBER OF GROUPS = 10
LEVENE F TEST STATISTIC = 1.705910
Interpretation We are testing the hypothesis that the group variances are equal. The
of Sample output is divided into three sections.
Output 1. The first section prints the number of observations (N), the number
of groups (k), and the value of the Levene test statistic.
2. The second section prints the upper critical value of the F
distribution corresponding to various significance levels. The value
in the first column, the confidence level of the test, is equivalent to
100(1 - α). We reject the null hypothesis at that significance level if
the value of the Levene F test statistic printed in section one is
greater than the critical value printed in the last column.
3. The third section prints the conclusion for a 95% test. For a
different significance level, the appropriate conclusion can be drawn
from the table printed in section two. For example, for α = 0.10, we
look at the row for 90% confidence and compare the critical value
1.702 to the Levene test statistic 1.7059. Since the test statistic is
greater than the critical value, we reject the null hypothesis at the
α = 0.10 level.
Output from other statistical software may look somewhat different from
the above output.
Software The Levene test is available in some general purpose statistical software
programs, including Dataplot.
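A minimal sketch of the Levene test in Python using scipy.stats.levene; the groups are
synthetic placeholders. Note that scipy's default centers each group on its median (the
Brown-Forsythe variant), so center='mean' is requested here to match the original Levene
statistic.

# Minimal sketch: Levene's test for homogeneity of variances.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
groups = [rng.normal(1.0, 0.005, size=10) for _ in range(10)]  # placeholder data

statistic, p_value = stats.levene(*groups, center='mean')
print(f"Levene F statistic = {statistic:.4f}, p-value = {p_value:.4f}")
# Reject the hypothesis of equal variances when the p-value is below alpha.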
Definition of For univariate data Y1, Y2, ..., YN, the formula for skewness is:
Skewness
skewness = Σ(Yᵢ - Ȳ)³ / (N s³)
where Ȳ is the mean, s is the standard deviation, and N is the number of data
points (some sources use N - 1 in place of N).
Definition of For univariate data Y1, Y2, ..., YN, the formula for kurtosis is:
Kurtosis
kurtosis = Σ(Yᵢ - Ȳ)⁴ / (N s⁴)
With this definition, the kurtosis of a normal distribution is 3.
Examples The following example shows histograms for 10,000 random numbers
generated from a normal, a double exponential, a Cauchy, and a Weibull
distribution.
Normal The first histogram is a sample from a normal distribution. The normal
Distribution distribution is a symmetric distribution with well-behaved tails. This is
indicated by the skewness of 0.03. The kurtosis of 2.96 is near the
expected value of 3. The histogram verifies the symmetry.
Weibull The fourth histogram is a sample from a Weibull distribution with shape
Distribution parameter 1.5. The Weibull distribution is a skewed distribution with the
amount of skewness depending on the value of the shape parameter. The
degree of decay as we move away from the center also depends on the
value of the shape parameter. For this data set, the skewness is 1.08 and
the kurtosis is 4.46, which indicates moderate skewness and kurtosis.
The probability plot correlation coefficient plot and the probability plot are useful tools for
determining a good distributional model for the data.
Software The skewness and kurtosis coefficients are available in most general
purpose statistical software programs, including Dataplot.
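A minimal sketch of computing sample skewness and (non-excess) kurtosis in Python with
scipy for the four distributions discussed above; the random samples are generated in the
script rather than read from a data file.

# Minimal sketch: sample skewness and kurtosis for several distributions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
samples = {
    "normal":             rng.normal(size=10000),
    "double exponential": rng.laplace(size=10000),
    "cauchy":             rng.standard_cauchy(size=10000),
    "weibull (1.5)":      rng.weibull(1.5, size=10000),
}
for name, y in samples.items():
    skew = stats.skew(y)                      # near 0 for a symmetric distribution
    kurt = stats.kurtosis(y, fisher=False)    # near 3 for a normal distribution
    print(f"{name:20s} skewness = {skew:8.2f}  kurtosis = {kurt:8.2f}")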
1.3.5.12. Autocorrelation
Purpose: The autocorrelation ( Box and Jenkins, 1976) function can be used for
Detect the following two purposes:
Non-Randomness, 1. To detect non-randomness in data.
Time Series
Modeling 2. To identify an appropriate time series model if the data are not
random.
Definition Given measurements, Y1, Y2, ..., YN at time X1, X2, ..., XN, the lag k
autocorrelation function is defined as
r_k = Σ_{i=1}^{N-k} (Yᵢ - Ȳ)(Y_{i+k} - Ȳ) / Σ_{i=1}^{N} (Yᵢ - Ȳ)²
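A minimal sketch of the lag-k autocorrelation computed directly from the definition above, in
Python with numpy; it assumes LEW.DAT has been saved as a single column of numbers with any
header lines removed.

# Minimal sketch: lag-k autocorrelation computed from its definition.
import numpy as np

def autocorrelation(y, k):
    y = np.asarray(y, dtype=float)
    ybar = y.mean()
    den = np.sum((y - ybar) ** 2)
    num = den if k == 0 else np.sum((y[:-k] - ybar) * (y[k:] - ybar))
    return num / den

y = np.loadtxt("LEW.DAT")        # assumes a single-column copy of the data
for k in range(6):
    print(f"lag {k:2d}  r = {autocorrelation(y, k):6.2f}")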
Sample Output Dataplot generated the following autocorrelation output using the
LEW.DAT data set:
lag autocorrelation
0. 1.00
1. -0.31
2. -0.74
3. 0.77
4. 0.21
5. -0.90
6. 0.38
7. 0.63
8. -0.77
9. -0.12
10. 0.82
11. -0.40
12. -0.55
13. 0.73
14. 0.07
15. -0.76
16. 0.40
17. 0.48
18. -0.70
19. -0.03
20. 0.70
21. -0.41
22. -0.43
23. 0.67
24. 0.00
25. -0.66
26. 0.42
27. 0.39
28. -0.65
29. 0.03
30. 0.63
31. -0.42
32. -0.36
33. 0.64
34. -0.05
35. -0.60
36. 0.43
37. 0.32
38. -0.64
39. 0.08
40. 0.58
41. -0.45
42. -0.28
43. 0.62
44. -0.10
45. -0.55
46. 0.45
47. 0.25
48. -0.61
49. 0.14
Case Study The heat flow meter data demonstrate the use of autocorrelation in
determining if the data are from a random process.
The beam deflection data demonstrate the use of autocorrelation in
developing a non-linear sinusoidal model.
Typical Analysis The first step in the runs test is to compute the sequential differences (Y_i -
and Test Y_{i-1}). Positive values indicate an increasing value and negative values
Statistics indicate a decreasing value. A runs test should include information such as
the output shown below from Dataplot for the LEW.DAT data set. The
output shows a table of:
1. runs of length exactly I for I = 1, 2, ..., 10
2. number of runs of length I
3. expected number of runs of length I
4. standard deviation of the number of runs of length I
5. a z-score where the z-score is defined to be
   Z = (observed number of runs of length I - expected number of runs of length I) /
       (standard deviation of the number of runs of length I)
Sample Output Dataplot generated the following runs test output using the LEW.DAT data
set:
RUNS UP
RUNS DOWN
Interpretation of Scanning the last column labeled "Z", we note that most of the z-scores for
Sample Output run lengths 1, 2, and 3 have an absolute value greater than 1.96. This is strong
evidence that these data are in fact not random.
Output from other statistical software may look somewhat different from the
above output.
Question The runs test can be used to answer the following question:
● Were these sample data generated from a random process?
Related Autocorrelation
Techniques Run Sequence Plot
Lag Plot
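A minimal sketch of the first step of the runs test in Python: form the sequential differences
and count runs up and runs down of each length. The expected counts, standard deviations, and
z-scores require the published formulas and are not reproduced here; LEW.DAT is assumed to be
a single column with header lines removed.

# Minimal sketch: count runs up and runs down of each length.
import numpy as np
from itertools import groupby

y = np.loadtxt("LEW.DAT")                 # assumes a single-column copy of the data
signs = np.sign(np.diff(y))               # +1 increasing, -1 decreasing, 0 tie
runs_up, runs_down = {}, {}
for sign, run in groupby(signs[signs != 0]):
    length = len(list(run))               # length = consecutive same-sign differences
    table = runs_up if sign > 0 else runs_down
    table[length] = table.get(length, 0) + 1

print("runs up  :", sorted(runs_up.items()))
print("runs down:", sorted(runs_down.items()))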
Significance
Level: α
Critical The critical values for the Anderson-Darling test are dependent
Region: on the specific distribution that is being tested. Tabulated values
and formulas have been published (Stephens, 1974, 1976, 1977,
1979) for a few specific distributions (normal, lognormal,
exponential, Weibull, logistic, extreme value type 1). The test is
a one-sided test and the hypothesis that the distribution is of a
specific form is rejected if the test statistic, A, is greater than the
critical value.
Note that for a given distribution, the Anderson-Darling statistic
may be multiplied by a constant (which usually depends on the
sample size, n). These constants are given in the various papers
by Stephens. In the sample output below, this is the "adjusted
Anderson-Darling" statistic. This is what should be compared
against the critical values. Also, be aware that different constants
(and therefore critical values) have been published. You just
need to be aware of what constant was used for a given set of
critical values (the needed constant is typically given with the
critical values).
Sample Dataplot generated the following output for the Anderson-Darling test. 1,000
Output random numbers were generated for a normal, double exponential, Cauchy,
and lognormal distribution. In all four cases, the Anderson-Darling test was
applied to test for a normal distribution. When the data were generated using a
normal distribution, the test statistic was small and the hypothesis was
accepted. When the data were generated using the double exponential, Cauchy,
and lognormal distributions, the statistics were significant, and the hypothesis
of an underlying normal distribution was rejected at significance levels of 0.10,
0.05, and 0.01.
The normal random numbers were stored in the variable Y1, the double
exponential random numbers were stored in the variable Y2, the Cauchy
random numbers were stored in the variable Y3, and the lognormal random
numbers were stored in the variable Y4.
***************************************
** anderson darling normal test y1 **
***************************************
1. STATISTICS:
NUMBER OF OBSERVATIONS = 1000
MEAN = 0.4359940E-02
2. CRITICAL VALUES:
90 % POINT = 0.6560000
95 % POINT = 0.7870000
97.5 % POINT = 0.9180000
99 % POINT = 1.092000
***************************************
** anderson darling normal test y2 **
***************************************
1. STATISTICS:
NUMBER OF OBSERVATIONS = 1000
MEAN = 0.2034888E-01
STANDARD DEVIATION = 1.321627
2. CRITICAL VALUES:
90 % POINT = 0.6560000
95 % POINT = 0.7870000
97.5 % POINT = 0.9180000
99 % POINT = 1.092000
***************************************
** anderson darling normal test y3 **
***************************************
1. STATISTICS:
NUMBER OF OBSERVATIONS = 1000
MEAN = 1.503854
STANDARD DEVIATION = 35.13059
2. CRITICAL VALUES:
90 % POINT = 0.6560000
95 % POINT = 0.7870000
97.5 % POINT = 0.9180000
99 % POINT = 1.092000
***************************************
** anderson darling normal test y4 **
***************************************
1. STATISTICS:
NUMBER OF OBSERVATIONS = 1000
MEAN = 1.518372
STANDARD DEVIATION = 1.719969
2. CRITICAL VALUES:
90 % POINT = 0.6560000
95 % POINT = 0.7870000
97.5 % POINT = 0.9180000
99 % POINT = 1.092000
The Anderson-Darling test statistic for the normal data is 0.256. Since the test
statistic is less than the critical value, we do not reject the null
hypothesis at the α = 0.10 level.
As we would hope, the Anderson-Darling test accepts the hypothesis of
normality for the normal random numbers and rejects it for the 3 non-normal
cases.
The output from other statistical software programs may differ somewhat from
the output above.
Questions The Anderson-Darling test can be used to answer the following questions:
● Are the data from a normal distribution?
Importance Many statistical tests and procedures are based on specific distributional
assumptions. The assumption of normality is particularly common in classical
statistical tests. Much reliability modeling is based on the assumption that the
data follow a Weibull distribution.
There are many non-parametric and robust techniques that do not make strong
distributional assumptions. However, techniques based on specific
distributional assumptions are in general more powerful than non-parametric
and robust techniques. Therefore, if the distributional assumptions can be
validated, they are generally preferred.
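A minimal sketch of the Anderson-Darling normality test in Python using scipy.stats.anderson,
which reports the adjusted statistic and a small table of critical values; the four samples
are generated in the script rather than read from files.

# Minimal sketch: Anderson-Darling test for normality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
samples = {"normal": rng.normal(size=1000),
           "double exponential": rng.laplace(size=1000),
           "cauchy": rng.standard_cauchy(size=1000),
           "lognormal": rng.lognormal(size=1000)}

for name, y in samples.items():
    result = stats.anderson(y, dist='norm')
    # reject normality at the 5% level if the statistic exceeds the
    # critical value tabulated for that level
    crit_5 = result.critical_values[result.significance_level == 5.0][0]
    print(f"{name:20s} A2 = {result.statistic:9.3f}  5% critical value = {crit_5:.3f}")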
Test Statistic: For the chi-square goodness-of-fit computation, the data are divided
into k bins and the test statistic is defined as
χ² = Σ_{i=1}^{k} (Oᵢ - Eᵢ)² / Eᵢ
where Oᵢ is the observed frequency for bin i and Eᵢ is the expected frequency for bin i.
Sample Dataplot generated the following output for the chi-square test where 1,000 random
Output numbers were generated for the normal, double exponential, t with 3 degrees of freedom,
and lognormal distributions. In all cases, the chi-square test was applied to test for a
normal distribution. The test statistics show the characteristics of the test; when the data
are from a normal distribution, the test statistic is small and the hypothesis is accepted;
when the data are from the double exponential, t, and lognormal distributions, the
statistics are significant and the hypothesis of an underlying normal distribution is
rejected at significance levels of 0.10, 0.05, and 0.01.
The normal random numbers were stored in the variable Y1, the double exponential
random numbers were stored in the variable Y2, the t random numbers were stored in the
variable Y3, and the lognormal random numbers were stored in the variable Y4.
*************************************************
** normal chi-square goodness of fit test y1 **
*************************************************
SAMPLE:
NUMBER OF OBSERVATIONS = 1000
NUMBER OF NON-EMPTY CELLS = 24
NUMBER OF PARAMETERS USED = 0
TEST:
CHI-SQUARED TEST STATISTIC = 17.52155
DEGREES OF FREEDOM = 23
CHI-SQUARED CDF VALUE = 0.217101
*************************************************
** normal chi-square goodness of fit test y2 **
*************************************************
SAMPLE:
NUMBER OF OBSERVATIONS = 1000
NUMBER OF NON-EMPTY CELLS = 26
NUMBER OF PARAMETERS USED = 0
TEST:
CHI-SQUARED TEST STATISTIC = 2030.784
DEGREES OF FREEDOM = 25
CHI-SQUARED CDF VALUE = 1.000000
*************************************************
** normal chi-square goodness of fit test y3 **
*************************************************
SAMPLE:
NUMBER OF OBSERVATIONS = 1000
NUMBER OF NON-EMPTY CELLS = 25
NUMBER OF PARAMETERS USED = 0
TEST:
CHI-SQUARED TEST STATISTIC = 103165.4
DEGREES OF FREEDOM = 24
CHI-SQUARED CDF VALUE = 1.000000
(upper critical value at the 1% significance level = 42.97982; the null hypothesis is rejected)
*************************************************
** normal chi-square goodness of fit test y4 **
*************************************************
SAMPLE:
NUMBER OF OBSERVATIONS = 1000
NUMBER OF NON-EMPTY CELLS = 10
NUMBER OF PARAMETERS USED = 0
TEST:
CHI-SQUARED TEST STATISTIC = 1162098.
DEGREES OF FREEDOM = 9
CHI-SQUARED CDF VALUE = 1.000000
As we would hope, the chi-square test does not reject the normality hypothesis for the
normal distribution data set and rejects it for the three non-normal cases.
Questions The chi-square test can be used to answer the following types of questions:
● Are the data from a normal distribution?
Importance Many statistical tests and procedures are based on specific distributional assumptions.
The assumption of normality is particularly common in classical statistical tests. Much
reliability modeling is based on the assumption that the distribution of the data follows a
Weibull distribution.
There are many non-parametric and robust techniques that are not based on strong
distributional assumptions. By non-parametric, we mean a technique, such as the sign
test, that is not based on a specific distributional assumption. By robust, we mean a
statistical technique that performs well under a wide range of distributional assumptions.
However, techniques based on specific distributional assumptions are in general more
powerful than these non-parametric and robust techniques. By power, we mean the ability
to detect a difference when that difference actually exists. Therefore, if the distributional
assumption can be confirmed, the parametric techniques are generally preferred.
If you are using a technique that makes a normality (or some other type of distributional)
assumption, it is important to confirm that this assumption is in fact justified. If it is, the
more powerful parametric techniques can be used. If the distributional assumption is not
justified, a non-parametric or robust technique may be required.
Software Some general purpose statistical software programs, including Dataplot, provide a
chi-square goodness-of-fit test for at least some of the common distributions.
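A minimal sketch of a chi-square goodness-of-fit test for normality in Python with numpy and
scipy; the choice of equal-width bins and the use of the sample mean and standard deviation
are choices of this sketch, not requirements of the test.

# Minimal sketch: chi-square goodness-of-fit test for normality with binned data.
import numpy as np
from scipy import stats

y = np.random.default_rng(6).normal(size=1000)      # placeholder data
k = 24                                               # number of bins
edges = np.linspace(y.min(), y.max(), k + 1)
observed, _ = np.histogram(y, bins=edges)

mu, sigma = y.mean(), y.std(ddof=1)
cdf = stats.norm.cdf(edges, mu, sigma)
expected = len(y) * np.diff(cdf)
expected *= observed.sum() / expected.sum()          # rescale so totals match
# (in practice, bins with very small expected counts should be merged)

chi2_stat = np.sum((observed - expected) ** 2 / expected)
df = k - 1 - 2                                       # two parameters estimated from the data
p_value = stats.chi2.sf(chi2_stat, df)
print(f"chi-square = {chi2_stat:.2f}, df = {df}, p-value = {p_value:.4f}")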
E_N = n(i)/N
where n(i) is the number of points less than Yi and the Yi are ordered from
smallest to largest value. This is a step function that increases by 1/N at the value
of each ordered data point.
The graph below is a plot of the empirical distribution function with a normal
cumulative distribution function for 100 normal random numbers. The K-S test is
based on the maximum distance between these two curves.
Characteristics An attractive feature of this test is that the distribution of the K-S test statistic
and itself does not depend on the underlying cumulative distribution function being
Limitations of tested. Another advantage is that it is an exact test (the chi-square goodness-of-fit
the K-S Test test depends on an adequate sample size for the approximations to be valid).
Despite these advantages, the K-S test has several important limitations:
1. It only applies to continuous distributions.
2. It tends to be more sensitive near the center of the distribution than at the
tails.
3. Perhaps the most serious limitation is that the distribution must be fully
specified. That is, if location, scale, and shape parameters are estimated
from the data, the critical region of the K-S test is no longer valid. It
typically must be determined by simulation.
Due to limitations 2 and 3 above, many analysts prefer to use the
Anderson-Darling goodness-of-fit test. However, the Anderson-Darling test is
only available for a few specific distributions.
Sample Output Dataplot generated the following output for the Kolmogorov-Smirnov test where
1,000 random numbers were generated for a normal, double exponential, t with 3
degrees of freedom, and lognormal distributions. In all cases, the
Kolmogorov-Smirnov test was applied to test for a normal distribution. The
Kolmogorov-Smirnov test accepts the normality hypothesis for the case of normal
data and rejects it for the double exponential, t, and lognormal data with the
exception of the double exponential data being significant at the 0.01 significance
level.
The normal random numbers were stored in the variable Y1, the double
exponential random numbers were stored in the variable Y2, the t random
numbers were stored in the variable Y3, and the lognormal random numbers were
stored in the variable Y4.
*********************************************************
** normal Kolmogorov-Smirnov goodness of fit test y1 **
*********************************************************
TEST:
KOLMOGOROV-SMIRNOV TEST STATISTIC = 0.2414924E-01
*********************************************************
** normal Kolmogorov-Smirnov goodness of fit test y2 **
*********************************************************
TEST:
KOLMOGOROV-SMIRNOV TEST STATISTIC = 0.5140864E-01
*********************************************************
** normal Kolmogorov-Smirnov goodness of fit test y3 **
*********************************************************
TEST:
KOLMOGOROV-SMIRNOV TEST STATISTIC = 0.6119353E-01
*********************************************************
** normal Kolmogorov-Smirnov goodness of fit test y4 **
*********************************************************
TEST:
KOLMOGOROV-SMIRNOV TEST STATISTIC = 0.5354889
Questions The Kolmogorov-Smirnov test can be used to answer the following types of
questions:
● Are the data from a normal distribution?
Importance Many statistical tests and procedures are based on specific distributional
assumptions. The assumption of normality is particularly common in classical
statistical tests. Much reliability modeling is based on the assumption that the
data follow a Weibull distribution.
There are many non-parametric and robust techniques that are not based on strong
distributional assumptions. By non-parametric, we mean a technique, such as the
sign test, that is not based on a specific distributional assumption. By robust, we
mean a statistical technique that performs well under a wide range of
distributional assumptions. However, techniques based on specific distributional
assumptions are in general more powerful than these non-parametric and robust
techniques. By power, we mean the ability to detect a difference when that
difference actually exists. Therefore, if the distributional assumptions can be
confirmed, the parametric techniques are generally preferred.
If you are using a technique that makes a normality (or some other type of
distributional) assumption, it is important to confirm that this assumption is in
fact justified. If it is, the more powerful parametric techniques can be used. If the
distributional assumption is not justified, using a non-parametric or robust
technique may be required.
Software Some general purpose statistical software programs, including Dataplot, support
the Kolmogorov-Smirnov goodness-of-fit test, at least for some of the more
common distributions.
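A minimal sketch of the Kolmogorov-Smirnov normality test in Python using scipy.stats.kstest.
In line with limitation 3 above, the reference distribution is fully specified (the standard
normal) rather than fitted from the data.

# Minimal sketch: Kolmogorov-Smirnov test against a fully specified normal.
import numpy as np
from scipy import stats

y = np.random.default_rng(7).normal(size=1000)          # placeholder data
statistic, p_value = stats.kstest(y, 'norm')            # compares against N(0, 1)
print(f"K-S statistic = {statistic:.4f}, p-value = {p_value:.4f}")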
Sample Dataplot generated the following output for the ZARR13.DAT data set
Output showing that Grubbs' test finds no outliers in the dataset:
*********************
** grubbs test y **
*********************
1. STATISTICS:
NUMBER OF OBSERVATIONS = 195
MINIMUM = 9.196848
MEAN = 9.261460
MAXIMUM = 9.327973
STANDARD DEVIATION = 0.2278881E-01
Importance Many statistical techniques are sensitive to the presence of outliers. For
example, simple calculations of the mean and standard deviation may
be distorted by a single grossly inaccurate data point.
Checking for outliers should be a routine part of any data analysis.
Potential outliers should be examined to see if they are possibly
erroneous. If the data point is in error, it should be corrected if possible
and deleted if it is not possible. If there is no reason to believe that the
outlying point is in error, it should not be deleted without careful
consideration. However, the use of more robust techniques may be
warranted. Robust techniques will often downweight the effect of
outlying points without deleting them.
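A minimal sketch of Grubbs' test for a single outlier in Python with numpy and scipy; the
statistic is the largest absolute deviation from the mean in standard deviation units, and the
critical value uses the usual t-based formula. The data are synthetic placeholders for
ZARR13.DAT.

# Minimal sketch: Grubbs' test for a single outlier (two-sided).
import numpy as np
from scipy import stats

y = np.random.default_rng(8).normal(9.26, 0.023, size=195)   # placeholder data
n = len(y)
g = np.max(np.abs(y - y.mean())) / y.std(ddof=1)             # Grubbs statistic

alpha = 0.05
t = stats.t.ppf(1 - alpha / (2 * n), n - 2)                  # t critical value
g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
print(f"G = {g:.3f}, critical value = {g_crit:.3f}, outlier detected: {g > g_crit}")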
Yates Before performing a Yates analysis, the data should be arranged in "Yates order". That
Order is, given k factors, the kth column consists of 2^(k-1) minus signs (i.e., the low level of the
factor) followed by 2^(k-1) plus signs (i.e., the high level of the factor). For example, for
a full factorial design with three factors, the design matrix is
- - -
+ - -
- + -
+ + -
- - +
+ - +
- + +
+ + +
Determining the Yates order for fractional factorial designs requires knowledge of the
confounding structure of the fractional factorial design.
Yates
A Yates analysis generates the following output.
Output
1. A factor identifier (from Yates order). The specific identifier will vary
depending on the program used to generate the Yates analysis. Dataplot, for
example, uses the following for a 3-factor model.
1 = factor 1
2 = factor 2
3 = factor 3
12 = interaction of factor 1 and factor 2
13 = interaction of factor 1 and factor 3
23 = interaction of factor 2 and factor 3
123 = interaction of factors 1, 2, and 3
2. Least squares estimated factor effects ordered from largest in magnitude (most
significant) to smallest in magnitude (least significant).
That is, we obtain a ranked list of important factors.
3. A t-value for the individual factor effect estimates. The t-value is computed as
   t = e / s_e
   where e is the estimated factor effect and s_e is the standard deviation of the
   estimated factor effect.
4. The residual standard deviation that results from the model with the single term
only. That is, the residual standard deviation from the model
response = constant + 0.5 (Xi)
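A minimal sketch of the Yates algorithm for an unreplicated 2^3 full factorial in Python with
numpy; the responses must be supplied in Yates (standard) order, and the numbers used here are
placeholders rather than the Eddy current data.

# Minimal sketch: Yates algorithm for effect estimates from a 2^k full factorial.
import numpy as np

def yates(response):
    y = np.asarray(response, dtype=float)
    k = int(np.log2(len(y)))
    col = y.copy()
    for _ in range(k):
        pairs = col.reshape(len(col) // 2, 2)
        # first half: sums of adjacent pairs; second half: differences (high - low)
        col = np.concatenate([pairs.sum(axis=1), pairs[:, 1] - pairs[:, 0]])
    mean = col[0] / 2 ** k
    effects = col[1:] / 2 ** (k - 1)
    return mean, effects

labels = ["1", "2", "12", "3", "13", "23", "123"]     # Yates order of effects for k = 3
mean, effects = yates([1.0, 4.1, 0.9, 3.2, 1.1, 4.0, 1.0, 3.3])   # placeholder responses
print("grand mean:", mean)
for lab, e in zip(labels, effects):
    print(f"effect {lab:>3s} = {e: .4f}")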
Sample Dataplot generated the following Yates analysis output for the Eddy current data set:
Output
Interpretation In summary, the Yates analysis provides us with the following ranked
of Sample list of important factors along with the estimated effect estimate.
Output 1. X1: effect estimate = 3.1025 ohms
2. X2: effect estimate = -0.8675 ohms
3. X2*X3: effect estimate = 0.2975 ohms
4. X1*X3: effect estimate = 0.2475 ohms
5. X3: effect estimate = 0.2125 ohms
6. X1*X2*X3: effect estimate = 0.1425 ohms
7. X1*X2: effect estimate = 0.1275 ohms
Model From the above Yates output, we can define the potential models from
Selection and the Yates analysis. An important component of a Yates analysis is
Validation selecting the best model from the available potential models.
Once a tentative model has been selected, the error term should follow
the assumptions for a univariate measurement process. That is, the
model should be validated by analyzing the residuals.
Graphical Some analysts may prefer a more graphical presentation of the Yates
Presentation results. In particular, the following plots may be useful:
1. Ordered data plot
2. Ordered absolute effects plot
3. Cumulative residual standard deviation plot
Questions The Yates analysis can be used to answer the following questions:
1. What is the ranked list of factors?
2. What is the goodness-of-fit (as measured by the residual
standard deviation) for the various models?
Case Study The Yates analysis is demonstrated in the Eddy current case study.
Yates For convenience, we list the sample Yates output for the Eddy current data set here.
Table
The last column of the Yates table gives the residual standard deviation for 8 possible
models, each with one more term than the previous model.
Potential For this example, we can summarize the possible prediction equations using the second
Models and last columns of the Yates table:
● The model consisting of only the overall mean, Ŷ = 2.65875,
has a residual standard deviation of 1.74106 ohms. Note that this is the default
model. That is, if no factors are important, the model is simply the overall mean.
● The model containing all seven terms (all main effects and interactions)
has a residual standard deviation of 0.0 ohms. Note that the model with all
possible terms included will have a zero residual standard deviation. This will
always occur with an unreplicated two-level factorial design.
Model The above step lists all the potential models. From this list, we want to select the most
Selection appropriate model. This requires balancing the following two goals.
1. We want the model to include all important factors.
2. We want the model to be parsimonious. That is, the model should be as simple as
possible.
Note that the residual standard deviation alone is insufficient for determining the most
appropriate model as it will always be decreased by adding additional factors. The next
section describes a number of approaches for determining which factors (and
interactions) to include in the model.
Criteria for The seven criteria that we can use in determining whether to keep a factor in the model can be
Including summarized as follows.
Terms in the 1. Effects: Engineering Significance
Model
2. Effects: Order of Magnitude
3. Effects: Statistical Significance
4. Effects: Probability Plots
5. Averages: Youden Plot
6. Residual Standard Deviation: Engineering Significance
7. Residual Standard Deviation: Statistical Significance
The first four criteria focus on effect estimates: three numeric criteria and one graphical
criterion. The fifth criterion focuses on averages. The last two criteria focus on the residual standard
deviation of the model. We discuss each of these seven criteria in detail in the following sections.
The last section summarizes the conclusions based on all of the criteria.
That is, declare a factor as important if its effect is more than 2 standard deviations away from 0
(0, by definition, meaning "no effect").
The "2" comes from normal theory (more specifically, a value of 1.96 yields a 95% confidence
interval). More precise values would come from t-distribution theory.
The difficulty with this is that in order to invoke this criterion we need the standard deviation, σ,
of an observation. This is problematic because
1. the engineer may not know σ;
2. the experiment might not have replication, and so a model-free estimate of σ is not
obtainable;
3. obtaining an estimate of σ by invoking the sometimes-employed assumption that 3-term
interactions and higher can be ignored may be incorrect from an engineering point of view.
For the Eddy current example:
1. the engineer did not know σ;
2. the design (a 2^3 full factorial) did not have replication;
This results in keeping three terms: X1 (3.10250), X2 (-.86750), and X2*X3 (.29750).
Normal The following plots show the normal probability plot of the effect estimates and the
Probability half-normal probability plot of the absolute values of the estimates for the Eddy current data.
Plot of
Effects and
Half-Normal
Probability
Plot of
Effects
For the example at hand, both probability plots clearly show two factors displaced off the line,
and from the third plot (with factor tags included), we see that those two factors are factor 1 and
factor 2. All of the remaining five effects are behaving like random drawings from a normal
distribution centered at zero, and so are deemed to be statistically non-significant. In conclusion,
this rule keeps two factors: X1 (3.10250) and X2 (-.86750).
Effects: A Youden plot can be used in the following way. Keep a factor as "important" if it is displaced
Youden Plot away from the central-tendency "bunch" in a Youden plot of high and low averages. By
definition, a factor is important when its average response for the low (-1) setting is significantly
different from its average response for the high (+1) setting. Conversely, if the low and high
averages are about the same, then what difference does it make which setting to use and so why
would such a factor be considered important? This fact in combination with the intrinsic benefits
of the Youden plot for comparing pairs of items leads to the technique of generating a Youden
plot of the low and high averages.
Youden Plot The following is the Youden plot of the effect estimates for the Eddy current data.
of Effect
Estimates
For the example at hand, the Youden plot clearly shows a cluster of points near the grand average
(2.65875) with two displaced points above (factor 1) and below (factor 2). Based on the Youden
plot, we conclude to keep two factors: X1 (3.10250) and X2 (-.86750).
In other words, how good does the engineer want the prediction equation to be? Unfortunately,
this engineering specification has not always been formulated and so this criterion can become
moot.
In the absence of a prior specified cutoff, a good rough rule for the minimum engineering residual
standard deviation is to keep adding terms until the residual standard deviation just dips below,
say, 5% of the current production average. For the Eddy current data, let's say that the average
detector has a sensitivity of 2.5 ohms. Then this would suggest that we would keep adding terms
to the model until the residual standard deviation falls below 5% of 2.5 ohms = 0.125 ohms.
Based on the minimum residual standard deviation criteria, and by scanning the far right column
of the Yates table, we would conclude to keep the following terms:
1. X1 (with a cumulative residual standard deviation = 0.57272)
2. X2 (with a cumulative residual standard deviation = 0.30429)
3. X2*X3 (with a cumulative residual standard deviation = 0.26737)
4. X1*X3 (with a cumulative residual standard deviation = 0.23341)
5. X3 (with a cumulative residual standard deviation = 0.19121)
6. X1*X2*X3 (with a cumulative residual standard deviation = 0.18031)
7. X1*X2 (with a cumulative residual standard deviation = 0.00000)
Note that we must include all terms in order to drive the residual standard deviation below 0.125.
Again, the 5% rule is a rough-and-ready rule that has no basis in engineering or statistics, but is
simply a "numerics". Ideally, the engineer has a better cutoff for the residual standard deviation
that is based on how well he/she wants the equation to perform in practice. If such a number were
available, then for this criterion and data set we would select something less than the entire
collection of terms.
Conclusions In summary, the seven criteria for specifying "important" factors yielded the following for the
Eddy current data:
1. Effects, Engineering Significance: X1, X2
2. Effects, Numerically Significant: X1, X2
3. Effects, Statistically Significant: X1, X2, X2*X3
4. Effects, Probability Plots: X1, X2
5. Averages, Youden Plot: X1, X2
6. Residual SD, Engineering Significance: all 7 terms
7. Residual SD, Statistical Significance: not applicable
Such conflicting results are common. Arguably, the three most important criteria (listed in order
of most important) are:
4. Effects, Probability Plots: X1, X2
1. Effects, Engineering Significance: X1, X2
6. Residual SD, Engineering Significance: all 7 terms
Scanning all of the above, we thus declare the following consensus for the Eddy current data:
1. Important Factors: X1 and X2
2. Parsimonious Prediction Equation:
   Ŷ = 2.65875 + (3.10250/2) X1 - (0.86750/2) X2
where j represents all possible values that x can have and pj is the
probability at xj.
Probability For a continuous function, the probability density function (pdf) is the
Density probability that the variate has the value x. Since for continuous
Function distributions the probability at a single point is zero, this is often
expressed in terms of an integral between two points.
For a discrete distribution, the pdf is the probability that the variate takes
the value x.
Cumulative The cumulative distribution function (cdf) is the probability that the
Distribution variable takes a value less than or equal to x. That is
Function
The horizontal axis is the allowable domain for the given probability
function. Since the vertical axis is a probability, it must fall between
zero and one. It increases from zero to one as we go from left to right on
the horizontal axis.
Percent The percent point function (ppf) is the inverse of the cumulative
Point distribution function. For this reason, the percent point function is also
Function commonly referred to as the inverse distribution function. That is, for a
distribution function we calculate the probability that the variable is less
than or equal to x for a given x. For the percent point function, we start
with the probability and compute the corresponding x for the cumulative
distribution. Mathematically, this can be expressed as
or alternatively
Since the horizontal axis is a probability, it goes from zero to one. The
vertical axis goes from the smallest to the largest value of the
cumulative distribution function.
Hazard The hazard function is the ratio of the probability density function to the
Function survival function, S(x). It can be interpreted as the instantaneous failure rate
at x given survival up to x.
Cumulative The cumulative hazard function is the integral of the hazard function up
Hazard to x.
Function
Survival Survival functions are most often used in reliability and related fields.
Function The survival function is the probability that the variate takes a value
greater than x.
Inverse Just as the percent point function is the inverse of the cumulative
Survival distribution function, the survival function also has an inverse function.
Function The inverse survival function can be defined in terms of the percent
point function.
PPCC Plots The PPCC plot is an effective graphical tool for selecting the member of
a distributional family with a single shape parameter that best fits a
given set of data.
Location The next plot shows the probability density function for a normal
Parameter distribution with a location parameter of 10 and a scale parameter of 1.
Scale
Parameter
In contrast, the next graph has a scale parameter of 1/3 (=0.333). The
effect of this scale parameter is to squeeze the pdf. That is, the
maximum y value is approximately 1.2 as opposed to 0.4 and the y
value is near zero at (+/-) 1 as opposed to (+/-) 3.
The effect of a scale parameter greater than one is to stretch the pdf. The
greater the magnitude, the greater the stretching. The effect of a scale
parameter less than one is to compress the pdf. The compressing
approaches a spike as the scale parameter goes to zero. A scale parameter of
exactly one leaves the pdf unchanged.
Location The following graph shows the effect of both a location and a scale
and Scale parameter. The plot has been shifted right 10 units and stretched by a
Together factor of 3.
Standard The standard form of any distribution is the form that has location
Form parameter zero and scale parameter one.
It is common in statistical software packages to only compute the
standard form of the distribution. There are formulas for converting
from the standard form to the form with other location and scale
parameters. These formulas are independent of the particular probability
distribution.
Formulas The following are the formulas for computing various probability
for Location functions based on the standard form of the distribution. The parameter
and Scale a refers to the location parameter and the parameter b refers to the scale
Based on parameter. Shape parameters are not included.
the Standard
Form
Cumulative Distribution Function F(x;a,b) = F((x-a)/b;0,1)
Probability Density Function f(x;a,b) = (1/b)f((x-a)/b;0,1)
Percent Point Function G(α;a,b) = a + bG(α;0,1)
Hazard Function h(x;a,b) = (1/b)h((x-a)/b;0,1)
Cumulative Hazard Function H(x;a,b) = H((x-a)/b;0,1)
Survival Function S(x;a,b) = S((x-a)/b;0,1)
Inverse Survival Function Z(α;a,b) = a + bZ(α;0,1)
Random Numbers Y(a,b) = a + bY(0,1)
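A minimal sketch in Python showing that the location/scale relations above are what scipy
implements through its loc and scale arguments; evaluating the shifted and scaled normal pdf
directly agrees with transforming the standard form.

# Minimal sketch: location/scale transformation of a standard distribution.
import numpy as np
from scipy import stats

a, b = 10.0, 3.0                                # location and scale
x = np.linspace(0.0, 20.0, 5)

standard = stats.norm.pdf((x - a) / b) / b      # (1/b) f((x-a)/b; 0, 1)
shifted = stats.norm.pdf(x, loc=a, scale=b)     # f(x; a, b)
print(np.allclose(standard, shifted))           # True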
Relationship For the normal distribution, the location and scale parameters
to Mean and correspond to the mean and standard deviation, respectively. However,
Standard this is not necessarily true for other distributions. In fact, it is not true
Deviation for most distributions.
Various There are various methods, both numerical and graphical, for estimating
Methods the parameters of a probability distribution.
1. Method of moments
2. Maximum likelihood
3. Least squares
4. PPCC and probability plots
Software Most general purpose statistical software does not include explicit
method of moments parameter estimation commands. However, when
utilized, the method of moment formulas tend to be straightforward and
can be easily implemented in most statistical software programs.
Case Study The airplane glass failure time case study demonstrates the use of the
PPCC and probability plots in finding the best distributional model
and the parameter estimation of the distributional model.
Other For reliability applications, the hazard plot and the Weibull plot are
Graphical alternative graphical methods that are commonly used to estimate
Methods parameters.
Continuous
Distributions
Discrete
Distributions
where μ is the location parameter and σ is the scale parameter. The case
where μ = 0 and σ = 1 is called the standard normal distribution. The
equation for the standard normal distribution is
f(x) = e^{-x²/2} / sqrt(2π)
Cumulative The formula for the cumulative distribution function of the normal
Distribution distribution does not exist in a simple closed formula. It is computed
Function numerically.
The following is the plot of the normal cumulative distribution function.
Percent The formula for the percent point function of the normal distribution
Point does not exist in a simple closed formula. It is computed numerically.
Function
The following is the plot of the normal percent point function.
Hazard The formula for the hazard function of the normal distribution is
Function
h(x) = φ(x) / (1 - Φ(x))
where φ is the standard normal probability density function and Φ is the
standard normal cumulative distribution function.
Cumulative The normal cumulative hazard function can be computed from the
Hazard normal cumulative distribution function.
Function
The following is the plot of the normal cumulative hazard function.
Survival The normal survival function can be computed from the normal
Function cumulative distribution function.
The following is the plot of the normal survival function.
Inverse The normal inverse survival function can be computed from the normal
Survival percent point function.
Function
The following is the plot of the normal inverse survival function.
Parameter The location and scale parameters of the normal distribution can be
Estimation estimated with the sample mean and sample standard deviation,
respectively.
Comments For both theoretical and practical reasons, the normal distribution is
probably the most important distribution in statistics. For example,
● Many classical statistical tests are based on the assumption that
the data follow a normal distribution. This assumption should be
tested before applying these tests.
● In modeling applications, such as linear and non-linear regression,
the error term is often assumed to follow a normal distribution
with fixed location and scale.
● The normal distribution is used to find significance levels in many
hypothesis tests and confidence intervals.
Theoretical The normal distribution is widely used. Part of the appeal is that it is
Justification well behaved and mathematically tractable. However, the central limit
- Central theorem provides a theoretical basis for why it has wide applicability.
Limit
Theorem The central limit theorem basically states that as the sample size (N)
becomes large, the following occur:
1. The sampling distribution of the mean becomes approximately
normal regardless of the distribution of the original variable.
2. The sampling distribution of the mean is centered at the
population mean, μ, of the original variable. In addition, the
standard deviation of the sampling distribution of the mean
approaches σ/sqrt(N), where σ is the standard deviation of the original variable.
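A minimal simulation sketch of the central limit theorem in Python with numpy: means of
samples drawn from a skewed exponential distribution have a mean near the population mean and
a standard deviation near σ/sqrt(N).

# Minimal sketch: central limit theorem by simulation.
import numpy as np

rng = np.random.default_rng(9)
N = 50                                                       # sample size
means = rng.exponential(scale=1.0, size=(10000, N)).mean(axis=1)

print("mean of sample means      :", means.mean())           # close to 1.0
print("sd of sample means        :", means.std(ddof=1))      # close to 1.0/sqrt(50)
print("theoretical sigma/sqrt(N) :", 1.0 / np.sqrt(N))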
where A is the location parameter and (B - A) is the scale parameter. The case
where A = 0 and B = 1 is called the standard uniform distribution. The
equation for the standard uniform distribution is
f(x) = 1   for 0 ≤ x ≤ 1
Cumulative The formula for the cumulative distribution function of the uniform
Distribution distribution is
Function
Percent The formula for the percent point function of the uniform distribution is
Point
Function
The following is the plot of the uniform percent point function.
Hazard The formula for the hazard function of the uniform distribution is
Function
Cumulative The formula for the cumulative hazard function of the uniform distribution is
Hazard
Function
The following is the plot of the uniform cumulative hazard function.
Survival The uniform survival function can be computed from the uniform cumulative
Function distribution function.
The following is the plot of the uniform survival function.
Inverse The uniform inverse survival function can be computed from the uniform
Survival percent point function.
Function
The following is the plot of the uniform inverse survival function.
Coefficient of Variation
Coefficient of Skewness 0
Coefficient of Kurtosis 9/5
Comments The uniform distribution defines equal probability over a given range for a
continuous distribution. For this reason, it is important as a reference
distribution.
One of the most important applications of the uniform distribution is in the
generation of random numbers. That is, almost all random number generators
generate random numbers on the (0,1) interval. For other distributions, some
transformation is applied to the uniform random numbers.
where t is the location parameter and s is the scale parameter. The case
where t = 0 and s = 1 is called the standard Cauchy distribution. The
equation for the standard Cauchy distribution reduces to
f(x) = 1 / (π(1 + x²))
Cumulative The formula for the cumulative distribution function for the Cauchy
Distribution distribution is
Function
Percent The formula for the percent point function of the Cauchy distribution is
Point
Function
The following is the plot of the Cauchy percent point function.
Hazard The Cauchy hazard function can be computed from the Cauchy
Function probability density and cumulative distribution functions.
The following is the plot of the Cauchy hazard function.
Cumulative The Cauchy cumulative hazard function can be computed from the
Hazard Cauchy cumulative distribution function.
Function
The following is the plot of the Cauchy cumulative hazard function.
Survival The Cauchy survival function can be computed from the Cauchy
Function cumulative distribution function.
The following is the plot of the Cauchy survival function.
Inverse The Cauchy inverse survival function can be computed from the Cauchy
Survival percent point function.
Function
The following is the plot of the Cauchy inverse survival function.
Parameter The likelihood functions for the Cauchy maximum likelihood estimates
Estimation are given in chapter 16 of Johnson, Kotz, and Balakrishnan. These
equations typically must be solved numerically on a computer.
1.3.6.6.4. t Distribution
Probability The formula for the probability density function of the t distribution is
Density
Function
These plots all have a similar shape. The difference is in the heaviness
of the tails. In fact, the t distribution with ν equal to 1 is a Cauchy
distribution. The t distribution approaches a normal distribution as ν
becomes large. The approximation is quite good for values of ν > 30.
Cumulative The formula for the cumulative distribution function of the t distribution
Distribution is complicated and is not included here. It is given in the Evans,
Function Hastings, and Peacock book.
The following are the plots of the t cumulative distribution function with
the same values of ν as the pdf plots above.
Percent The formula for the percent point function of the t distribution does not
Point exist in a simple closed form. It is computed numerically.
Function
The following are the plots of the t percent point function with the same
values of ν as the pdf plots above.
Other Since the t distribution is typically used to develop hypothesis tests and
Probability confidence intervals and rarely for modeling applications, we omit the
Functions formulas and plots for the hazard, cumulative hazard, survival, and
inverse survival probability functions.
Parameter Since the t distribution is typically used to develop hypothesis tests and
Estimation confidence intervals and rarely for modeling applications, we omit any
discussion of parameter estimation.
Comments The t distribution is used in many cases for the critical regions for
hypothesis tests and in determining confidence intervals. The most
common example is testing if data are consistent with the assumed
process mean.
1.3.6.6.5. F Distribution
Probability The F distribution is the ratio of two chi-square distributions with
Density degrees of freedom ν1 and ν2, respectively. The formula for the
Function probability density function of the F distribution is
where ν1 and ν2 are the shape parameters and Γ is the gamma function.
The formula for the gamma function is
Γ(a) = ∫_0^∞ t^{a-1} e^{-t} dt
Percent The formula for the percent point function of the F distribution does not
Point exist in a simple closed form. It is computed numerically.
Function
The following is the plot of the F percent point function with the same
values of ν1 and ν2 as the pdf plots above.
Other Since the F distribution is typically used to develop hypothesis tests and
Probability confidence intervals and rarely for modeling applications, we omit the
Functions formulas and plots for the hazard, cumulative hazard, survival, and
inverse survival probability functions.
Common The formulas below are for the case where the location parameter is
Statistics zero and the scale parameter is one.
Mean
Mode
Coefficient of
Variation
Coefficient of
Skewness
Parameter Since the F distribution is typically used to develop hypothesis tests and
Estimation confidence intervals and rarely for modeling applications, we omit any
discussion of parameter estimation.
Comments The F distribution is used in many cases for the critical regions for
hypothesis tests and in determining confidence intervals. Two common
examples are the analysis of variance and the F test to determine if the
variances of two populations are equal.
Cumulative The formula for the cumulative distribution function of the chi-square
Distribution distribution is
Function
Percent The formula for the percent point function of the chi-square distribution
Point does not exist in a simple closed form. It is computed numerically.
Function
The following is the plot of the chi-square percent point function with
the same values of ν as the pdf plots above.
Common Mean ν
Statistics Median approximately ν - 2/3 for large ν
Mode ν - 2 (for ν ≥ 2)
Range 0 to positive infinity
Standard Deviation sqrt(2ν)
Coefficient of Variation sqrt(2/ν)
Coefficient of Skewness sqrt(8/ν)
Coefficient of Kurtosis 3 + 12/ν
Comments The chi-square distribution is used in many cases for the critical regions
for hypothesis tests and in determining confidence intervals. Two
common examples are the chi-square test for independence in an RxC
contingency table and the chi-square test to determine if the standard
deviation of a population is equal to a pre-specified value.
Cumulative The formula for the cumulative distribution function of the exponential
Distribution distribution is
Function
Percent The formula for the percent point function of the exponential
Point distribution is
Function
Hazard The formula for the hazard function of the exponential distribution is
Function
Cumulative The formula for the cumulative hazard function of the exponential
Hazard distribution is
Function
Survival The formula for the survival function of the exponential distribution is
Function
Inverse The formula for the inverse survival function of the exponential
Survival distribution is
Function
Common Mean β (the scale parameter)
Statistics Median β ln 2
Mode Zero
Range Zero to plus infinity
Standard Deviation β
Coefficient of Variation 1
Coefficient of Skewness 2
Coefficient of Kurtosis 9
Parameter For the full sample case, the maximum likelihood estimator of the scale
Estimation parameter is the sample mean. Maximum likelihood estimation for the
exponential distribution is discussed in the chapter on reliability
(Chapter 8). It is also discussed in chapter 19 of Johnson, Kotz, and
Balakrishnan.
where γ is the shape parameter, μ is the location parameter and α is the scale
parameter. The case where μ = 0 and α = 1 is called the standard Weibull
distribution. The case where μ = 0 is called the 2-parameter Weibull distribution.
The equation for the standard Weibull distribution reduces to
f(x) = γ x^{γ-1} e^{-x^γ}    for x ≥ 0, γ > 0
Since the general form of probability functions can be expressed in terms of the
standard distribution, all subsequent formulas in this section are given for the
standard form of the function.
The following is the plot of the Weibull probability density function.
Cumulative The formula for the cumulative distribution function of the Weibull distribution is
Distribution
Function
The following is the plot of the Weibull cumulative distribution function with the
same values of as the pdf plots above.
Percent The formula for the percent point function of the Weibull distribution is
Point
Function
The following is the plot of the Weibull percent point function with the same
values of as the pdf plots above.
Hazard The formula for the hazard function of the Weibull distribution is
Function
The following is the plot of the Weibull hazard function with the same values of
as the pdf plots above.
Cumulative The formula for the cumulative hazard function of the Weibull distribution is
Hazard
Function
The following is the plot of the Weibull cumulative hazard function with the same
values of γ as the pdf plots above.
Survival The formula for the survival function of the Weibull distribution is
Function
The following is the plot of the Weibull survival function with the same values of
γ as the pdf plots above.
Inverse The formula for the inverse survival function of the Weibull distribution is
Survival
Function
The following is the plot of the Weibull inverse survival function with the same
values of γ as the pdf plots above.
Common      The formulas below are with the location parameter equal to zero and the scale
Statistics  parameter equal to one, with γ denoting the shape parameter and Γ the gamma function.
            Mean                      Γ(1 + 1/γ)
            Median                    (ln 2)^(1/γ)
            Mode                      (1 - 1/γ)^(1/γ) for γ > 1; 0 for γ ≤ 1
            Coefficient of Variation  sqrt(Γ(1 + 2/γ) - Γ(1 + 1/γ)^2) / Γ(1 + 1/γ)
Parameter Maximum likelihood estimation for the Weibull distribution is discussed in the
Estimation Reliability chapter (Chapter 8). It is also discussed in Chapter 21 of Johnson, Kotz,
and Balakrishnan.
Software Most general purpose statistical software programs, including Dataplot, support at
least some of the probability functions for the Weibull distribution.
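As an illustration, the standard Weibull probability functions can be evaluated in Python with scipy (an assumption about tooling, not part of the handbook); scipy's shape argument c plays the role of the shape parameter γ.

# Sketch: standard (2-parameter) Weibull probability functions for shape gamma.
from scipy.stats import weibull_min

gamma = 1.5
x = 2.0
print(weibull_min.cdf(x, c=gamma))     # cumulative distribution function
print(weibull_min.ppf(0.5, c=gamma))   # percent point function (here, the median)
print(weibull_min.sf(x, c=gamma))      # survival function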
Cumulative The formula for the cumulative distribution function of the lognormal
Distribution distribution is
Function
Percent The formula for the percent point function of the lognormal distribution
Point is
Function
Hazard The formula for the hazard function of the lognormal distribution is
Function
Cumulative The formula for the cumulative hazard function of the lognormal
Hazard distribution is
Function
Survival The formula for the survival function of the lognormal distribution is
Function
The following is the plot of the lognormal survival function with the
same values of σ as the pdf plots above.
Inverse The formula for the inverse survival function of the lognormal
Survival distribution is
Function
Common The formulas below are with the location parameter equal to zero and
Statistics the scale parameter equal to one.
Mean
Median Scale parameter m (= 1 if scale parameter not
specified).
Mode
Coefficient of
Skewness
Coefficient of
Kurtosis
Coefficient of
Variation
Parameter The maximum likelihood estimates for the scale parameter, m, and the
Estimation shape parameter, σ, are the geometric mean of the data and the standard
deviation (computed with divisor N) of the log-transformed data, respectively.
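A minimal sketch of these estimates in Python (the data values are illustrative):

# Sketch: lognormal MLEs -- the scale (median) estimate is the geometric mean of
# the data; the shape estimate is the standard deviation (divisor N) of the log data.
import numpy as np

x = np.array([3.1, 5.7, 2.4, 8.9, 4.2])                  # hypothetical positive data
logx = np.log(x)
m_hat = np.exp(logx.mean())                              # scale (median) estimate
sigma_hat = np.sqrt(np.mean((logx - logx.mean())**2))    # shape estimate
print(m_hat, sigma_hat)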
Since the general form of probability functions can be expressed in terms of the
standard distribution, all subsequent formulas in this section are given for the
standard form of the function.
The following is the plot of the fatigue life probability density function.
Cumulative The formula for the cumulative distribution function of the fatigue life
Distribution distribution is
Function
Percent The formula for the percent point function of the fatigue life distribution is
Point
Function
where Φ^(-1) denotes the percent point function (inverse CDF) of the standard normal distribution. The
following is the plot of the fatigue life percent point function with the same
values of γ as the pdf plots above.
Hazard The fatigue life hazard function can be computed from the fatigue life probability
Function density and cumulative distribution functions.
The following is the plot of the fatigue life hazard function with the same values
of γ as the pdf plots above.
Cumulative The fatigue life cumulative hazard function can be computed from the fatigue life
Hazard cumulative distribution function.
Function
The following is the plot of the fatigue life cumulative hazard function with the same
values of γ as the pdf plots above.
Survival The fatigue life survival function can be computed from the fatigue life
Function cumulative distribution function.
The following is the plot of the fatigue life survival function with the same values of
γ as the pdf plots above.
Inverse The fatigue life inverse survival function can be computed from the fatigue life
Survival percent point function.
Function
The following is the plot of the fatigue life inverse survival function with the same
values of γ as the pdf plots above.
Common The formulas below are with the location parameter equal to zero and the scale
Statistics parameter equal to one.
Mean
Coefficient of Variation
Parameter Maximum likelihood estimation for the fatigue life distribution is discussed in the
Estimation Reliability chapter.
Comments The fatigue life distribution is used extensively in reliability applications to model
failure times.
Software Some general purpose statistical software programs, including Dataplot, support
at least some of the probability functions for the fatigue life distribution. Support
for this distribution is likely to be available for statistical programs that
emphasize reliability applications.
Cumulative The formula for the cumulative distribution function of the gamma
Distribution distribution is
Function
Percent The formula for the percent point function of the gamma distribution
Point does not exist in a simple closed form. It is computed numerically.
Function
The following is the plot of the gamma percent point function with the
same values of γ as the pdf plots above.
Hazard The formula for the hazard function of the gamma distribution is
Function
The following is the plot of the gamma hazard function with the same
values of γ as the pdf plots above.
Cumulative The formula for the cumulative hazard function of the gamma
Hazard distribution is
Function
Survival The formula for the survival function of the gamma distribution is
Function
Inverse The gamma inverse survival function does not exist in a simple closed
Survival form. It is computed numerically.
Function
The following is the plot of the gamma inverse survival function with
the same values of γ as the pdf plots above.
Common      The formulas below are with the location parameter equal to zero and
Statistics  the scale parameter equal to one, with γ denoting the shape parameter.
            Mean                      γ
            Mode                      γ - 1 (for γ ≥ 1)
            Range                     Zero to positive infinity.
            Standard Deviation        sqrt(γ)
            Skewness                  2/sqrt(γ)
            Kurtosis                  3 + 6/γ
            Coefficient of Variation  1/sqrt(γ)
where and s are the sample mean and standard deviation, respectively.
The equations for the maximum likelihood estimation of the shape and
scale parameters are given in Chapter 18 of Evans, Hastings, and
Peacock and Chapter 17 of Johnson, Kotz, and Balakrishnan. These
equations need to be solved numerically; this is typically accomplished
by using statistical software packages.
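As an illustration, scipy's built-in fitter solves the gamma likelihood equations numerically (a sketch; the data values and the use of scipy are assumptions of this example, not part of the handbook):

# Sketch: numerical maximum likelihood fit of the gamma shape and scale parameters,
# with the location parameter fixed at zero.
import numpy as np
from scipy.stats import gamma

x = np.array([2.3, 1.1, 4.5, 3.2, 2.8, 5.1, 1.9])     # hypothetical data
shape_hat, loc_hat, scale_hat = gamma.fit(x, floc=0)   # location constrained to zero
print(shape_hat, scale_hat)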
Cumulative The formula for the cumulative distribution function of the double
Distribution exponential distribution is
Function
Percent The formula for the percent point function of the double exponential
Point distribution is
Function
Hazard The formula for the hazard function of the double exponential
Function distribution is
Cumulative The formula for the cumulative hazard function of the double
Hazard exponential distribution is
Function
Survival The double exponential survival function can be computed from the
Function cumulative distribution function of the double exponential distribution.
The following is the plot of the double exponential survival function.
Inverse The formula for the inverse survival function of the double exponential
Survival distribution is
Function
Common      Mean                      μ (the location parameter)
Statistics  Median                    μ
            Mode                      μ
            Range                     Negative infinity to positive infinity
            Standard Deviation        β·sqrt(2) (β is the scale parameter)
            Skewness                  0
            Kurtosis                  6
            Coefficient of Variation  β·sqrt(2)/μ
Parameter The maximum likelihood estimators of the location and scale parameters
Estimation of the double exponential distribution are
Cumulative The formula for the cumulative distribution function of the power
Distribution normal distribution is
Function
Percent The formula for the percent point function of the power normal
Point distribution is
Function
Hazard The formula for the hazard function of the power normal distribution is
Function
The following is the plot of the power normal hazard function with the
same values of p as the pdf plots above.
Cumulative The formula for the cumulative hazard function of the power normal
Hazard distribution is
Function
Survival The formula for the survival function of the power normal distribution is
Function
The following is the plot of the power normal survival function with the
same values of p as the pdf plots above.
Inverse The formula for the inverse survival function of the power normal
Survival distribution is
Function
The following is the plot of the power normal inverse survival function
with the same values of p as the pdf plots above.
Common The statistics for the power normal distribution are complicated and
Statistics require tables. Nelson discusses the mean, median, mode, and standard
deviation of the power normal distribution and provides references to
the appropriate tables.
Software Most general purpose statistical software programs do not support the
probability functions for the power normal distribution. Dataplot does
support them.
where p (also referred to as the power parameter) and σ are the shape parameters,
Φ is the cumulative distribution function of the standard normal distribution, and
φ is the probability density function of the standard normal distribution.
Cumulative The formula for the cumulative distribution function of the power lognormal
Distribution distribution is
Function
Percent The formula for the percent point function of the power lognormal distribution is
Point
Function
Hazard The formula for the hazard function of the power lognormal distribution is
Function
The following is the plot of the power lognormal hazard function with the same
values of p as the pdf plots above.
Cumulative The formula for the cumulative hazard function of the power lognormal
Hazard distribution is
Function
The following is the plot of the power lognormal cumulative hazard function with
the same values of p as the pdf plots above.
Survival The formula for the survival function of the power lognormal distribution is
Function
The following is the plot of the power lognormal survival function with the same
values of p as the pdf plots above.
Inverse The formula for the inverse survival function of the power lognormal distribution is
Survival
Function
The following is the plot of the power lognormal inverse survival function with the
same values of p as the pdf plots above.
Common The statistics for the power lognormal distribution are complicated and require
Statistics tables. Nelson discusses the mean, median, mode, and standard deviation of the
power lognormal distribution and provides references to the appropriate tables.
Parameter Nelson discusses maximum likelihood estimation for the power lognormal
Estimation distribution. These estimates need to be performed with computer software.
Software for maximum likelihood estimation of the parameters of the power
lognormal distribution is not as readily available as for other reliability
distributions such as the exponential, Weibull, and lognormal.
Software Most general purpose statistical software programs do not support the probability
functions for the power lognormal distribution. Dataplot does support them.
Cumulative The cumulative distribution function of the Tukey-Lambda distribution does not have
Distribution a simple, closed form. It is computed numerically.
Function
The following is the plot of the Tukey-Lambda cumulative distribution
function with the same values of λ as the pdf plots above.
Percent The formula for the percent point function of the standard form of the
Point Tukey-Lambda distribution is
Function
Software Most general purpose statistical software programs do not support the
probability functions for the Tukey-Lambda distribution. Dataplot does
support them.
The following is the plot of the Gumbel probability density function for
the minimum case.
The general formula for the probability density function of the Gumbel
(maximum) distribution is
The following is the plot of the Gumbel probability density function for
the maximum case.
Cumulative The formula for the cumulative distribution function of the Gumbel
Distribution distribution (minimum) is
Function
Percent The formula for the percent point function of the Gumbel distribution
Point (minimum) is
Function
The following is the plot of the Gumbel percent point function for the
minimum case.
The formula for the percent point function of the Gumbel distribution
(maximum) is
The following is the plot of the Gumbel percent point function for the
maximum case.
Hazard The formula for the hazard function of the Gumbel distribution
Function (minimum) is
The following is the plot of the Gumbel hazard function for the
minimum case.
The formula for the hazard function of the Gumbel distribution (maximum) is
The following is the plot of the Gumbel hazard function for the
maximum case.
Cumulative The formula for the cumulative hazard function of the Gumbel
Hazard distribution (minimum) is
Function
The following is the plot of the Gumbel cumulative hazard function for
the minimum case.
The following is the plot of the Gumbel cumulative hazard function for
the maximum case.
Survival The formula for the survival function of the Gumbel distribution
Function (minimum) is
The following is the plot of the Gumbel survival function for the
minimum case.
The following is the plot of the Gumbel survival function for the
maximum case.
Inverse The formula for the inverse survival function of the Gumbel distribution
Survival (minimum) is
Function
The following is the plot of the Gumbel inverse survival function for the
minimum case.
The formula for the inverse survival function of the Gumbel distribution
(maximum) is
The following is the plot of the Gumbel inverse survival function for the
maximum case.
Common The formulas below are for the maximum order statistic case.
Statistics Mean
Skewness 1.13955
Kurtosis 5.4
Coefficient of
Variation
where
The following is the plot of the binomial probability density function for four
values of p and n = 100.
The following is the plot of the binomial cumulative distribution function with
the same values of p as the pdf plots above.
Percent The binomial percent point function does not exist in simple closed form. It is
Point computed numerically. Note that because this is a discrete distribution that is
Function only defined for integer values of x, the percent point function is not smooth in
the way the percent point function typically is for a continuous distribution.
The following is the plot of the binomial percent point function with the same
values of p as the pdf plots above.
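A sketch of evaluating this step-function quantile in Python (the values of n and p are arbitrary, and scipy is an assumption of this example):

# Sketch: the binomial percent point function, evaluated numerically as the smallest
# integer x whose cumulative probability is at least the requested level.
from scipy.stats import binom

n, p = 100, 0.5
print(binom.ppf(0.95, n, p))     # 58.0 for these values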
Common      Mean                      Np
Statistics  Mode                      the largest integer less than or equal to (N+1)p
            Range                     0 to N
            Standard Deviation        sqrt(Np(1-p))
            Coefficient of Variation  sqrt((1-p)/(Np))
            Coefficient of Skewness   (1-2p)/sqrt(Np(1-p))
            Coefficient of Kurtosis   3 - 6/N + 1/(Np(1-p))
Comments The binomial distribution is probably the most commonly used discrete
distribution.
Software Most general purpose statistical software programs, including Dataplot, support
at least some of the probability functions for the binomial distribution.
Percent The Poisson percent point function does not exist in simple closed form.
Point It is computed numerically. Note that because this is a discrete
Function distribution that is only defined for integer values of x, the percent point
function is not smooth in the way the percent point function typically is
for a continuous distribution.
The following is the plot of the Poisson percent point function with the
same values of λ as the pdf plots above.
Common      Mean                      λ
Statistics  Mode                      For non-integer λ, it is the largest integer less
                                      than λ. For integer λ, x = λ and x = λ - 1 are
                                      both the mode.
            Range                     0 to positive infinity
            Standard Deviation        sqrt(λ)
            Coefficient of Variation  1/sqrt(λ)
            Coefficient of Skewness   1/sqrt(λ)
            Coefficient of Kurtosis   3 + 1/λ
Values in this table are the area under the standard normal curve between 0 and x, where x is the row value plus the column value.
 x    0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0  0.00000 0.00399 0.00798 0.01197 0.01595 0.01994 0.02392 0.02790 0.03188 0.03586
0.1  0.03983 0.04380 0.04776 0.05172 0.05567 0.05962 0.06356 0.06749 0.07142 0.07535
0.2  0.07926 0.08317 0.08706 0.09095 0.09483 0.09871 0.10257 0.10642 0.11026 0.11409
0.3  0.11791 0.12172 0.12552 0.12930 0.13307 0.13683 0.14058 0.14431 0.14803 0.15173
0.4  0.15542 0.15910 0.16276 0.16640 0.17003 0.17364 0.17724 0.18082 0.18439 0.18793
0.5  0.19146 0.19497 0.19847 0.20194 0.20540 0.20884 0.21226 0.21566 0.21904 0.22240
0.6  0.22575 0.22907 0.23237 0.23565 0.23891 0.24215 0.24537 0.24857 0.25175 0.25490
0.7  0.25804 0.26115 0.26424 0.26730 0.27035 0.27337 0.27637 0.27935 0.28230 0.28524
0.8  0.28814 0.29103 0.29389 0.29673 0.29955 0.30234 0.30511 0.30785 0.31057 0.31327
0.9  0.31594 0.31859 0.32121 0.32381 0.32639 0.32894 0.33147 0.33398 0.33646 0.33891
1.0  0.34134 0.34375 0.34614 0.34849 0.35083 0.35314 0.35543 0.35769 0.35993 0.36214
1.1  0.36433 0.36650 0.36864 0.37076 0.37286 0.37493 0.37698 0.37900 0.38100 0.38298
1.2  0.38493 0.38686 0.38877 0.39065 0.39251 0.39435 0.39617 0.39796 0.39973 0.40147
1.3  0.40320 0.40490 0.40658 0.40824 0.40988 0.41149 0.41308 0.41466 0.41621 0.41774
1.4  0.41924 0.42073 0.42220 0.42364 0.42507 0.42647 0.42785 0.42922 0.43056 0.43189
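Any entry in this table can be reproduced with statistical software; a sketch in Python (scipy is an assumption of this example):

# Sketch: reproduce a table entry -- the area under the standard normal curve from 0 to x.
from scipy.stats import norm

x = 1.23                          # row 1.2, column 0.03
print(norm.cdf(x) - 0.5)          # approximately 0.39065, matching the table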
1 37.544 61 2.455
2 7.582 62 2.454
3 4.826 63 2.453
4 3.941 64 2.452
5 3.518 65 2.451
6 3.274 66 2.450
7 3.115 67 2.449
8 3.004 68 2.448
9 2.923 69 2.447
10 2.860 70 2.446
11 2.811 71 2.445
12 2.770 72 2.445
13 2.737 73 2.444
14 2.709 74 2.443
15 2.685 75 2.442
16 2.665 76 2.441
17 2.647 77 2.441
18 2.631 78 2.440
19 2.617 79 2.439
20 2.605 80 2.439
21 2.594 81 2.438
22 2.584 82 2.437
23 2.574 83 2.437
24 2.566 84 2.436
25 2.558 85 2.436
26 2.551 86 2.435
27 2.545 87 2.435
28 2.539 88 2.434
29 2.534 89 2.434
30 2.528 90 2.433
31 2.524 91 2.432
32 2.519 92 2.432
33 2.515 93 2.431
34 2.511 94 2.431
35 2.507 95 2.431
36 2.504 96 2.430
37 2.501 97 2.430
38 2.498 98 2.429
39 2.495 99 2.429
40 2.492 100 2.428
41 2.489 101 2.428
42 2.487 102 2.428
43 2.484 103 2.427
44 2.482 104 2.427
45 2.480 105 2.426
46 2.478 106 2.426
47 2.476 107 2.426
48 2.474 108 2.425
49 2.472 109 2.425
50 2.470 110 2.425
51 2.469 111 2.424
52 2.467 112 2.424
53 2.466 113 2.424
54 2.464 114 2.423
55 2.463 115 2.423
56 2.461 116 2.423
57 2.460 117 2.422
58 2.459 118 2.422
59 2.457 119 2.422
60 2.456 120 2.422
Critical values of the normal PPCC for testing if data come from
a normal distribution
N 0.05 0.01
3 0.8687 0.8790
4 0.8234 0.8666
5 0.8240 0.8786
6 0.8351 0.8880
7 0.8474 0.8970
8 0.8590 0.9043
9 0.8689 0.9115
10 0.8765 0.9173
11 0.8838 0.9223
12 0.8918 0.9267
13 0.8974 0.9310
14 0.9029 0.9343
15 0.9080 0.9376
16 0.9121 0.9405
17 0.9160 0.9433
18 0.9196 0.9452
19 0.9230 0.9479
20 0.9256 0.9498
21 0.9285 0.9515
22 0.9308 0.9535
23 0.9334 0.9548
24 0.9356 0.9564
25 0.9370 0.9575
26 0.9393 0.9590
27 0.9413 0.9600
28 0.9428 0.9615
29 0.9441 0.9622
30 0.9462 0.9634
31 0.9476 0.9644
32 0.9490 0.9652
33 0.9505 0.9661
34 0.9521 0.9671
35 0.9530 0.9678
36 0.9540 0.9686
37 0.9551 0.9693
38 0.9555 0.9700
39 0.9568 0.9704
40 0.9576 0.9712
41 0.9589 0.9719
42 0.9593 0.9723
43 0.9609 0.9730
44 0.9611 0.9734
45 0.9620 0.9739
46 0.9629 0.9744
47 0.9637 0.9748
48 0.9640 0.9753
49 0.9643 0.9758
50 0.9654 0.9761
55 0.9683 0.9781
60 0.9706 0.9797
65 0.9723 0.9809
70 0.9742 0.9822
75 0.9758 0.9831
80 0.9771 0.9841
85 0.9784 0.9850
90 0.9797 0.9857
95 0.9804 0.9864
100 0.9814 0.9869
110 0.9830 0.9881
120 0.9841 0.9889
130 0.9854 0.9897
140 0.9865 0.9904
150 0.9871 0.9909
160 0.9879 0.9915
170 0.9887 0.9919
180 0.9891 0.9923
190 0.9897 0.9927
200 0.9903 0.9930
210 0.9907 0.9933
220 0.9910 0.9936
230 0.9914 0.9939
240 0.9917 0.9941
250 0.9921 0.9943
260 0.9924 0.9945
270 0.9926 0.9947
280 0.9929 0.9949
290 0.9931 0.9951
300 0.9933 0.9952
310 0.9936 0.9954
320 0.9937 0.9955
330 0.9939 0.9956
340 0.9941 0.9957
350 0.9942 0.9958
360 0.9944 0.9959
370 0.9945 0.9960
380 0.9947 0.9961
390 0.9948 0.9962
400 0.9949 0.9963
410 0.9950 0.9964
420 0.9951 0.9965
430 0.9953 0.9966
440 0.9954 0.9966
450 0.9954 0.9967
460 0.9955 0.9968
470 0.9956 0.9968
480 0.9957 0.9969
490 0.9958 0.9969
500 0.9959 0.9970
525 0.9961 0.9972
550 0.9963 0.9973
Table of Contents for Section 4
1. Introduction
2. By Problem Category
Yi = C + Ei
If the above assumptions are satisfied, the process is said to be
statistically "in control" with the core characteristic of having
"predictability". That is, probability statements can be made about the
process, not only in the past, but also in the future.
An appropriate model for an "in control" process is
Yi = C + Ei
where C is a constant (the "deterministic" or "structural" component),
and where Ei is the error term (or "random" component).
is valid.
4. Distributional tests assist in determining a better estimator, if
needed.
5. Simulator tools (namely bootstrapping) provide values for the
uncertainty of alternative estimators.
Assumptions If one or more of the above assumptions is not satisfied, then we use
not satisfied EDA techniques, or some mix of EDA and classical techniques, to
find a more appropriate model for the data. That is,
Yi = D + Ei
where D is the deterministic part and E is an error component.
If the data are not random, then we may investigate fitting some
simple time series models to the data. If the constant location and
scale assumptions are violated, we may need to investigate the
measurement process to see if there is an explanation.
The assumptions on the error term are still quite relevant in the sense
that for an appropriate model the error component should follow the
assumptions. The criterion for validating the model, or comparing
competing models, is framed in terms of these assumptions.
Multivariable Although the case studies in this chapter utilize univariate data, the
data assumptions above are relevant for multivariable data as well.
If the data are not univariate, then we are trying to find a model
Yi = F(X1, ..., Xk) + Ei
where F is some function based on one or more variables. The error
component, which is a univariate data set, of a good model should
satisfy the assumptions given above. The criterion for validating and
comparing models is based on how well the error component follows
these assumptions.
The load cell calibration case study in the process modeling chapter
shows an example of this in the regression context.
First three The first three case studies utilize data that are randomly generated
case studies from the following distributions:
utilize data
with known ● normal distribution with mean 0 and standard deviation 1
characteristics ● uniform distribution with mean 0.5 and standard deviation 0.2887
(uniform over the interval (0,1))
● random walk
The other univariate case studies utilize data from scientific processes.
The goal is to determine if
Yi = C + Ei
is a reasonable model. This is done by testing the underlying
assumptions. If the assumptions are satisfied, then an estimate of C
and an estimate of the uncertainty of C are computed. If the
assumptions are not satisfied, we attempt to find a model where the
error component does satisfy the underlying assumptions.
Graphical To test the underlying assumptions, each data set is analyzed using
methods that four graphical methods that are particularly suited for this purpose:
are applied to 1. run sequence plot which is useful for detecting shifts of location
the data or scale
2. lag plot which is useful for detecting non-randomness in the
data
3. histogram which is useful for trying to determine the underlying
distribution
4. normal probability plot for deciding whether the data follow the
normal distribution
There are a number of other techniques for addressing the underlying assumptions.
Quantitative The normal and uniform random number data sets are also analyzed
methods that with the following quantitative techniques, which are explained in
are applied to more detail in an earlier section:
the data 1. Summary statistics which include:
❍ mean
❍ standard deviation
❍ autocorrelation coefficient to test for randomness
❍ normal and uniform probability plot correlation
coefficients (ppcc) to test for a normal or uniform
distribution, respectively
❍ Wilk-Shapiro test for a normal distribution
Reliability: Airplane Glass Failure Time
Multi-Factor: Ceramic Strength
Resulting The following is the set of normal random numbers used for this case
Data study.
4-Plot of Data
Run Sequence Plot
Lag Plot
Histogram (with overlaid Normal PDF)
Normal Probability Plot
SUMMARY
***********************************************************************
* LOCATION MEASURES * DISPERSION MEASURES *
***********************************************************************
* MIDRANGE = 0.3945000E+00 * RANGE = 0.6083000E+01 *
* MEAN = -0.2935997E-02 * STAND. DEV. = 0.1021041E+01 *
* MIDMEAN = 0.1623600E-01 * AV. AB. DEV. = 0.8174360E+00 *
* MEDIAN = -0.9300000E-01 * MINIMUM = -0.2647000E+01 *
* = * LOWER QUART. = -0.7204999E+00 *
* = * LOWER HINGE = -0.7210000E+00 *
* = * UPPER HINGE = 0.6455001E+00 *
* = * UPPER QUART. = 0.6447501E+00 *
* = * MAXIMUM = 0.3436000E+01 *
***********************************************************************
* RANDOMNESS MEASURES * DISTRIBUTIONAL MEASURES *
***********************************************************************
* AUTOCO COEF = 0.4505888E-01 * ST. 3RD MOM. = 0.3072273E+00 *
* = 0.0000000E+00 * ST. 4TH MOM. = 0.2990314E+01 *
* = 0.0000000E+00 * ST. WILK-SHA = 0.7515639E+01 *
* = * UNIFORM PPCC = 0.9756625E+00 *
* = * NORMAL PPCC = 0.9961721E+00 *
* = * TUK -.5 PPCC = 0.8366451E+00 *
* = * CAUCHY PPCC = 0.4922674E+00 *
***********************************************************************
Location One way to quantify a change in location over time is to fit a straight line to the data set,
using the index variable X = 1, 2, ..., N, with N denoting the number of observations. If
there is no significant drift in the location, the slope parameter should be zero. For this data
set, Dataplot generated the following output:
The slope parameter, A1, has a t value of -0.13 which is statistically not significant. This
indicates that the slope can in fact be considered zero.
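The output shown in this case study comes from Dataplot; as an illustrative sketch, the same check can be written in Python with scipy (the generated array below is a stand-in, not the case-study data):

# Sketch: check for location drift by regressing the data on the index 1..N and
# testing whether the slope differs significantly from zero.
import numpy as np
from scipy import stats

y = np.random.default_rng(0).normal(size=500)     # stand-in for the case-study data
x = np.arange(1, len(y) + 1)
res = stats.linregress(x, y)
print(res.slope, res.slope / res.stderr)           # slope estimate and its t value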
Variation One simple way to detect a change in variation is with a Bartlett test, after dividing the data
set into several equal-sized intervals. The choice of the number of intervals is somewhat
arbitrary, although values of 4 or 8 are reasonable. Dataplot generated the following output
for the Bartlett test.
BARTLETT TEST
(STANDARD DEFINITION)
NULL HYPOTHESIS UNDER TEST--ALL SIGMA(I) ARE EQUAL
TEST:
DEGREES OF FREEDOM = 3.000000
In this case, the Bartlett test indicates that the standard deviations are not significantly
different in the 4 intervals.
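A sketch of the same check in Python (again with stand-in data; scipy's bartlett function is an assumption of this example):

# Sketch: Bartlett test for equal standard deviations across 4 equal-sized intervals.
import numpy as np
from scipy import stats

y = np.random.default_rng(0).normal(size=500)      # stand-in for the case-study data
groups = np.array_split(y, 4)                       # divide the series into 4 intervals
stat, pvalue = stats.bartlett(*groups)
print(stat, pvalue)                                 # a large p-value indicates no change in variation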
Randomness
There are many ways in which data can be non-random. However, most common forms of
non-randomness can be detected with a few simple tests. The lag plot in the 4-plot above is
a simple graphical technique.
Another check is an autocorrelation plot that shows the autocorrelations for various lags.
Confidence bands can be plotted at the 95% and 99% confidence levels. Points outside this
band indicate statistically significant values (lag 0 is always 1). Dataplot generated the
following autocorrelation plot.
The lag 1 autocorrelation, which is generally the one of most interest, is 0.045. The critical
values at the 5% significance level are -0.087 and 0.087. Thus, since 0.045 is in the interval,
the lag 1 autocorrelation is not statistically significant, so there is no evidence of
non-randomness.
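A minimal sketch of this check in Python (stand-in data; the band of +/- 1.96/sqrt(N) is the usual large-sample approximation):

# Sketch: lag 1 autocorrelation with approximate 95% critical values +/- 1.96/sqrt(N).
import numpy as np

y = np.random.default_rng(0).normal(size=500)       # stand-in for the case-study data
yc = y - y.mean()
r1 = np.sum(yc[1:] * yc[:-1]) / np.sum(yc**2)       # lag 1 autocorrelation coefficient
crit = 1.96 / np.sqrt(len(y))
print(r1, (-crit, crit))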
A common test for randomness is the runs test.
RUNS UP
STATISTIC = NUMBER OF RUNS UP
OF LENGTH EXACTLY I
I STAT EXP(STAT) SD(STAT) Z
Values in the column labeled "Z" greater than 1.96 or less than -1.96 are statistically
significant at the 5% level. The runs test does not indicate any significant non-randomness.
Distributional Probability plots are a graphical test for assessing if a particular distribution provides an
Analysis adequate fit to a data set.
A quantitative enhancement to the probability plot is the correlation coefficient of the points
on the probability plot. For this data set the correlation coefficient is 0.996. Since this is
greater than the critical value of 0.987 (this is a tabulated value), the normality assumption
is not rejected.
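As an illustration, the probability plot correlation coefficient can be obtained from a normal probability plot computed in Python (a sketch with stand-in data):

# Sketch: normal probability plot correlation coefficient (PPCC).
import numpy as np
from scipy import stats

y = np.random.default_rng(0).normal(size=500)       # stand-in for the case-study data
(osm, osr), (slope, intercept, r) = stats.probplot(y, dist="norm")
print(r)                                            # compare against the tabulated critical value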
Chi-square and Kolmogorov-Smirnov goodness-of-fit tests are alternative methods for
assessing distributional adequacy. The Wilk-Shapiro and Anderson-Darling tests can be
used to test for normality. Dataplot generates the following output for the Anderson-Darling
normality test.
1. STATISTICS:
NUMBER OF OBSERVATIONS = 500
MEAN = -0.2935997E-02
STANDARD DEVIATION = 1.021041
2. CRITICAL VALUES:
90 % POINT = 0.6560000
95 % POINT = 0.7870000
97.5 % POINT = 0.9180000
99 % POINT = 1.092000
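A sketch of the Anderson-Darling check in Python (stand-in data; note that scipy reports its own set of critical values, which may be scaled slightly differently from the values listed above):

# Sketch: Anderson-Darling test for normality.
import numpy as np
from scipy import stats

y = np.random.default_rng(0).normal(size=500)        # stand-in for the case-study data
result = stats.anderson(y, dist="norm")
print(result.statistic)
print(result.significance_level)                      # significance levels (in percent)
print(result.critical_values)                         # corresponding critical values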
Outlier A test for outliers is the Grubbs test. Dataplot generated the following output for Grubbs'
Analysis test.
1. STATISTICS:
NUMBER OF OBSERVATIONS = 500
MINIMUM = -2.647000
MEAN = -0.2935997E-02
MAXIMUM = 3.436000
STANDARD DEVIATION = 1.021041
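A minimal sketch of the two-sided Grubbs statistic and its critical value in Python (stand-in data; the critical value is computed from the t distribution in the usual way):

# Sketch: Grubbs' test for a single outlier (two-sided).
import numpy as np
from scipy import stats

y = np.random.default_rng(0).normal(size=500)         # stand-in for the case-study data
N = len(y)
G = np.max(np.abs(y - y.mean())) / y.std(ddof=1)       # Grubbs test statistic
alpha = 0.05
t = stats.t.ppf(1 - alpha / (2 * N), N - 2)
G_crit = (N - 1) / np.sqrt(N) * np.sqrt(t**2 / (N - 2 + t**2))
print(G, G_crit)                                       # no outlier detected if G < G_crit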
Model Since the underlying assumptions were validated both graphically and analytically, we
conclude that a reasonable model for the data is:
Yi = -0.00294 + Ei
We can express the uncertainty for C as the 95% confidence interval (-0.09266,0.086779).
Univariate It is sometimes useful and convenient to summarize the above results in a report. The report
Report for the 500 normal random numbers follows.
2: Location
Mean = -0.00294
Standard Deviation of Mean = 0.045663
95% Confidence Interval for Mean = (-0.09266,0.086779)
Drift with respect to location? = NO
3: Variation
Standard Deviation = 1.021042
95% Confidence Interval for SD = (0.961437,1.088585)
Drift with respect to variation?
(based on Bartletts test on quarters
of the data) = NO
4: Distribution
Normal PPCC = 0.996173
5: Randomness
Autocorrelation = 0.045059
Data are Random?
(as measured by autocorrelation) = YES
6: Statistical Control
(i.e., no drift in location or scale,
data are random, distribution is
fixed, here we are testing only for
fixed normal)
Data Set is in Statistical Control? = YES
7: Outliers?
(as determined by Grubbs' test) = NO
Click on the links below to start Dataplot and run this case study yourself. Each step may use
results from previous steps, so please be patient. Wait until the software verifies that the
current step is complete before clicking on the next step.
The links in this column will connect you with more detailed information about each analysis
step from the case study description.
1. Generate a run sequence plot. 1. The run sequence plot indicates that
there are no shifts of location or
scale.
2. Generate a lag plot. 2. The lag plot does not indicate any
significant patterns (which would
show the data were not random).
5. Check for normality by computing the 5. The normal probability plot correlation
normal probability plot correlation coefficient is 0.996. At the 5% level,
coefficient. we cannot reject the normality assumption.
6. Check for outliers using Grubbs' test. 6. Grubbs' test detects no outliers at the
5% level.
Resulting The following is the set of uniform random numbers used for this case
Data study.
4-Plot of Data
Run Sequence Plot
Lag Plot
Histogram (with overlaid Normal PDF)
This plot shows that a normal distribution is a poor fit. The flatness of
the histogram suggests that a uniform distribution might be a better fit.
Histogram
(with
overlaid
Uniform
PDF)
Since the histogram from the 4-plot suggested that the uniform
distribution might be a good fit, we overlay a uniform distribution on
top of the histogram. This indicates a much better fit than a normal
distribution.
Normal
Probability
Plot
As with the histogram, the normal probability plot shows that the
normal distribution does not fit these data well.
Uniform
Probability
Plot
Better Model Since the data follow the underlying assumptions, but with a uniform
distribution rather than a normal distribution, we would still like to
characterize C by a typical value plus or minus a confidence interval.
In this case, we would like to find a location estimator with the
smallest variability.
The bootstrap plot is an ideal tool for this purpose. The following plots
show the bootstrap plot, with the corresponding histogram, for the
mean, median, mid-range, and median absolute deviation.
Bootstrap
Plots
Mid-Range is From the above histograms, it is obvious that for these data, the
Best mid-range is far superior to the mean or median as an estimate for
location.
Using the mean, the location estimate is 0.507 and a 95% confidence
interval for the mean is (0.482,0.534). Using the mid-range, the
location estimate is 0.499 and the 95% confidence interval for the
mid-range is (0.497,0.503).
Although the values for the location are similar, the difference in the
uncertainty intervals is quite large.
Note that in the case of a uniform distribution it is known theoretically
that the mid-range is the best linear unbiased estimator for location.
However, in many applications, the most appropriate estimator will not
be known or it will be mathematically intractable to determine a valid
confidence interval. The bootstrap provides a method for determining a
reasonable confidence interval in such cases.
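A minimal sketch of such a bootstrap comparison in Python (the uniform sample below is a stand-in for the case-study data, and 1000 resamples is an arbitrary choice):

# Sketch: bootstrap comparison of the mean and the mid-range as location estimators.
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(0, 1, size=500)                        # stand-in for the case-study data
boot_mean, boot_midrange = [], []
for _ in range(1000):
    s = rng.choice(y, size=len(y), replace=True)       # resample with replacement
    boot_mean.append(s.mean())
    boot_midrange.append((s.min() + s.max()) / 2)
print(np.percentile(boot_mean, [2.5, 97.5]))           # bootstrap 95% interval for the mean
print(np.percentile(boot_midrange, [2.5, 97.5]))       # noticeably narrower for the mid-range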
SUMMARY
***********************************************************************
* LOCATION MEASURES * DISPERSION MEASURES *
***********************************************************************
* MIDRANGE = 0.4997850E+00 * RANGE = 0.9945900E+00 *
* MEAN = 0.5078304E+00 * STAND. DEV. = 0.2943252E+00 *
* MIDMEAN = 0.5045621E+00 * AV. AB. DEV. = 0.2526468E+00 *
* MEDIAN = 0.5183650E+00 * MINIMUM = 0.2490000E-02 *
* = * LOWER QUART. = 0.2508093E+00 *
* = * LOWER HINGE = 0.2505935E+00 *
* = * UPPER HINGE = 0.7594775E+00 *
* = * UPPER QUART. = 0.7591152E+00 *
* = * MAXIMUM = 0.9970800E+00 *
***********************************************************************
* RANDOMNESS MEASURES * DISTRIBUTIONAL MEASURES *
***********************************************************************
* AUTOCO COEF = -0.3098569E-01 * ST. 3RD MOM. = -0.3443941E-01 *
* = 0.0000000E+00 * ST. 4TH MOM. = 0.1796969E+01 *
* = 0.0000000E+00 * ST. WILK-SHA = -0.2004886E+02 *
* = * UNIFORM PPCC = 0.9995682E+00 *
* = * NORMAL PPCC = 0.9771602E+00 *
* = * TUK -.5 PPCC = 0.7229201E+00 *
* = * CAUCHY PPCC = 0.3591767E+00 *
***********************************************************************
Note that under the distributional measures the uniform probability plot correlation
coefficient (PPCC) value is significantly larger than the normal PPCC value. This is
evidence that the uniform distribution fits these data better than does a normal distribution.
Location One way to quantify a change in location over time is to fit a straight line to the data set
using the index variable X = 1, 2, ..., N, with N denoting the number of observations. If
there is no significant drift in the location, the slope parameter should be zero. For this data
set, Dataplot generated the following output:
The slope parameter, A1, has a t value of -0.66 which is statistically not significant. This
indicates that the slope can in fact be considered zero.
Variation One simple way to detect a change in variation is with a Bartlett test after dividing the data
set into several equal-sized intervals. However, the Bartlett test is not robust for
non-normality. Since we know this data set is not approximated well by the normal
distribution, we use the alternative Levene test. In particular, we use the Levene test based
on the median rather than the mean. The choice of the number of intervals is somewhat arbitrary,
although values of 4 or 8 are reasonable. Dataplot generated the following output for the
Levene test.
1. STATISTICS
NUMBER OF OBSERVATIONS = 500
NUMBER OF GROUPS = 4
LEVENE F TEST STATISTIC = 0.7983007E-01
In this case, the Levene test indicates that the standard deviations are not significantly
different in the 4 intervals.
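A sketch of the median-based Levene (Brown-Forsythe) check in Python (stand-in data):

# Sketch: Levene test, centered on the median, across 4 equal-sized intervals.
import numpy as np
from scipy import stats

y = np.random.default_rng(0).uniform(0, 1, size=500)    # stand-in for the case-study data
groups = np.array_split(y, 4)
stat, pvalue = stats.levene(*groups, center="median")
print(stat, pvalue)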
Randomness
There are many ways in which data can be non-random. However, most common forms of
non-randomness can be detected with a few simple tests. The lag plot in the 4-plot in the
previous section is a simple graphical technique.
Another check is an autocorrelation plot that shows the autocorrelations for various lags.
Confidence bands can be plotted using 95% and 99% confidence levels. Points outside this
band indicate statistically significant values (lag 0 is always 1). Dataplot generated the
following autocorrelation plot.
The lag 1 autocorrelation, which is generally the one of most interest, is 0.03. The critical
values at the 5% significance level are -0.087 and 0.087. This indicates that the lag 1
autocorrelation is not statistically significant, so there is no evidence of non-randomness.
A common test for randomness is the runs test.
RUNS UP
STATISTIC = NUMBER OF RUNS UP
OF LENGTH EXACTLY I
I STAT EXP(STAT) SD(STAT) Z
Values in the column labeled "Z" greater than 1.96 or less than -1.96 are statistically
significant at the 5% level. This runs test does not indicate any significant non-randomness.
There is one statistically significant value, for runs of length 7. However, further examination
of the table shows that there is in fact a single run of length 7 when close to zero runs of that
length are expected. This is not sufficient evidence to conclude that the data are non-random.
Distributional Probability plots are a graphical test for assessing whether a particular distribution provides
Analysis an adequate fit to a data set.
A quantitative enhancement to the probability plot is the correlation coefficient of the points
on the probability plot. For this data set the correlation coefficient, from the summary table
above, is 0.977. Since this is less than the critical value of 0.987 (this is a tabulated value),
the normality assumption is rejected.
Chi-square and Kolmogorov-Smirnov goodness-of-fit tests are alternative methods for
assessing distributional adequacy. The Wilk-Shapiro and Anderson-Darling tests can be
used to test for normality. Dataplot generates the following output for the Anderson-Darling
normality test.
1. STATISTICS:
NUMBER OF OBSERVATIONS = 500
MEAN = 0.5078304
STANDARD DEVIATION = 0.2943252
2. CRITICAL VALUES:
90 % POINT = 0.6560000
95 % POINT = 0.7870000
97.5 % POINT = 0.9180000
99 % POINT = 1.092000
Model Based on the graphical and quantitative analysis, we use the model
Yi = C + Ei
where C is estimated by the mid-range and the uncertainty interval for C is based on a
bootstrap analysis. Specifically,
C = 0.499
95% confidence limit for C = (0.497,0.503)
Univariate It is sometimes useful and convenient to summarize the above results in a report. The report
Report for the 500 uniform random numbers follows.
2: Location
Mean = 0.50783
Standard Deviation of Mean = 0.013163
95% Confidence Interval for Mean = (0.48197,0.533692)
Drift with respect to location? = NO
3: Variation
Standard Deviation = 0.294326
95% Confidence Interval for SD = (0.277144,0.313796)
Drift with respect to variation?
(based on Levene's test on quarters
of the data) = NO
4: Distribution
Uniform PPCC = 0.999568
Normal PPCC = 0.977160
Data are Normal?
(as measured by Normal PPCC) = NO
5: Randomness
Autocorrelation = -0.03099
Data are Random?
(as measured by autocorrelation) = YES
6: Statistical Control
(i.e., no drift in location or scale,
data are random, distribution is
fixed, here we are testing only for
fixed uniform)
Data Set is in Statistical Control? = YES
Click on the links below to start Dataplot and run this case study yourself. Each step may use
results from previous steps, so please be patient. Wait until the software verifies that the
current step is complete before clicking on the next step.
The links in this column will connect you with more detailed information about each analysis
step from the case study description.
1. Generate a run sequence plot. 1. The run sequence plot indicates that
there are no shifts of location or
scale.
2. Generate a lag plot. 2. The lag plot does not indicate any
significant patterns (which would
show the data were not random).
5. Check for normality by computing the 5. The uniform probability plot correlation
normal probability plot correlation coefficient is 0.9995. This indicates that
coefficient. the uniform distribution is a good fit.
Resulting The following is the set of random walk numbers used for this case
Data study.
-0.399027
-0.645651
-0.625516
-0.262049
-0.407173
-0.097583
0.314156
0.106905
-0.017675
-0.037111
0.357631
0.820111
0.844148
0.550509
0.090709
0.413625
-0.002149
0.393170
0.538263
0.070583
0.473143
0.132676
0.109111
-0.310553
0.179637
-0.067454
-0.190747
-0.536916
-0.905751
-0.518984
-0.579280
-0.643004
-1.014925
-0.517845
-0.860484
-0.884081
-1.147428
-0.657917
-0.470205
-0.798437
-0.637780
-0.666046
-1.093278
-1.089609
-0.853439
-0.695306
-0.206795
-0.507504
-0.696903
-1.116358
-1.044534
-1.481004
-1.638390
-1.270400
-1.026477
-1.123380
-0.770683
-0.510481
-0.958825
-0.531959
-0.457141
-0.226603
-0.201885
-0.078000
0.057733
-0.228762
-0.403292
-0.414237
-0.556689
-0.772007
-0.401024
-0.409768
-0.171804
-0.096501
-0.066854
0.216726
0.551008
0.660360
0.194795
-0.031321
0.453880
0.730594
1.136280
0.708490
1.149048
1.258757
1.102107
1.102846
0.720896
0.764035
1.072312
0.897384
0.965632
0.759684
0.679836
0.955514
1.290043
1.753449
1.542429
1.873803
2.043881
1.728635
1.289703
1.501481
1.888335
1.408421
1.416005
0.929681
1.097632
1.501279
1.650608
1.759718
2.255664
2.490551
2.508200
2.707382
2.816310
3.254166
2.890989
2.869330
3.024141
3.291558
3.260067
3.265871
3.542845
3.773240
3.991880
3.710045
4.011288
4.074805
4.301885
3.956416
4.278790
3.989947
4.315261
4.200798
4.444307
4.926084
4.828856
4.473179
4.573389
4.528605
4.452401
4.238427
4.437589
4.617955
4.370246
4.353939
4.541142
4.807353
4.706447
4.607011
4.205943
3.756457
3.482142
3.126784
3.383572
3.846550
4.228803
4.110948
4.525939
4.478307
4.457582
4.822199
4.605752
5.053262
5.545598
5.134798
5.438168
5.397993
5.838361
5.925389
6.159525
6.190928
6.024970
5.575793
5.516840
5.211826
4.869306
4.912601
5.339177
5.415182
5.003303
4.725367
4.350873
4.225085
3.825104
3.726391
3.301088
3.767535
4.211463
4.418722
4.554786
4.987701
4.993045
5.337067
5.789629
5.726147
5.934353
5.641670
5.753639
5.298265
5.255743
5.500935
5.434664
5.588610
6.047952
6.130557
5.785299
5.811995
5.582793
5.618730
5.902576
6.226537
5.738371
5.449965
5.895537
6.252904
6.650447
7.025909
6.770340
7.182244
6.941536
7.368996
7.293807
7.415205
7.259291
6.970976
7.319743
6.850454
6.556378
6.757845
6.493083
6.824855
6.533753
6.410646
6.502063
6.264585
6.730889
6.753715
6.298649
6.048126
5.794463
5.539049
5.290072
5.409699
5.843266
5.680389
5.185889
5.451353
5.003233
5.102844
5.566741
5.613668
5.352791
5.140087
4.999718
5.030444
5.428537
5.471872
5.107334
5.387078
4.889569
4.492962
4.591042
4.930187
4.857455
4.785815
5.235515
4.865727
4.855005
4.920206
4.880794
4.904395
4.795317
5.163044
4.807122
5.246230
5.111000
5.228429
5.050220
4.610006
4.489258
4.399814
4.606821
4.974252
5.190037
5.084155
5.276501
4.917121
4.534573
4.076168
4.236168
3.923607
3.666004
3.284967
2.980621
2.623622
2.882375
3.176416
3.598001
3.764744
3.945428
4.408280
4.359831
4.353650
4.329722
4.294088
4.588631
4.679111
4.182430
4.509125
4.957768
4.657204
4.325313
4.338800
4.720353
4.235756
4.281361
3.795872
4.276734
4.259379
3.999663
3.544163
3.953058
3.844006
3.684740
3.626058
3.457909
3.581150
4.022659
4.021602
4.070183
4.457137
4.156574
4.205304
4.514814
4.055510
3.938217
4.180232
3.803619
3.553781
3.583675
3.708286
4.005810
4.419880
4.881163
5.348149
4.950740
5.199262
4.753162
4.640757
4.327090
4.080888
3.725953
3.939054
3.463728
3.018284
2.661061
3.099980
3.340274
3.230551
3.287873
3.497652
3.014771
3.040046
3.342226
3.656743
3.698527
3.759707
4.253078
4.183611
4.196580
4.257851
4.683387
4.224290
3.840934
4.329286
3.909134
3.685072
3.356611
2.956344
2.800432
2.761665
2.744913
3.037743
2.787390
2.387619
2.424489
2.247564
2.502179
2.022278
2.213027
2.126914
2.264833
2.528391
2.432792
2.037974
1.699475
2.048244
1.640126
1.149858
1.475253
1.245675
0.831979
1.165877
1.403341
1.181921
1.582379
1.632130
2.113636
2.163129
2.545126
2.963833
3.078901
3.055547
3.287442
2.808189
2.985451
3.181679
2.746144
2.517390
2.719231
2.581058
2.838745
2.987765
3.459642
3.458684
3.870956
4.324706
4.411899
4.735330
4.775494
4.681160
4.462470
3.992538
3.719936
3.427081
3.256588
3.462766
3.046353
3.537430
3.579857
3.931223
3.590096
3.136285
3.391616
3.114700
2.897760
2.724241
2.557346
2.971397
2.479290
2.305336
1.852930
1.471948
1.510356
1.633737
1.727873
1.512994
1.603284
1.387950
1.767527
2.029734
2.447309
2.321470
2.435092
2.630118
2.520330
2.578147
2.729630
2.713100
3.107260
2.876659
2.774242
3.185503
3.403148
3.392646
3.123339
3.164713
3.439843
3.321929
3.686229
3.203069
3.185843
3.204924
3.102996
3.496552
3.191575
3.409044
3.888246
4.273767
3.803540
4.046417
4.071581
3.916256
3.634441
4.065834
3.844651
3.915219
is appropriate and valid, with s denoting the standard deviation of the original data.
4-Plot of Data
When the randomness assumption is seriously violated, a time series model may be
appropriate. The lag plot often suggests a reasonable model. For example, in this case the
strongly linear appearance of the lag plot suggests a model fitting Yi versus Yi-1 might be
appropriate. When the data are non-random, it is helpful to supplement the lag plot with an
autocorrelation plot and a spectral plot. Although in this case the lag plot is enough to suggest
an appropriate model, we provide the autocorrelation and spectral plots for comparison.
Autocorrelation When the lag plot indicates significant non-randomness, it can be helpful to follow up with
Plot an autocorrelation plot.
This autocorrelation plot shows significant autocorrelation at lags 1 through 100 in a linearly
decreasing fashion.
Spectral Plot Another useful plot for non-random data is the spectral plot.
Quantitative Although the 4-plot above clearly shows the violation of the assumptions, we supplement the
Output graphical output with some quantitative measures.
Summary As a first step in the analysis, a table of summary statistics is computed from the data. The
Statistics following table, generated by Dataplot, shows a typical set of statistics.
SUMMARY
***********************************************************************
* LOCATION MEASURES * DISPERSION MEASURES *
***********************************************************************
* MIDRANGE = 0.2888407E+01 * RANGE = 0.9053595E+01 *
* MEAN = 0.3216681E+01 * STAND. DEV. = 0.2078675E+01 *
* MIDMEAN = 0.4791331E+01 * AV. AB. DEV. = 0.1660585E+01 *
* MEDIAN = 0.3612030E+01 * MINIMUM = -0.1638390E+01 *
* = * LOWER QUART. = 0.1747245E+01 *
* = * LOWER HINGE = 0.1741042E+01 *
* = * UPPER HINGE = 0.4682273E+01 *
* = * UPPER QUART. = 0.4681717E+01 *
* = * MAXIMUM = 0.7415205E+01 *
***********************************************************************
* RANDOMNESS MEASURES * DISTRIBUTIONAL MEASURES *
***********************************************************************
* AUTOCO COEF = 0.9868608E+00 * ST. 3RD MOM. = -0.4448926E+00 *
* = 0.0000000E+00 * ST. 4TH MOM. = 0.2397789E+01 *
* = 0.0000000E+00 * ST. WILK-SHA = -0.1279870E+02 *
* = * UNIFORM PPCC = 0.9765666E+00 *
* = * NORMAL PPCC = 0.9811183E+00 *
* = * TUK -.5 PPCC = 0.7754489E+00 *
* = * CAUCHY PPCC = 0.4165502E+00 *
***********************************************************************
The value of the autocorrelation statistic, 0.987, is evidence of a very strong autocorrelation.
Location One way to quantify a change in location over time is to fit a straight line to the data set using
the index variable X = 1, 2, ..., N, with N denoting the number of observations. If there is no
significant drift in the location, the slope parameter should be zero. For this data set, Dataplot
generates the following output:
Variation One simple way to detect a change in variation is with a Bartlett test after dividing the data set
into several equal-sized intervals. However, the Bartlett test is not robust for non-normality.
Since we know this data set is not approximated well by the normal distribution, we use the
alternative Levene test. In particular, we use the Levene test based on the median rather than
the mean. The choice of the number of intervals is somewhat arbitrary, although values of 4 or 8
are reasonable. Dataplot generated the following output for the Levene test.
1. STATISTICS
NUMBER OF OBSERVATIONS = 500
NUMBER OF GROUPS = 4
LEVENE F TEST STATISTIC = 10.45940
In this case, the Levene test indicates that the standard deviations are significantly different in
the 4 intervals since the test statistic of 10.46 is greater than the 95% critical value of 2.62.
Therefore we conclude that the scale is not constant.
Randomness Although the lag 1 autocorrelation coefficient above clearly shows the non-randomness, we
show the output from a runs test as well.
RUNS UP
RUNS DOWN
Values in the column labeled "Z" greater than 1.96 or less than -1.96 are statistically
significant at the 5% level. Numerous values in this column are much larger than +/-1.96, so
we conclude that the data are not random.
Distributional Since the quantitative tests show that the assumptions of randomness and constant location and
Assumptions scale are not met, the distributional measures will not be meaningful. Therefore these
quantitative tests are omitted.
The slope parameter, A1, has a t value of 156.4 which is statistically significant. Also, the
residual standard deviation is 0.29. This can be compared to the standard deviation shown in
the summary table, which is 2.08. That is, the fit to the autoregressive model has reduced the
variability by a factor of 7.
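A minimal sketch of fitting this lagged model by least squares in Python (the simulated random walk below is a stand-in for the case-study data):

# Sketch: fit Y[i] = A0 + A1*Y[i-1] + E[i] and examine the slope and residual spread.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = np.cumsum(rng.uniform(-0.5, 0.5, size=500))          # stand-in random walk
res = stats.linregress(y[:-1], y[1:])                     # regress Y[i] on Y[i-1]
resid = y[1:] - (res.intercept + res.slope * y[:-1])
print(res.slope, res.slope / res.stderr)                  # A1 estimate and its t value
print(resid.std(ddof=2))                                  # residual standard deviation of the fit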
Time This model is an example of a time series model. More extensive discussion of time series is
Series given in the Process Monitoring chapter.
Model
Test In addition to the plot of the predicted values, the residual standard
Underlying deviation from the fit also indicates a significant improvement for the
Assumptions model. The next step is to validate the underlying assumptions for the
on the error component, or residuals, from this model.
Residuals
4-Plot of Residuals
Uniform Probability Plot of Residuals
Since the uniform probability plot is nearly linear, this verifies that a
uniform distribution is a good model for the error component.
Conclusions Since the residuals from our model satisfy the underlying assumptions,
we conclude that
Yi = A0 + A1*Yi-1 + Ei
where the Ei follow a uniform distribution is a good model for this data
set. We could simplify this model to
Yi = Yi-1 + Ei
This has the advantage of simplicity (the current point is simply the
previous point plus a uniformly distributed error term).
Using In this case, the above model makes sense based on our definition of
Scientific and the random walk. That is, a random walk is the cumulative sum of
Engineering uniformly distributed data points. It makes sense that modeling the
Knowledge current point as the previous point plus a uniformly distributed error
term is about as good as we can do. Although this case is a bit artificial
in that we knew how the data were constructed, it is common and
desirable to use scientific and engineering knowledge of the process
that generated the data in formulating and testing models for the data.
Quite often, several competing models will produce nearly equivalent
mathematical results. In this case, selecting the model that best
approximates the scientific understanding of the process is a reasonable
choice.
Time Series This model is an example of a time series model. More extensive
Model discussion of time series is given in the Process Monitoring chapter.
Click on the links below to start Dataplot and run this case study yourself. Each step may use
results from previous steps, so please be patient. Wait until the software verifies that the
current step is complete before clicking on the next step.
The links in this column will connect you with more detailed information about each analysis
step from the case study description.
2. Validate assumptions.
4. Fit Yi = A0 + A1*Yi-1 + Ei
and validate.
2. Plot fitted line with original data. 2. The plot of the predicted values with
the original data indicates a good fit.
3. Generate a 4-plot of the residuals 3. The 4-plot indicates that the assumptions
from the fit. of constant location and scale are valid.
The lag plot indicates that the data are
random. However, the histogram and normal
probability plot indicate that the uniform
distribution might be a better model for
the residuals than the normal
distribution.
Motivation The motivation for studying this data set is to illustrate the case where
there is discreteness in the measurements, but the underlying
assumptions hold. In this case, the discreteness is due to the data being
integers.
Resulting The following are the data used for this case study.
Data
2899 2898 2898 2900 2898
2901 2899 2901 2900 2898
2898 2898 2898 2900 2898
2897 2899 2897 2899 2899
2900 2897 2900 2900 2899
2898 2898 2899 2899 2899
2899 2899 2898 2899 2899
2899 2902 2899 2900 2898
2899 2899 2899 2899 2899
2899 2900 2899 2900 2898
2901 2900 2899 2899 2899
2899 2899 2900 2899 2898
2898 2898 2900 2896 2897
4-Plot of Data
Run Sequence Plot
Lag Plot
Histogram (with overlaid Normal PDF)
Normal Probability Plot
SUMMARY
***********************************************************************
* LOCATION MEASURES * DISPERSION MEASURES *
***********************************************************************
* MIDRANGE = 0.2899000E+04 * RANGE = 0.6000000E+01 *
* MEAN = 0.2898721E+04 * STAND. DEV. = 0.1235377E+01 *
* MIDMEAN = 0.2898457E+04 * AV. AB. DEV. = 0.9642857E+00 *
* MEDIAN = 0.2899000E+04 * MINIMUM = 0.2896000E+04 *
* = * LOWER QUART. = 0.2898000E+04 *
* = * LOWER HINGE = 0.2898000E+04 *
* = * UPPER HINGE = 0.2899500E+04 *
* = * UPPER QUART. = 0.2899250E+04 *
* = * MAXIMUM = 0.2902000E+04 *
***********************************************************************
* RANDOMNESS MEASURES * DISTRIBUTIONAL MEASURES *
***********************************************************************
* AUTOCO COEF = 0.2925397E+00 * ST. 3RD MOM. = 0.1271097E+00 *
* = 0.0000000E+00 * ST. 4TH MOM. = 0.2571418E+01 *
* = 0.0000000E+00 * ST. WILK-SHA = -0.3911592E+01 *
* = * UNIFORM PPCC = 0.9580541E+00 *
* = * NORMAL PPCC = 0.9701443E+00 *
* = * TUK -.5 PPCC = 0.8550686E+00 *
* = * CAUCHY PPCC = 0.6239791E+00 *
***********************************************************************
Location One way to quantify a change in location over time is to fit a straight line to the data set using
the index variable X = 1, 2, ..., N, with N denoting the number of observations. If there is no
significant drift in the location, the slope parameter should be zero. For this data set, Dataplot
generates the following output:
Variation One simple way to detect a change in variation is with a Bartlett test after dividing the data set
into several equal-sized intervals. However, the Bartlett test is not robust for non-normality.
Since the nature of the data (a few distinct points repeated many times) makes the normality
assumption questionable, we use the alternative Levene test. In particular, we use the Levene
test based on the median rather than the mean. The choice of the number of intervals is somewhat
arbitrary, although values of 4 or 8 are reasonable. Dataplot generated the following output for
the Levene test.
1. STATISTICS
NUMBER OF OBSERVATIONS = 140
NUMBER OF GROUPS = 4
LEVENE F TEST STATISTIC = 0.4128718
Randomness There are many ways in which data can be non-random. However, most common forms of
non-randomness can be detected with a few simple tests. The lag plot in the previous section is
a simple graphical technique.
Another check is an autocorrelation plot that shows the autocorrelations for various lags.
Confidence bands can be plotted at the 95% and 99% confidence levels. Points outside this
band indicate statistically significant values (lag 0 is always 1). Dataplot generated the
following autocorrelation plot.
The lag 1 autocorrelation, which is generally the one of most interest, is 0.29. The critical
values at the 5% level of significance are -0.087 and 0.087. This indicates that the lag 1
autocorrelation is statistically significant, so there is some evidence for non-randomness.
A common test for randomness is the runs test.
RUNS UP
RUNS DOWN
Distributional Probability plots are a graphical test for assessing if a particular distribution provides an
Analysis adequate fit to a data set.
A quantitative enhancement to the probability plot is the correlation coefficient of the points on
the probability plot. For this data set the correlation coefficient is 0.970. Since this is less than
the critical value of 0.987 (this is a tabulated value), the normality assumption is rejected.
Chi-square and Kolmogorov-Smirnov goodness-of-fit tests are alternative methods for
assessing distributional adequacy. The Wilk-Shapiro and Anderson-Darling tests can be used to
test for normality. Dataplot generates the following output for the Anderson-Darling normality
test.
1. STATISTICS:
NUMBER OF OBSERVATIONS = 140
MEAN = 2898.721
STANDARD DEVIATION = 1.235377
Outlier A test for outliers is the Grubbs test. Dataplot generated the following output for Grubbs' test.
Analysis
GRUBBS TEST FOR OUTLIERS
(ASSUMPTION: NORMALITY)
1. STATISTICS:
NUMBER OF OBSERVATIONS = 140
MINIMUM = 2896.000
MEAN = 2898.721
MAXIMUM = 2902.000
STANDARD DEVIATION = 1.235377
Model Although the randomness and normality assumptions were mildly violated, we conclude that a
reasonable model for the data is:
Yi = 2898.7 + Ei
Univariate It is sometimes useful and convenient to summarize the above results in a report.
Report
Analysis for Josephson Junction Cryothermometry Data
2: Location
Mean = 2898.722
Standard Deviation of Mean = 0.104409
95% Confidence Interval for Mean = (2898.515,2898.928)
Drift with respect to location? = YES
(Further analysis indicates that
the drift, while statistically
significant, is not practically
significant)
3: Variation
Standard Deviation = 1.235377
95% Confidence Interval for SD = (1.105655,1.399859)
Drift with respect to variation?
(based on Levene's test on quarters
of the data) = NO
4: Distribution
Normal PPCC = 0.970145
Data are Normal?
(as measured by Normal PPCC) = NO
5: Randomness
Autocorrelation = 0.29254
Data are Random?
(as measured by autocorrelation) = NO
6: Statistical Control
(i.e., no drift in location or scale,
data are random, distribution is
fixed, here we are testing only for
fixed normal)
Data Set is in Statistical Control? = NO
7: Outliers?
(as determined by Grubbs test) = NO
Click on the links below to start Dataplot and run this case study yourself. Each step may use
results from previous steps, so please be patient. Wait until the software verifies that the
current step is complete before clicking on the next step. The links in this column will connect
you with more detailed information about each analysis step from the case study description.
1. Generate a run sequence plot. 1. The run sequence plot indicates that
there are no shifts of location or
scale.
2. Generate a lag plot. 2. The lag plot does not indicate any
significant patterns (which would
show the data were not random).
5. Check for normality by computing the 5. The normal probability plot correlation
normal probability plot correlation coefficient is 0.970. At the 5% level,
coefficient. we reject the normality assumption.
6. Check for outliers using Grubbs' test. 6. Grubbs' test detects no outliers at the
5% level.
Resulting The following are the data used for this case study.
Data
-213
-564
-35
-15
141
115
-420
-360
203
-338
-431
194
-220
-513
154
-125
-559
92
-21
-579
-52
99
-543
-175
162
-457
-346
204
-300
-474
164
-107
-572
-8
83
-541
-224
180
-420
-374
201
-236
-531
83
27
-564
-112
131
-507
-254
199
-311
-495
143
-46
-579
-90
136
-472
-338
202
-287
-477
169
-124
-568
17
48
-568
-135
162
-430
-422
172
-74
-577
-13
92
-534
-243
194
-355
-465
156
-81
-578
-64
139
-449
-384
193
-198
-538
110
-44
-577
-6
66
-552
-164
161
-460
-344
205
-281
-504
134
-28
-576
-118
156
-437
-381
200
-220
-540
83
11
-568
-160
172
-414
-408
188
-125
-572
-32
139
-492
-321
205
-262
-504
142
-83
-574
0
48
-571
-106
137
-501
-266
190
-391
-406
194
-186
-553
83
-13
-577
-49
103
-515
-280
201
300
-506
131
-45
-578
-80
138
-462
-361
201
-211
-554
32
74
-533
-235
187
-372
-442
182
-147
-566
25
68
-535
-244
194
-351
-463
174
-125
-570
15
72
-550
-190
172
-424
-385
198
-218
-536
96
is appropriate and valid where s is the standard deviation of the original data.
4-Plot of Data
is not appropriate.
We need to develop a better model. Non-random data can frequently be modeled using time
series methodology. Specifically, the circular pattern in the lag plot indicates that a sinusoidal
model might be appropriate. The sinusoidal model will be developed in the next section.
Individual The plots can be generated individually for more detail. In this case, only the run sequence plot
Plots and the lag plot are drawn since the distributional plots are not meaningful.
Run Sequence
Plot
Lag Plot
We have drawn some lines and boxes on the plot to better isolate the outliers. The following
output helps identify the points that are generating the outliers on the lag plot.
****************************************************
** print y index xplot yplot subset yplot > 250 **
****************************************************
****************************************************
** print y index xplot yplot subset xplot > 250 **
****************************************************
********************************************************
** print y index xplot yplot subset yplot -100 to 0
subset xplot -100 to 0 **
********************************************************
*********************************************************
** print y index xplot yplot subset yplot 100 to 200
subset xplot 100 to 200 **
*********************************************************
That is, the third, fifth, and 158th points appear to be outliers.
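A rough Python analogue of these subset commands, under the assumption that xplot and yplot hold the lag-plot coordinates and index holds the observation number, is:

import numpy as np

# Synthetic stand-in series; replace with the actual deflection data.
rng = np.random.default_rng(0)
y = rng.normal(-177.0, 280.0, 200)

index = np.arange(1, len(y) + 1)
xplot, yplot = y[:-1], y[1:]          # lag-plot coordinates (Y[i-1], Y[i])
idx = index[1:]

# Analogue of "print y index xplot yplot subset yplot > 250":
mask = yplot > 250
for i, xv, yv in zip(idx[mask], xplot[mask], yplot[mask]):
    print(i, xv, yv)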
Autocorrelation When the lag plot indicates significant non-randomness, it can be helpful to follow up with an
Plot autocorrelation plot.
This autocorrelation plot shows a distinct cyclic pattern. As with the lag plot, this suggests a
sinusoidal model.
Spectral Plot Another useful plot for non-random data is the spectral plot.
This spectral plot shows a single dominant peak at a frequency of 0.3. This frequency of 0.3
will be used in fitting the sinusoidal model in the next section.
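A quick way to locate the dominant frequency in Python is the periodogram. This sketch uses a synthetic series with a frequency near 0.3 in place of the case study data.

import numpy as np
from scipy import signal

# Synthetic stand-in series with a dominant frequency near 0.3 cycles per
# observation; replace with the actual deflection data.
rng = np.random.default_rng(0)
t = np.arange(200, dtype=float)
y = -177.44 + 390.0 * np.sin(2.0 * np.pi * 0.3 * t) + rng.normal(0.0, 150.0, t.size)

# With fs=1 the frequency axis is in cycles per observation, which is the
# scale used for the sinusoidal model.
freqs, power = signal.periodogram(y, fs=1.0)
print("dominant frequency:", freqs[np.argmax(power)])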
Quantitative Although the lag plot, autocorrelation plot, and spectral plot clearly show the violation of the
Output randomness assumption, we supplement the graphical output with some quantitative measures.
Summary As a first step in the analysis, a table of summary statistics is computed from the data. The
Statistics following table, generated by Dataplot, shows a typical set of statistics.
SUMMARY
***********************************************************************
* LOCATION MEASURES * DISPERSION MEASURES *
***********************************************************************
* MIDRANGE = -0.1395000E+03 * RANGE = 0.8790000E+03 *
* MEAN = -0.1774350E+03 * STAND. DEV. = 0.2773322E+03 *
* MIDMEAN = -0.1797600E+03 * AV. AB. DEV. = 0.2492250E+03 *
* MEDIAN = -0.1620000E+03 * MINIMUM = -0.5790000E+03 *
* = * LOWER QUART. = -0.4510000E+03 *
* = * LOWER HINGE = -0.4530000E+03 *
* = * UPPER HINGE = 0.9400000E+02 *
* = * UPPER QUART. = 0.9300000E+02 *
* = * MAXIMUM = 0.3000000E+03 *
***********************************************************************
* RANDOMNESS MEASURES * DISTRIBUTIONAL MEASURES *
***********************************************************************
* AUTOCO COEF = -0.3073048E+00 * ST. 3RD MOM. = -0.5010057E-01 *
* = 0.0000000E+00 * ST. 4TH MOM. = 0.1503684E+01 *
* = 0.0000000E+00 * ST. WILK-SHA = -0.1883372E+02 *
* = * UNIFORM PPCC = 0.9925535E+00 *
* = * NORMAL PPCC = 0.9540811E+00 *
* = * TUK -.5 PPCC = 0.7313794E+00 *
* = * CAUCHY PPCC = 0.4408355E+00 *
***********************************************************************
Location One way to quantify a change in location over time is to fit a straight line to the data set using
the index variable X = 1, 2, ..., N, with N denoting the number of observations. If there is no
significant drift in the location, the slope parameter should be zero. For this data set, Dataplot
generates the following output:
Variation One simple way to detect a change in variation is with a Bartlett test after dividing the data set
into several equal-sized intervals. However, the Bartlett test is not robust for non-normality.
Since the non-randomness of this data does not allow us to assume normality, we use the
alternative Levene test. In particular, we use the Levene test based on the median rather than
the mean. The choice of the number of intervals is
somewhat arbitrary, although values of 4 or 8 are reasonable. Dataplot generated the following
output for the Levene test.
1. STATISTICS
NUMBER OF OBSERVATIONS = 200
NUMBER OF GROUPS = 4
LEVENE F TEST STATISTIC = 0.9378599E-01
RUNS UP
RUNS DOWN
Values in the column labeled "Z" greater than 1.96 or less than -1.96 are statistically significant
at the 5% level. Numerous values in this column are much larger than +/-1.96, so we conclude
that the data are not random.
Distributional Since the quantitative tests show that the assumptions of constant scale and randomness are
Assumptions not met, the distributional measures will not be meaningful. Therefore these quantitative tests
are omitted.
To obtain a good fit, sinusoidal models require good starting values for C, the
amplitude, and the frequency.
Good Starting A good starting value for C can be obtained by calculating the mean of the data.
Value for C If the data show a trend, i.e., the assumption of constant location is violated, we
can replace C with a linear or quadratic least squares fit. That is, the model
becomes
or
Since our data did not have any meaningful change of location, we can fit the
simpler model with C equal to the mean. From the summary output in the
previous page, the mean is -177.44.
Good Starting The starting value for the frequency can be obtained from the spectral plot,
Value for which shows the dominant frequency is about 0.3.
Frequency
Complex The complex demodulation phase plot can be used to refine this initial estimate
Demodulation for the frequency.
Phase Plot
For the complex demodulation plot, if the lines slope from left to right, the
frequency should be increased. If the lines slope from right to left, it should be
decreased. A relatively flat (i.e., horizontal) slope indicates a good frequency.
We could generate the demodulation phase plot for 0.3 and then use trial and
error to obtain a better estimate for the frequency. To simplify this, we generate
16 of these plots on a single page starting with a frequency of 0.28, increasing in
increments of 0.0025, and stopping at 0.3175.
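The phase trace itself can be sketched in Python under simple assumptions: demodulate at the trial frequency and smooth with a plain moving average. Dataplot's exact filtering may differ, and the synthetic series below only stands in for the real data.

import numpy as np

# Synthetic stand-in series; replace with the actual deflection data.
rng = np.random.default_rng(0)
t = np.arange(200, dtype=float)
y = -177.44 + 390.0 * np.sin(2.0 * np.pi * 0.3025 * t + 1.0) + rng.normal(0.0, 150.0, t.size)

def demodulated_phase(y, t, freq, width=21):
    # Multiply by a complex exponential at the trial frequency and smooth
    # with a simple moving average; the angle of the smoothed product is
    # the demodulated phase.
    z = y * np.exp(-2j * np.pi * freq * t)
    smoothed = np.convolve(z, np.ones(width) / width, mode="valid")
    return np.angle(smoothed)

# A nearly flat phase trace indicates the trial frequency is close to the
# true frequency; a consistent slope means it should be adjusted.
phase = demodulated_phase(y, t, freq=0.30)
print(phase[:5])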
Interpretation The plots start with lines sloping from left to right but gradually change to a right
to left slope. The relatively flat slope occurs for frequency 0.3025 (third row,
second column). The complex demodulation phase plot restricts the phase to the range -pi to pi.
This is why the plot appears to show some breaks.
Good Starting The complex demodulation amplitude plot is used to find a good starting value
Values for for the amplitude. In addition, this plot indicates whether or not the amplitude is
Amplitude constant over the entire range of the data or if it varies. If the plot is essentially
flat, i.e., zero slope, then it is reasonable to assume a constant amplitude in the
non-linear model. However, if the slope varies over the range of the plot, we
may need to adjust the model to be:
That is, we replace the constant amplitude with a function of time. A linear fit is specified in the
model above, but this can be replaced with a more elaborate function if needed.
Complex
Demodulation
Amplitude
Plot
The complex demodulation amplitude plot for this data shows that:
1. The amplitude is fixed at approximately 390.
2. There is a short start-up effect.
3. There is a change in amplitude at around x=160 that should be
investigated for an outlier.
In terms of a non-linear model, the plot indicates that fitting a single constant for the amplitude
should be adequate for this data set.
Fit Output Using starting estimates of 0.3025 for the frequency, 390 for the amplitude, and
-177.44 for C, Dataplot generated the following output for the fit.
ITERATION CONVERGENCE
RESIDUAL * PARAMETER
NUMBER MEASURE
STANDARD * ESTIMATES
DEVIATION *
----------------------------------*-----------
1-- 0.10000E-01 0.52903E+03 *-0.17743E+03 0.39000E+03 0.30250E+00 0.10000E+01
2-- 0.50000E-02 0.22218E+03 *-0.17876E+03-0.33137E+03 0.30238E+00 0.71471E+00
3-- 0.25000E-02 0.15634E+03 *-0.17886E+03-0.24523E+03 0.30233E+00 0.14022E+01
4-- 0.96108E-01 0.15585E+03 *-0.17879E+03-0.36177E+03 0.30260E+00 0.14654E+01
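A comparable non-linear fit can be sketched in Python with scipy.optimize.curve_fit, using the starting values discussed above (the phase start of 1.0 matches the fourth starting parameter in the iteration table; the synthetic series below merely stands in for the real deflection data):

import numpy as np
from scipy.optimize import curve_fit

# Synthetic stand-in series; replace with the actual deflection data.
rng = np.random.default_rng(0)
t = np.arange(1.0, 201.0)
y = -177.44 + 390.0 * np.sin(2.0 * np.pi * 0.3025 * t + 1.0) + rng.normal(0.0, 150.0, t.size)

def model(t, c, alpha, freq, phi):
    # Yi = C + ALPHA*SIN(2*PI*FREQ*ti + PHI)
    return c + alpha * np.sin(2.0 * np.pi * freq * t + phi)

p0 = [-177.44, 390.0, 0.3025, 1.0]     # starting values from the discussion above
params, cov = curve_fit(model, t, y, p0=p0)
residuals = y - model(t, *params)
print("parameter estimates        :", params)
print("residual standard deviation:", residuals.std(ddof=len(p0)))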
Fit Output Dataplot generated the following fit output after removing 3 outliers.
with Outliers
Removed
ITERATION CONVERGENCE
RESIDUAL * PARAMETER
NUMBER MEASURE
STANDARD * ESTIMATES
DEVIATION *
----------------------------------*-----------
1-- 0.10000E-01 0.14834E+03 *-0.17879E+03-0.36177E+03 0.30260E+00 0.14654E+01
New The original fit, with a residual standard deviation of 155.84, was:
Fit to
Edited
Data The new fit, with a residual standard deviation of 148.34, is:
4-Plot
for
New
Fit
This plot shows that the underlying assumptions are satisfied and therefore the
new fit is a good descriptor of the data.
In this case, it is a judgment call whether to use the fit with or without the
outliers removed.
Click on the links below to start Dataplot and run this case study yourself. Each step may use
results from previous steps, so please be patient. Wait until the software verifies that the
current step is complete before clicking on the next step. The links in this column will connect
you with more detailed information about each analysis step from the case study description.
2. Validate assumptions.
3. Fit
Yi = C + A*SIN(2*PI*omega*ti+phi).
1. Generate a complex demodulation phase plot. 1. The complex demodulation phase plot
indicates a starting frequency of 0.3025.
4. Validate fit.
1. Generate a 4-plot of the residuals from the fit. 1. The 4-plot indicates that the assumptions
of constant location and scale are valid. The lag plot indicates that the data are random.
The histogram and normal probability plot indicate that the normality assumption for the
residuals is not seriously violated, although there is a bend on the probability plot that
warrants attention.
Resulting The following are the data used for this case study.
Data
2.00180
2.00170
2.00180
2.00190
2.00180
2.00170
2.00150
2.00140
2.00150
2.00150
2.00170
2.00180
2.00180
2.00190
2.00190
2.00210
2.00200
2.00160
2.00140
2.00130
2.00130
2.00150
2.00150
2.00160
2.00150
2.00140
2.00130
2.00140
2.00150
2.00140
2.00150
2.00160
2.00150
2.00160
2.00190
2.00200
2.00200
2.00210
2.00220
2.00230
2.00240
2.00250
2.00270
2.00260
2.00260
2.00260
2.00270
2.00260
2.00250
2.00240
4-Plot of
Data
is not valid. Given the linear appearance of the lag plot, the first step
might be to consider a model of the type
Run
Sequence
Plot
Lag Plot
SUMMARY
NUMBER OF OBSERVATIONS = 50
***********************************************************************
* LOCATION MEASURES * DISPERSION MEASURES *
***********************************************************************
* MIDRANGE = 0.2002000E+01 * RANGE = 0.1399994E-02 *
* MEAN = 0.2001856E+01 * STAND. DEV. = 0.4291329E-03 *
* MIDMEAN = 0.2001638E+01 * AV. AB. DEV. = 0.3480196E-03 *
* MEDIAN = 0.2001800E+01 * MINIMUM = 0.2001300E+01 *
* = * LOWER QUART. = 0.2001500E+01 *
* = * LOWER HINGE = 0.2001500E+01 *
* = * UPPER HINGE = 0.2002100E+01 *
* = * UPPER QUART. = 0.2002175E+01 *
* = * MAXIMUM = 0.2002700E+01 *
***********************************************************************
* RANDOMNESS MEASURES * DISTRIBUTIONAL MEASURES *
***********************************************************************
* AUTOCO COEF = 0.9379919E+00 * ST. 3RD MOM. = 0.6191616E+00 *
* = 0.0000000E+00 * ST. 4TH MOM. = 0.2098746E+01 *
* = 0.0000000E+00 * ST. WILK-SHA = -0.4995516E+01 *
* = * UNIFORM PPCC = 0.9666610E+00 *
* = * NORMAL PPCC = 0.9558001E+00 *
* = * TUK -.5 PPCC = 0.8462552E+00 *
* = * CAUCHY PPCC = 0.6822084E+00 *
***********************************************************************
Location One way to quantify a change in location over time is to fit a straight line to the data set using
the index variable X = 1, 2, ..., N, with N denoting the number of observations. If there is no
significant drift in the location, the slope parameter should be zero. For this data set, Dataplot
generates the following output:
Variation One simple way to detect a change in variation is with a Bartlett test after dividing the data set
into several equal sized intervals. However, the Bartlett test is not robust for non-normality.
Since the normality assumption is questionable for these data, we use the alternative Levene
test. In particular, we use the Levene test based on the median rather than the mean. The choice of
the number of intervals is somewhat arbitrary, although values of 4 or 8 are reasonable.
Dataplot generated the following output for the Levene test.
1. STATISTICS
NUMBER OF OBSERVATIONS = 50
NUMBER OF GROUPS = 4
LEVENE F TEST STATISTIC = 0.9714893
Randomness There are many ways in which data can be non-random. However, most common forms of
non-randomness can be detected with a few simple tests. The lag plot in the 4-plot in the
previous section is a simple graphical technique.
One check is an autocorrelation plot that shows the autocorrelations for various lags.
Confidence bands can be plotted at the 95% and 99% confidence levels. Points outside this
band indicate statistically significant values (lag 0 is always 1). Dataplot generated the
following autocorrelation plot.
The lag 1 autocorrelation, which is generally the one of most interest, is 0.93. The critical
values at the 5% level are -0.277 and 0.277. This indicates that the lag 1 autocorrelation is
statistically significant, so there is strong evidence of non-randomness.
A common test for randomness is the runs test.
RUNS UP
RUNS DOWN
Values in the column labeled "Z" greater than 1.96 or less than -1.96 are statistically significant
at the 5% level. Due to the number of values that are much larger than the 1.96 cut-off, we
conclude that the data are not random.
Distributional Since we rejected the randomness assumption, the distributional tests are not meaningful.
Analysis Therefore, these quantitative tests are omitted. We also omit Grubbs' outlier test since it also
assumes the data are approximately normally distributed.
Univariate It is sometimes useful and convenient to summarize the above results in a report.
Report
1: Sample Size = 50
2: Location
Mean = 2.001857
Standard Deviation of Mean = 0.00006
95% Confidence Interval for Mean = (2.001735,2.001979)
Drift with respect to location? = NO
3: Variation
Standard Deviation = 0.00043
95% Confidence Interval for SD = (0.000359,0.000535)
Change in variation?
(based on Levene's test on quarters
of the data) = NO
5: Randomness
Lag One Autocorrelation = 0.937998
Data are Random?
(as measured by autocorrelation) = NO
6: Statistical Control
(i.e., no drift in location or scale,
data are random, distribution is
fixed, here we are testing only for
normal)
Data Set is in Statistical Control? = NO
7: Outliers?
(Grubbs' test omitted) = NO
Click on the links below to start Dataplot and run this case study yourself. Each step may use
results from previous steps, so please be patient. Wait until the software verifies that the
current step is complete before clicking on the next step. The links in this column will connect
you with more detailed information about each analysis step from the case study description.
1. Generate a run sequence plot. 1. The run sequence plot indicates that
there is a shift in location.
2. Compute a linear fit based on 2. The linear fit indicates a slight drift in
quarters of the data to detect location since the slope parameter is
drift in location. statistically significant, but small.
Resulting The following are the data used for this case study.
Data
27.8680
27.8929
27.8773
27.8530
27.8876
27.8725
27.8743
27.8879
27.8728
27.8746
27.8863
27.8716
27.8818
27.8872
27.8885
27.8945
27.8797
27.8627
27.8870
27.8895
27.9138
27.8931
27.8852
27.8788
27.8827
27.8939
27.8558
27.8814
27.8479
27.8479
27.8848
27.8809
27.8479
27.8611
27.8630
27.8679
27.8637
27.8985
27.8900
27.8577
27.8848
27.8869
27.8976
27.8610
27.8567
27.8417
27.8280
27.8555
27.8639
27.8702
27.8582
27.8605
27.8900
27.8758
27.8774
27.9008
27.8988
27.8897
27.8990
27.8958
27.8830
27.8967
27.9105
27.9028
27.8977
27.8953
27.8970
27.9190
27.9180
27.8997
27.9204
27.9234
27.9072
27.9152
27.9091
27.8882
27.9035
27.9267
27.9138
27.8955
27.9203
27.9239
27.9199
27.9646
27.9411
27.9345
27.8712
27.9145
27.9259
27.9317
27.9239
27.9247
27.9150
27.9444
27.9457
27.9166
27.9066
27.9088
27.9255
27.9312
27.9439
27.9210
27.9102
27.9083
27.9121
27.9113
27.9091
27.9235
27.9291
27.9253
27.9092
27.9117
27.9194
27.9039
27.9515
27.9143
27.9124
27.9128
27.9260
27.9339
27.9500
27.9530
27.9430
27.9400
27.8850
27.9350
27.9120
27.9260
27.9660
27.9280
27.9450
27.9390
27.9429
27.9207
27.9205
27.9204
27.9198
27.9246
27.9366
27.9234
27.9125
27.9032
27.9285
27.9561
27.9616
27.9530
27.9280
27.9060
27.9380
27.9310
27.9347
27.9339
27.9410
27.9397
27.9472
27.9235
27.9315
27.9368
27.9403
27.9529
27.9263
27.9347
27.9371
27.9129
27.9549
27.9422
27.9423
27.9750
27.9339
27.9629
27.9587
27.9503
27.9573
27.9518
27.9527
27.9589
27.9300
27.9629
27.9630
27.9660
27.9730
27.9660
27.9630
27.9570
27.9650
27.9520
27.9820
27.9560
27.9670
27.9520
27.9470
27.9720
27.9610
27.9437
27.9660
27.9580
27.9660
27.9700
27.9600
27.9660
27.9770
27.9110
27.9690
27.9698
27.9616
27.9371
27.9700
27.9265
27.9964
27.9842
27.9667
27.9610
27.9943
27.9616
27.9397
27.9799
28.0086
27.9709
27.9741
27.9675
27.9826
27.9676
27.9703
27.9789
27.9786
27.9722
27.9831
28.0043
27.9548
27.9875
27.9495
27.9549
27.9469
27.9744
27.9744
27.9449
27.9837
27.9585
28.0096
27.9762
27.9641
27.9854
27.9877
27.9839
27.9817
27.9845
27.9877
27.9880
27.9822
27.9836
28.0030
27.9678
28.0146
27.9945
27.9805
27.9785
27.9791
27.9817
27.9805
27.9782
27.9753
27.9792
27.9704
27.9794
27.9814
27.9794
27.9795
27.9881
27.9772
27.9796
27.9736
27.9772
27.9960
27.9795
27.9779
27.9829
27.9829
27.9815
27.9811
27.9773
27.9778
27.9724
27.9756
27.9699
27.9724
27.9666
27.9666
27.9739
27.9684
27.9861
27.9901
27.9879
27.9865
27.9876
27.9814
27.9842
27.9868
27.9834
27.9892
27.9864
27.9843
27.9838
27.9847
27.9860
27.9872
27.9869
27.9602
27.9852
27.9860
27.9836
27.9813
27.9623
27.9843
27.9802
27.9863
27.9813
27.9881
27.9850
27.9850
27.9830
27.9866
27.9888
27.9841
27.9863
27.9903
27.9961
27.9905
27.9945
27.9878
27.9929
27.9914
27.9914
27.9997
28.0006
27.9999
28.0004
28.0020
28.0029
28.0008
28.0040
28.0078
28.0065
27.9959
28.0073
28.0017
28.0042
28.0036
28.0055
28.0007
28.0066
28.0011
27.9960
28.0083
27.9978
28.0108
28.0088
28.0088
28.0139
28.0092
28.0092
28.0049
28.0111
28.0120
28.0093
28.0116
28.0102
28.0139
28.0113
28.0158
28.0156
28.0137
28.0236
28.0171
28.0224
28.0184
28.0199
28.0190
28.0204
28.0170
28.0183
28.0201
28.0182
28.0183
28.0175
28.0127
28.0211
28.0057
28.0180
28.0183
28.0149
28.0185
28.0182
28.0192
28.0213
28.0216
28.0169
28.0162
28.0167
28.0167
28.0169
28.0169
28.0161
28.0152
28.0179
28.0215
28.0194
28.0115
28.0174
28.0178
28.0202
28.0240
28.0198
28.0194
28.0171
28.0134
28.0121
28.0121
28.0141
28.0101
28.0114
28.0122
28.0124
28.0171
28.0165
28.0166
28.0159
28.0181
28.0200
28.0116
28.0144
28.0141
28.0116
28.0107
28.0169
28.0105
28.0136
28.0138
28.0114
28.0122
28.0122
28.0116
28.0025
28.0097
28.0066
28.0072
28.0066
28.0068
28.0067
28.0130
28.0091
28.0088
28.0091
28.0091
28.0115
28.0087
28.0128
28.0139
28.0095
28.0115
28.0101
28.0121
28.0114
28.0121
28.0122
28.0121
28.0168
28.0212
28.0219
28.0221
28.0204
28.0169
28.0141
28.0142
28.0147
28.0159
28.0165
28.0144
28.0182
28.0155
28.0155
28.0192
28.0204
28.0185
28.0248
28.0185
28.0226
28.0271
28.0290
28.0240
28.0302
28.0243
28.0288
28.0287
28.0301
28.0273
28.0313
28.0293
28.0300
28.0344
28.0308
28.0291
28.0287
28.0358
28.0309
28.0286
28.0308
28.0291
28.0380
28.0411
28.0420
28.0359
28.0368
28.0327
28.0361
28.0334
28.0300
28.0347
28.0359
28.0344
28.0370
28.0355
28.0371
28.0318
28.0390
28.0390
28.0390
28.0376
28.0376
28.0377
28.0345
28.0333
28.0429
28.0379
28.0401
28.0401
28.0423
28.0393
28.0382
28.0424
28.0386
28.0386
28.0373
28.0397
28.0412
28.0565
28.0419
28.0456
28.0426
28.0423
28.0391
28.0403
28.0388
28.0408
28.0457
28.0455
28.0460
28.0456
28.0464
28.0442
28.0416
28.0451
28.0432
28.0434
28.0448
28.0448
28.0373
28.0429
28.0392
28.0469
28.0443
28.0356
28.0474
28.0446
28.0348
28.0368
28.0418
28.0445
28.0533
28.0439
28.0474
28.0435
28.0419
28.0538
28.0538
28.0463
28.0491
28.0441
28.0411
28.0507
28.0459
28.0519
28.0554
28.0512
28.0507
28.0582
28.0471
28.0539
28.0530
28.0502
28.0422
28.0431
28.0395
28.0177
28.0425
28.0484
28.0693
28.0490
28.0453
28.0494
28.0522
28.0393
28.0443
28.0465
28.0450
28.0539
28.0566
28.0585
28.0486
28.0427
28.0548
28.0616
28.0298
28.0726
28.0695
28.0629
28.0503
28.0493
28.0537
28.0613
28.0643
28.0678
28.0564
28.0703
28.0647
28.0579
28.0630
28.0716
28.0586
28.0607
28.0601
28.0611
28.0606
28.0611
28.0066
28.0412
28.0558
28.0590
28.0750
28.0483
28.0599
28.0490
28.0499
28.0565
28.0612
28.0634
28.0627
28.0519
28.0551
28.0696
28.0581
28.0568
28.0572
28.0529
28.0421
28.0432
28.0211
28.0363
28.0436
28.0619
28.0573
28.0499
28.0340
28.0474
28.0534
28.0589
28.0466
28.0448
28.0576
28.0558
28.0522
28.0480
28.0444
28.0429
28.0624
28.0610
28.0461
28.0564
28.0734
28.0565
28.0503
28.0581
28.0519
28.0625
28.0583
28.0645
28.0642
28.0535
28.0510
28.0542
28.0677
28.0416
28.0676
28.0596
28.0635
28.0558
28.0623
28.0718
28.0585
28.0552
28.0684
28.0646
28.0590
28.0465
28.0594
28.0303
28.0533
28.0561
28.0585
28.0497
28.0582
28.0507
28.0562
28.0715
28.0468
28.0411
28.0587
28.0456
28.0705
28.0534
28.0558
28.0536
28.0552
28.0461
28.0598
28.0598
28.0650
28.0423
28.0442
28.0449
28.0660
28.0506
28.0655
28.0512
28.0407
28.0475
28.0411
28.0512
28.1036
28.0641
28.0572
28.0700
28.0577
28.0637
28.0534
28.0461
28.0701
28.0631
28.0575
28.0444
28.0592
28.0684
28.0593
28.0677
28.0512
28.0644
28.0660
28.0542
28.0768
28.0515
28.0579
28.0538
28.0526
28.0833
28.0637
28.0529
28.0535
28.0561
28.0736
28.0635
28.0600
28.0520
28.0695
28.0608
28.0608
28.0590
28.0290
28.0939
28.0618
28.0551
28.0757
28.0698
28.0717
28.0529
28.0644
28.0613
28.0759
28.0745
28.0736
28.0611
28.0732
28.0782
28.0682
28.0756
28.0857
28.0739
28.0840
28.0862
28.0724
28.0727
28.0752
28.0732
28.0703
28.0849
28.0795
28.0902
28.0874
28.0971
28.0638
28.0877
28.0751
28.0904
28.0971
28.0661
28.0711
28.0754
28.0516
28.0961
28.0689
28.1110
28.1062
28.0726
28.1141
28.0913
28.0982
28.0703
28.0654
28.0760
28.0727
28.0850
28.0877
28.0967
28.1185
28.0945
28.0834
28.0764
28.1129
28.0797
28.0707
28.1008
28.0971
28.0826
28.0857
28.0984
28.0869
28.0795
28.0875
28.1184
28.0746
28.0816
28.0879
28.0888
28.0924
28.0979
28.0702
28.0847
28.0917
28.0834
28.0823
28.0917
28.0779
28.0852
28.0863
28.0942
28.0801
28.0817
28.0922
28.0914
28.0868
28.0832
28.0881
28.0910
28.0886
28.0961
28.0857
28.0859
28.1086
28.0838
28.0921
28.0945
28.0839
28.0877
28.0803
28.0928
28.0885
28.0940
28.0856
28.0849
28.0955
28.0955
28.0846
28.0871
28.0872
28.0917
28.0931
28.0865
28.0900
28.0915
28.0963
28.0917
28.0950
28.0898
28.0902
28.0867
28.0843
28.0939
28.0902
28.0911
28.0909
28.0949
28.0867
28.0932
28.0891
28.0932
28.0887
28.0925
28.0928
28.0883
28.0946
28.0977
28.0914
28.0959
28.0926
28.0923
28.0950
28.1006
28.0924
28.0963
28.0893
28.0956
28.0980
28.0928
28.0951
28.0958
28.0912
28.0990
28.0915
28.0957
28.0976
28.0888
28.0928
28.0910
28.0902
28.0950
28.0995
28.0965
28.0972
28.0963
28.0946
28.0942
28.0998
28.0911
28.1043
28.1002
28.0991
28.0959
28.0996
28.0926
28.1002
28.0961
28.0983
28.0997
28.0959
28.0988
28.1029
28.0989
28.1000
28.0944
28.0979
28.1005
28.1012
28.1013
28.0999
28.0991
28.1059
28.0961
28.0981
28.1045
28.1047
28.1042
28.1146
28.1113
28.1051
28.1065
28.1065
28.0985
28.1000
28.1066
28.1041
28.0954
28.1090
4-Plot of
Data
is not valid. Given the linear appearance of the lag plot, the first step
might be to consider a model of the type
data in the first and last thirds was collected in winter while the more
stable middle third was collected in the summer. The seasonal effect
was determined to be caused by the amount of humidity affecting the
measurement equipment. In this case, the solution was to modify the
test equipment to be less sensitive to environmental factors.
Simple graphical techniques can be quite effective in revealing
unexpected results in the data. When this occurs, it is important to
investigate whether the unexpected result is due to problems in the
experiment and data collection, or is in fact indicative of an
unexpected underlying structure in the data. This determination cannot
be made on the basis of statistics alone. The role of the graphical and
statistical analysis is to detect problems or unexpected results in the
data. Resolving the issues requires the knowledge of the scientist or
engineer.
Run
Sequence
Plot
Lag Plot
SUMMARY
***********************************************************************
* LOCATION MEASURES * DISPERSION MEASURES *
***********************************************************************
* MIDRANGE = 0.2797325E+02 * RANGE = 0.2905006E+00 *
* MEAN = 0.2801634E+02 * STAND. DEV. = 0.6349404E-01 *
* MIDMEAN = 0.2802659E+02 * AV. AB. DEV. = 0.5101655E-01 *
* MEDIAN = 0.2802910E+02 * MINIMUM = 0.2782800E+02 *
* = * LOWER QUART. = 0.2797905E+02 *
* = * LOWER HINGE = 0.2797900E+02 *
* = * UPPER HINGE = 0.2806295E+02 *
* = * UPPER QUART. = 0.2806293E+02 *
* = * MAXIMUM = 0.2811850E+02 *
***********************************************************************
* RANDOMNESS MEASURES * DISTRIBUTIONAL MEASURES *
***********************************************************************
* AUTOCO COEF = 0.9721591E+00 * ST. 3RD MOM. = -0.6936395E+00 *
* = 0.0000000E+00 * ST. 4TH MOM. = 0.2689681E+01 *
* = 0.0000000E+00 * ST. WILK-SHA = -0.4216419E+02 *
* = * UNIFORM PPCC = 0.9689648E+00 *
* = * NORMAL PPCC = 0.9718416E+00 *
* = * TUK -.5 PPCC = 0.7334843E+00 *
* = * CAUCHY PPCC = 0.3347875E+00 *
***********************************************************************
Location One way to quantify a change in location over time is to fit a straight line to the data set using
the index variable X = 1, 2, ..., N, with N denoting the number of observations. If there is no
significant drift in the location, the slope parameter estimate should be zero. For this data set,
Dataplot generates the following output:
Variation One simple way to detect a change in variation is with a Bartlett test after dividing the data set
into several equal-sized intervals. However, the Bartlett test is not robust for non-normality.
Since the normality assumption is questionable for these data, we use the alternative Levene
test. In particular, we use the Levene test based on the median rather than the mean. The choice of
the number of intervals is somewhat arbitrary, although values of 4 or 8 are reasonable.
Dataplot generated the following output for the Levene test.
1. STATISTICS
NUMBER OF OBSERVATIONS = 1000
NUMBER OF GROUPS = 4
LEVENE F TEST STATISTIC = 140.8509
Randomness
There are many ways in which data can be non-random. However, most common forms of
non-randomness can be detected with a few simple tests. The lag plot in the 4-plot in the
previous section is a simple graphical technique.
One check is an autocorrelation plot that shows the autocorrelations for various lags.
Confidence bands can be plotted at the 95% and 99% confidence levels. Points outside this
band indicate statistically significant values (lag 0 is always 1). Dataplot generated the
following autocorrelation plot.
The lag 1 autocorrelation, which is generally the one of greatest interest, is 0.97. The critical
values at the 5% significance level are -0.062 and 0.062. This indicates that the lag 1
autocorrelation is statistically significant, so there is strong evidence of non-randomness.
A common test for randomness is the runs test.
RUNS UP
RUNS DOWN
Values in the column labeled "Z" greater than 1.96 or less than -1.96 are statistically significant
at the 5% level. Due to the number of values that are larger than the 1.96 cut-off, we conclude
that the data are not random. However, in this case the evidence from the runs test is not nearly
as strong as it is from the autocorrelation plot.
Distributional Since we rejected the randomness assumption, the distributional tests are not meaningful.
Analysis Therefore, these quantitative tests are omitted. Since the Grubbs' test for outliers also assumes
the approximate normality of the data, we omit Grubbs' test as well.
Univariate It is sometimes useful and convenient to summarize the above results in a report.
Report
2: Location
Mean = 28.01635
Standard Deviation of Mean = 0.002008
95% Confidence Interval for Mean = (28.0124,28.02029)
Drift with respect to location? = NO
3: Variation
Standard Deviation = 0.063495
95% Confidence Interval for SD = (0.060829,0.066407)
Change in variation?
(based on Levene's test on quarters
of the data) = YES
4: Randomness
Autocorrelation = 0.972158
Data Are Random?
(as measured by autocorrelation) = NO
5: Distribution
Distributional test omitted due to
non-randomness of the data
6: Statistical Control
(i.e., no drift in location or scale,
data are random, distribution is
fixed)
Data Set is in Statistical Control? = NO
7: Outliers?
(Grubbs' test omitted due to
non-randomness of the data)
Click on the links below to start Dataplot and run this case study yourself. Each step may use
results from previous steps, so please be patient. Wait until the software verifies that the
current step is complete before clicking on the next step. The links in this column will connect
you with more detailed information about each analysis step from the case study description.
NOTE: This case study has 1,000 points. For better performance, it is highly recommended that
you check the "No Update" box on the Spreadsheet window for this case study. This will suppress
subsequent updating of the Spreadsheet window as the data are created or modified.
1. Generate a run sequence plot. 1. The run sequence plot indicates that there are shifts of
location and variation.
2. Generate the sample mean, a confidence interval for the population mean, and compute a
linear fit to detect drift in location. 2. The mean is 28.0163 and a 95% confidence interval
is (28.0124,28.02029). The linear fit indicates drift in location.
3. Generate the sample standard deviation, a confidence interval for the population standard
deviation, and detect drift in variation by dividing the data into quarters and computing
Levene's test for equal standard deviations. 3. The standard deviation is 0.0635 with a 95%
confidence interval of (0.060829,0.066407). Levene's test indicates significant change in
variation.
Resulting The following are the data used for this case study.
Data
9.206343
9.299992
9.277895
9.305795
9.275351
9.288729
9.287239
9.260973
9.303111
9.275674
9.272561
9.288454
9.255672
9.252141
9.297670
9.266534
9.256689
9.277542
9.248205
9.252107
9.276345
9.278694
9.267144
9.246132
9.238479
9.269058
9.248239
9.257439
9.268481
9.288454
9.258452
9.286130
9.251479
9.257405
9.268343
9.291302
9.219460
9.270386
9.218808
9.241185
9.269989
9.226585
9.258556
9.286184
9.320067
9.327973
9.262963
9.248181
9.238644
9.225073
9.220878
9.271318
9.252072
9.281186
9.270624
9.294771
9.301821
9.278849
9.236680
9.233988
9.244687
9.221601
9.207325
9.258776
9.275708
9.268955
9.257269
9.264979
9.295500
9.292883
9.264188
9.280731
9.267336
9.300566
9.253089
9.261376
9.238409
9.225073
9.235526
9.239510
9.264487
9.244242
9.277542
9.310506
9.261594
9.259791
9.253089
9.245735
9.284058
9.251122
9.275385
9.254619
9.279526
9.275065
9.261952
9.275351
9.252433
9.230263
9.255150
9.268780
9.290389
9.274161
9.255707
9.261663
9.250455
9.261952
9.264041
9.264509
9.242114
9.239674
9.221553
9.241935
9.215265
9.285930
9.271559
9.266046
9.285299
9.268989
9.267987
9.246166
9.231304
9.240768
9.260506
9.274355
9.292376
9.271170
9.267018
9.308838
9.264153
9.278822
9.255244
9.229221
9.253158
9.256292
9.262602
9.219793
9.258452
9.267987
9.267987
9.248903
9.235153
9.242933
9.253453
9.262671
9.242536
9.260803
9.259825
9.253123
9.240803
9.238712
9.263676
9.243002
9.246826
9.252107
9.261663
9.247311
9.306055
9.237646
9.248937
9.256689
9.265777
9.299047
9.244814
9.287205
9.300566
9.256621
9.271318
9.275154
9.281834
9.253158
9.269024
9.282077
9.277507
9.284910
9.239840
9.268344
9.247778
9.225039
9.230750
9.270024
9.265095
9.284308
9.280697
9.263032
9.291851
9.252072
9.244031
9.283269
9.196848
9.231372
9.232963
9.234956
9.216746
9.274107
9.273776
4-Plot of
Data
Run
Sequence
Plot
Lag Plot
Histogram
(with
overlaid
Normal PDF)
Normal
Probability
Plot
SUMMARY
***********************************************************************
* LOCATION MEASURES * DISPERSION MEASURES *
***********************************************************************
* MIDRANGE = 0.9262411E+01 * RANGE = 0.1311255E+00 *
* MEAN = 0.9261460E+01 * STAND. DEV. = 0.2278881E-01 *
* MIDMEAN = 0.9259412E+01 * AV. AB. DEV. = 0.1788945E-01 *
* MEDIAN = 0.9261952E+01 * MINIMUM = 0.9196848E+01 *
* = * LOWER QUART. = 0.9246826E+01 *
* = * LOWER HINGE = 0.9246496E+01 *
* = * UPPER HINGE = 0.9275530E+01 *
* = * UPPER QUART. = 0.9275708E+01 *
* = * MAXIMUM = 0.9327973E+01 *
***********************************************************************
* RANDOMNESS MEASURES * DISTRIBUTIONAL MEASURES *
***********************************************************************
* AUTOCO COEF = 0.2805789E+00 * ST. 3RD MOM. = -0.8537455E-02 *
* = 0.0000000E+00 * ST. 4TH MOM. = 0.3049067E+01 *
* = 0.0000000E+00 * ST. WILK-SHA = 0.9458605E+01 *
* = * UNIFORM PPCC = 0.9735289E+00 *
* = * NORMAL PPCC = 0.9989640E+00 *
* = * TUK -.5 PPCC = 0.8927904E+00 *
* = * CAUCHY PPCC = 0.6360204E+00 *
***********************************************************************
Location One way to quantify a change in location over time is to fit a straight line to the data set using
the index variable X = 1, 2, ..., N, with N denoting the number of observations. If there is no
significant drift in the location, the slope parameter should be zero. For this data set, Dataplot
generates the following output:
Variation One simple way to detect a change in variation is with a Bartlett test after dividing the data set
into several equal-sized intervals. The choice of the number of intervals is somewhat arbitrary,
although values of 4 or 8 are reasonable. Dataplot generated the following output for the
Bartlett test.
BARTLETT TEST
(STANDARD DEFINITION)
NULL HYPOTHESIS UNDER TEST--ALL SIGMA(I) ARE EQUAL
TEST:
DEGREES OF FREEDOM = 3.000000
In this case, since the Bartlett test statistic of 3.14 is less than the critical value at the 5%
significance level of 7.81, we conclude that the standard deviations are not significantly
different in the 4 intervals. That is, the assumption of constant scale is valid.
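The Bartlett test on quarters of the series can be sketched in Python with scipy.stats.bartlett; the synthetic series below stands in for the 195 readings.

import numpy as np
from scipy import stats

# Synthetic stand-in series; replace with the actual 195 readings.
rng = np.random.default_rng(0)
y = 9.2615 + rng.normal(0.0, 0.0228, 195)

# Bartlett test for equal standard deviations across quarters of the series.
groups = np.array_split(y, 4)
statistic, pvalue = stats.bartlett(*groups)
critical = stats.chi2.ppf(0.95, df=len(groups) - 1)   # 5% critical value with 3 degrees of freedom
print("Bartlett statistic:", statistic, " critical value:", critical)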
Randomness There are many ways in which data can be non-random. However, most common forms of
non-randomness can be detected with a few simple tests. The lag plot in the previous section is
a simple graphical technique.
Another check is an autocorrelation plot that shows the autocorrelations for various lags.
Confidence bands can be plotted at the 95% and 99% confidence levels. Points outside this
band indicate statistically significant values (lag 0 is always 1). Dataplot generated the
following autocorrelation plot.
The lag 1 autocorrelation, which is generally the one of greatest interest, is 0.281. The critical
values at the 5% significance level are -0.087 and 0.087. This indicates that the lag 1
autocorrelation is statistically significant, so there is evidence of non-randomness.
A common test for randomness is the runs test.
RUNS UP
RUNS DOWN
Values in the column labeled "Z" greater than 1.96 or less than -1.96 are statistically
significant at the 5% level. The runs test does indicate some non-randomness.
Although the autocorrelation plot and the runs test indicate some mild non-randomness, the
violation of the randomness assumption is not serious enough to warrant developing a more
sophisticated model. It is common in practice that some of the assumptions are mildly violated
and it is a judgement call as to whether or not the violations are serious enough to warrant
developing a more sophisticated model for the data.
Distributional Probability plots are a graphical test for assessing if a particular distribution provides an
Analysis adequate fit to a data set.
A quantitative enhancement to the probability plot is the correlation coefficient of the points
on the probability plot. For this data set the correlation coefficient is 0.996. Since this is
greater than the critical value of 0.987 (this is a tabulated value), the normality assumption is
not rejected.
Chi-square and Kolmogorov-Smirnov goodness-of-fit tests are alternative methods for
assessing distributional adequacy. The Wilk-Shapiro and Anderson-Darling tests can be used
to test for normality. Dataplot generates the following output for the Anderson-Darling
normality test.
1. STATISTICS:
2. CRITICAL VALUES:
90 % POINT = 0.6560000
95 % POINT = 0.7870000
97.5 % POINT = 0.9180000
99 % POINT = 1.092000
The Anderson-Darling test also does not reject the normality assumption because the test
statistic, 0.129, is less than the critical value at the 5% significance level of 0.918.
Outlier A test for outliers is the Grubbs' test. Dataplot generated the following output for Grubbs' test.
Analysis
1. STATISTICS:
NUMBER OF OBSERVATIONS = 195
MINIMUM = 9.196848
MEAN = 9.261460
MAXIMUM = 9.327973
STANDARD DEVIATION = 0.2278881E-01
For this data set, Grubbs' test does not detect any outliers at the 25%, 10%, 5%, and 1%
significance levels.
Model Since the underlying assumptions were validated both graphically and analytically, with a mild
violation of the randomness assumption, we conclude that a reasonable model for the data is:
Yi = C + Ei
We can express the uncertainty for C, here estimated by 9.26146, as the 95% confidence
interval (9.258242,9.264679).
Univariate It is sometimes useful and convenient to summarize the above results in a report. The report
Report for the heat flow meter data follows.
2: Location
Mean = 9.26146
Standard Deviation of Mean = 0.001632
95% Confidence Interval for Mean = (9.258242,9.264679)
Drift with respect to location? = NO
3: Variation
Standard Deviation = 0.022789
95% Confidence Interval for SD = (0.02073,0.025307)
Drift with respect to variation?
(based on Bartlett's test on quarters
of the data) = NO
4: Randomness
Autocorrelation = 0.280579
Data are Random?
(as measured by autocorrelation) = NO
5: Distribution
Normal PPCC = 0.998965
Data are Normal?
(as measured by Normal PPCC) = YES
6: Statistical Control
(i.e., no drift in location or scale,
data are random, distribution is
fixed, here we are testing only for
fixed normal)
Data Set is in Statistical Control? = YES
7: Outliers?
(as determined by Grubbs' test) = NO
Click on the links below to start Dataplot and run this case study yourself. Each step may use
results from previous steps, so please be patient. Wait until the software verifies that the
current step is complete before clicking on the next step. The links in this column will connect
you with more detailed information about each analysis step from the case study description.
1. Generate a run sequence plot. 1. The run sequence plot indicates that
there are no shifts of location or
scale.
2. Generate a lag plot. 2. The lag plot does not indicate any
significant patterns (which would
show the data were not random).
5. Check for normality by computing the 5. The normal probability plot correlation
normal probability plot correlation coefficient is 0.999. At the 5% level,
coefficient. we cannot reject the normality assumption.
6. Check for outliers using Grubbs' test. 6. Grubbs' test detects no outliers at the
5% level.
Purpose of The goal of this case study is to find a good distributional model for the
Analysis data. Once a good distributional model has been determined, various
percent points for glass failure will be computed.
Since the data are failure times, this case study is a form of reliability
analysis. The assessing product reliability chapter contains a more
complete discussion of reliabilty methods. This case study is meant to
complement that chapter by showing the use of graphical techniques in
one aspect of reliability modeling.
Failure times are basically extreme values that do not follow a normal
distribution; non-parametric methods (techniques that do not rely on a
specific distribution) are frequently recommended for developing
confidence intervals for failure data. One problem with this approach is
that sample sizes are often small due to the expense involved in
collecting the data, and non-parametric methods do not work well for
small sample sizes. For this reason, a parametric method based on a
specific distributional model of the data is preferred if the data can be
shown to follow a specific distribution. Parametric models typically
have greater efficiency at the cost of more specific assumptions about
the data, but, it is important to verify that the distributional assumption
is indeed valid. If the distributional assumption is not justified, then the
conclusions drawn from the model may not be valid.
This file can be read by Dataplot with the following commands:
SKIP 25
READ FULLER2.DAT Y
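A rough Python equivalent of these Dataplot commands, assuming the same file is available locally with its 25-line header, is:

import numpy as np

# Skip the 25 header lines and read the single column of failure times.
y = np.loadtxt("FULLER2.DAT", skiprows=25)
print(len(y), "observations read")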
Resulting The following are the data used for this case study.
Data
18.830
20.800
21.657
23.030
23.230
24.050
24.321
25.500
25.520
25.800
26.690
26.770
26.780
27.050
27.670
29.900
31.110
33.200
33.730
33.760
33.890
34.760
35.750
35.910
36.980
37.080
37.090
39.580
44.045
45.290
45.381
● The data are somewhat symmetric, but with a gap in the middle.
The normal probability plot has a correlation coefficient of 0.980. We can use
this number as a reference baseline when comparing the performance of other
distributional fits.
Other Potential There are a large number of distributions that could serve as candidate models
Distributions for the data. However, we will restrict ourselves to the following distributional
models because these have proven to be useful in reliability studies.
1. Normal distribution
2. Exponential distribution
3. Weibull distribution
4. Lognormal distribution
5. Gamma distribution
6. Power normal distribution
7. Power lognormal distribution
Analyses for We analyzed the data using the approach described above for the following
Specific distributional models:
Distributions 1. Normal distribution - from the 4-plot above, the PPCC value was 0.980.
2. Exponential distribution - the exponential distribution is a special case of
the Weibull with shape parameter equal to 1. If the Weibull analysis
yields a shape parameter close to 1, then we would consider using the
simpler exponential model.
3. Weibull distribution
4. Lognormal distribution
5. Gamma distribution
6. Power normal distribution
7. Power lognormal distribution
Percent Point The final step in this analysis is to compute percent point estimates for the 1%,
Estimates 2.5%, 5%, 95%, 97.5%, and 99% percent points. A percent point estimate is an
estimate of the time by which a given percentage of the units will have failed.
For example, the 5% point is the time at which we estimate 5% of the units will
have failed.
To calculate these values, we use the Weibull percent point function with the
appropriate estimates of the shape, location, and scale parameters. The Weibull
percent point function can be computed in many general purpose statistical
software programs, including Dataplot.
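For example, with scipy one can evaluate the Weibull percent point function at the desired probabilities. The shape and scale used below are taken from the Anderson-Darling Weibull output later in this section; the location parameter is set to zero here as a simplifying assumption, so this sketch corresponds to a 2-parameter Weibull rather than the case study's 3-parameter fit.

from scipy import stats

# Shape and scale from the Weibull output below; loc=0 is an assumption.
shape, loc, scale = 4.635379, 0.0, 33.67423
for p in (0.01, 0.025, 0.05, 0.95, 0.975, 0.99):
    # Estimated time by which the fraction p of units will have failed.
    print(f"{p:>6.3f}  {stats.weibull_min.ppf(p, shape, loc=loc, scale=scale):9.3f}")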
Normal
Anderson-Darling **************************************
** Anderson-Darling normal test y **
Output **************************************
1. STATISTICS:
NUMBER OF OBSERVATIONS = 31
MEAN = 30.81142
STANDARD DEVIATION = 7.253381
2. CRITICAL VALUES:
90 % POINT = 0.6160000
95 % POINT = 0.7350000
97.5 % POINT = 0.8610000
99 % POINT = 1.021000
Lognormal
Anderson-Darling *****************************************
** Anderson-Darling lognormal test y **
Output *****************************************
1. STATISTICS:
NUMBER OF OBSERVATIONS = 31
MEAN = 3.401242
STANDARD DEVIATION = 0.2349026
2. CRITICAL VALUES:
90 % POINT = 0.6160000
95 % POINT = 0.7350000
97.5 % POINT = 0.8610000
99 % POINT = 1.021000
Weibull
Anderson-Darling ***************************************
** Anderson-Darling Weibull test y **
Output ***************************************
1. STATISTICS:
NUMBER OF OBSERVATIONS = 31
MEAN = 30.81142
STANDARD DEVIATION = 7.253381
SHAPE PARAMETER = 4.635379
SCALE PARAMETER = 33.67423
2. CRITICAL VALUES:
90 % POINT = 0.6370000
95 % POINT = 0.7570000
97.5 % POINT = 0.8770000
99 % POINT = 1.038000
Alternative The Weibull plot and the Weibull hazard plot are alternative graphical
Plots analysis procedures to the PPCC plots and probability plots.
These two procedures, especially the Weibull plot, are very commonly
employed. That notwithstanding, the disadvantage of these two
procedures is that they both assume that the location parameter (i.e., the
lower bound) is zero and that we are fitting a 2-parameter Weibull
instead of a 3-parameter Weibull. The advantage is that there is an
extensive literature on these methods and they have been designed to
work with either censored or uncensored data.
Weibull Plot
Weibull
Hazard Plot
Click on the links below to start Dataplot and run this case study
yourself. Each step may use results from previous steps, so please be The links in this column will connect you with more detailed information
patient. Wait until the software verifies that the current step is about each analysis step from the case study description.
complete before clicking on the next step.
4. Direction (X4)
For this case study, we are using only half the data. Specifically, we are using
the data with the direction longitudinal. Therefore, we have only three primary
factors
In addition, we are interested in the nuisance factors
1. Lab
2. Batch
The complete file can be read into Dataplot with the following commands:
DIMENSION 20 VARIABLES
SKIP 50
READ JAHANMI2.DAT RUN RUN LAB BAR SET Y X1 TO X8 BATCH
Resulting The following are the data used for this case study
Data
Run Lab Batch Y X1 X2 X3
1 1 1 608.781 -1 -1 -1
2 1 2 569.670 -1 -1 -1
3 1 1 689.556 -1 -1 -1
4 1 2 747.541 -1 -1 -1
5 1 1 618.134 -1 -1 -1
6 1 2 612.182 -1 -1 -1
7 1 1 680.203 -1 -1 -1
8 1 2 607.766 -1 -1 -1
9 1 1 726.232 -1 -1 -1
10 1 2 605.380 -1 -1 -1
11 1 1 518.655 -1 -1 -1
12 1 2 589.226 -1 -1 -1
13 1 1 740.447 -1 -1 -1
14 1 2 588.375 -1 -1 -1
15 1 1 666.830 -1 -1 -1
16 1 2 531.384 -1 -1 -1
17 1 1 710.272 -1 -1 -1
18 1 2 633.417 -1 -1 -1
19 1 1 751.669 -1 -1 -1
20 1 2 619.060 -1 -1 -1
21 1 1 697.979 -1 -1 -1
22 1 2 632.447 -1 -1 -1
23 1 1 708.583 -1 -1 -1
24 1 2 624.256 -1 -1 -1
25 1 1 624.972 -1 -1 -1
26 1 2 575.143 -1 -1 -1
27 1 1 695.070 -1 -1 -1
28 1 2 549.278 -1 -1 -1
29 1 1 769.391 -1 -1 -1
30 1 2 624.972 -1 -1 -1
61 1 1 720.186 -1 1 1
62 1 2 587.695 -1 1 1
63 1 1 723.657 -1 1 1
64 1 2 569.207 -1 1 1
65 1 1 703.700 -1 1 1
66 1 2 613.257 -1 1 1
67 1 1 697.626 -1 1 1
68 1 2 565.737 -1 1 1
69 1 1 714.980 -1 1 1
70 1 2 662.131 -1 1 1
71 1 1 657.712 -1 1 1
72 1 2 543.177 -1 1 1
73 1 1 609.989 -1 1 1
74 1 2 512.394 -1 1 1
75 1 1 650.771 -1 1 1
76 1 2 611.190 -1 1 1
77 1 1 707.977 -1 1 1
78 1 2 659.982 -1 1 1
79 1 1 712.199 -1 1 1
80 1 2 569.245 -1 1 1
81 1 1 709.631 -1 1 1
82 1 2 725.792 -1 1 1
83 1 1 703.160 -1 1 1
84 1 2 608.960 -1 1 1
85 1 1 744.822 -1 1 1
86 1 2 586.060 -1 1 1
87 1 1 719.217 -1 1 1
88 1 2 617.441 -1 1 1
89 1 1 619.137 -1 1 1
90 1 2 592.845 -1 1 1
151 2 1 753.333 1 1 1
152 2 2 631.754 1 1 1
153 2 1 677.933 1 1 1
154 2 2 588.113 1 1 1
155 2 1 735.919 1 1 1
156 2 2 555.724 1 1 1
157 2 1 695.274 1 1 1
158 2 2 702.411 1 1 1
159 2 1 504.167 1 1 1
160 2 2 631.754 1 1 1
161 2 1 693.333 1 1 1
162 2 2 698.254 1 1 1
163 2 1 625.000 1 1 1
164 2 2 616.791 1 1 1
165 2 1 596.667 1 1 1
166 2 2 551.953 1 1 1
167 2 1 640.898 1 1 1
168 2 2 636.738 1 1 1
169 2 1 720.506 1 1 1
170 2 2 571.551 1 1 1
171 2 1 700.748 1 1 1
172 2 2 521.667 1 1 1
173 2 1 691.604 1 1 1
174 2 2 587.451 1 1 1
175 2 1 636.738 1 1 1
176 2 2 700.422 1 1 1
177 2 1 731.667 1 1 1
178 2 2 595.819 1 1 1
179 2 1 635.079 1 1 1
180 2 2 534.236 1 1 1
181 2 1 716.926 1 -1 -1
182 2 2 606.188 1 -1 -1
183 2 1 759.581 1 -1 -1
184 2 2 575.303 1 -1 -1
185 2 1 673.903 1 -1 -1
186 2 2 590.628 1 -1 -1
187 2 1 736.648 1 -1 -1
188 2 2 729.314 1 -1 -1
189 2 1 675.957 1 -1 -1
190 2 2 619.313 1 -1 -1
191 2 1 729.230 1 -1 -1
192 2 2 624.234 1 -1 -1
193 2 1 697.239 1 -1 -1
194 2 2 651.304 1 -1 -1
195 2 1 728.499 1 -1 -1
196 2 2 724.175 1 -1 -1
197 2 1 797.662 1 -1 -1
198 2 2 583.034 1 -1 -1
199 2 1 668.530 1 -1 -1
200 2 2 620.227 1 -1 -1
201 2 1 815.754 1 -1 -1
202 2 2 584.861 1 -1 -1
203 2 1 777.392 1 -1 -1
204 2 2 565.391 1 -1 -1
205 2 1 712.140 1 -1 -1
206 2 2 622.506 1 -1 -1
207 2 1 663.622 1 -1 -1
208 2 2 628.336 1 -1 -1
209 2 1 684.181 1 -1 -1
210 2 2 587.145 1 -1 -1
271 3 1 629.012 1 -1 1
272 3 2 584.319 1 -1 1
273 3 1 640.193 1 -1 1
274 3 2 538.239 1 -1 1
275 3 1 644.156 1 -1 1
276 3 2 538.097 1 -1 1
277 3 1 642.469 1 -1 1
278 3 2 595.686 1 -1 1
279 3 1 639.090 1 -1 1
280 3 2 648.935 1 -1 1
281 3 1 439.418 1 -1 1
282 3 2 583.827 1 -1 1
283 3 1 614.664 1 -1 1
284 3 2 534.905 1 -1 1
285 3 1 537.161 1 -1 1
286 3 2 569.858 1 -1 1
287 3 1 656.773 1 -1 1
288 3 2 617.246 1 -1 1
289 3 1 659.534 1 -1 1
290 3 2 610.337 1 -1 1
291 3 1 695.278 1 -1 1
292 3 2 584.192 1 -1 1
293 3 1 734.040 1 -1 1
294 3 2 598.853 1 -1 1
295 3 1 687.665 1 -1 1
296 3 2 554.774 1 -1 1
297 3 1 710.858 1 -1 1
298 3 2 605.694 1 -1 1
299 3 1 701.716 1 -1 1
300 3 2 627.516 1 -1 1
301 3 1 382.133 1 1 -1
302 3 2 574.522 1 1 -1
303 3 1 719.744 1 1 -1
304 3 2 582.682 1 1 -1
305 3 1 756.820 1 1 -1
306 3 2 563.872 1 1 -1
307 3 1 690.978 1 1 -1
308 3 2 715.962 1 1 -1
309 3 1 670.864 1 1 -1
310 3 2 616.430 1 1 -1
311 3 1 670.308 1 1 -1
312 3 2 778.011 1 1 -1
313 3 1 660.062 1 1 -1
314 3 2 604.255 1 1 -1
315 3 1 790.382 1 1 -1
316 3 2 571.906 1 1 -1
317 3 1 714.750 1 1 -1
318 3 2 625.925 1 1 -1
319 3 1 716.959 1 1 -1
320 3 2 682.426 1 1 -1
321 3 1 603.363 1 1 -1
322 3 2 707.604 1 1 -1
323 3 1 713.796 1 1 -1
324 3 2 617.400 1 1 -1
325 3 1 444.963 1 1 -1
326 3 2 689.576 1 1 -1
327 3 1 723.276 1 1 -1
328 3 2 676.678 1 1 -1
329 3 1 745.527 1 1 -1
330 3 2 563.290 1 1 -1
361 4 1 778.333 -1 -1 1
362 4 2 581.879 -1 -1 1
363 4 1 723.349 -1 -1 1
364 4 2 447.701 -1 -1 1
365 4 1 708.229 -1 -1 1
366 4 2 557.772 -1 -1 1
367 4 1 681.667 -1 -1 1
368 4 2 593.537 -1 -1 1
369 4 1 566.085 -1 -1 1
370 4 2 632.585 -1 -1 1
371 4 1 687.448 -1 -1 1
372 4 2 671.350 -1 -1 1
373 4 1 597.500 -1 -1 1
374 4 2 569.530 -1 -1 1
375 4 1 637.410 -1 -1 1
376 4 2 581.667 -1 -1 1
377 4 1 755.864 -1 -1 1
378 4 2 643.449 -1 -1 1
379 4 1 692.945 -1 -1 1
380 4 2 581.593 -1 -1 1
381 4 1 766.532 -1 -1 1
382 4 2 494.122 -1 -1 1
383 4 1 725.663 -1 -1 1
384 4 2 620.948 -1 -1 1
385 4 1 698.818 -1 -1 1
386 4 2 615.903 -1 -1 1
387 4 1 760.000 -1 -1 1
388 4 2 606.667 -1 -1 1
389 4 1 775.272 -1 -1 1
390 4 2 579.167 -1 -1 1
421 4 1 708.885 -1 1 -1
422 4 2 662.510 -1 1 -1
423 4 1 727.201 -1 1 -1
424 4 2 436.237 -1 1 -1
425 4 1 642.560 -1 1 -1
426 4 2 644.223 -1 1 -1
427 4 1 690.773 -1 1 -1
428 4 2 586.035 -1 1 -1
429 4 1 688.333 -1 1 -1
430 4 2 620.833 -1 1 -1
431 4 1 743.973 -1 1 -1
432 4 2 652.535 -1 1 -1
433 4 1 682.461 -1 1 -1
434 4 2 593.516 -1 1 -1
435 4 1 761.430 -1 1 -1
436 4 2 587.451 -1 1 -1
437 4 1 691.542 -1 1 -1
438 4 2 570.964 -1 1 -1
439 4 1 643.392 -1 1 -1
440 4 2 645.192 -1 1 -1
441 4 1 697.075 -1 1 -1
442 4 2 540.079 -1 1 -1
443 4 1 708.229 -1 1 -1
444 4 2 707.117 -1 1 -1
445 4 1 746.467 -1 1 -1
446 4 2 621.779 -1 1 -1
447 4 1 744.819 -1 1 -1
448 4 2 585.777 -1 1 -1
449 4 1 655.029 -1 1 -1
450 4 2 703.980 -1 1 -1
541 5 1 715.224 -1 -1 -1
542 5 2 698.237 -1 -1 -1
543 5 1 614.417 -1 -1 -1
544 5 2 757.120 -1 -1 -1
545 5 1 761.363 -1 -1 -1
546 5 2 621.751 -1 -1 -1
547 5 1 716.106 -1 -1 -1
548 5 2 472.125 -1 -1 -1
549 5 1 659.502 -1 -1 -1
550 5 2 612.700 -1 -1 -1
551 5 1 730.781 -1 -1 -1
552 5 2 583.170 -1 -1 -1
553 5 1 546.928 -1 -1 -1
554 5 2 599.771 -1 -1 -1
555 5 1 734.203 -1 -1 -1
556 5 2 549.227 -1 -1 -1
557 5 1 682.051 -1 -1 -1
558 5 2 605.453 -1 -1 -1
559 5 1 701.341 -1 -1 -1
560 5 2 569.599 -1 -1 -1
561 5 1 759.729 -1 -1 -1
562 5 2 637.233 -1 -1 -1
563 5 1 689.942 -1 -1 -1
564 5 2 621.774 -1 -1 -1
565 5 1 769.424 -1 -1 -1
566 5 2 558.041 -1 -1 -1
567 5 1 715.286 -1 -1 -1
568 5 2 583.170 -1 -1 -1
569 5 1 776.197 -1 -1 -1
570 5 2 345.294 -1 -1 -1
571 5 1 547.099 1 -1 1
572 5 2 570.999 1 -1 1
573 5 1 619.942 1 -1 1
574 5 2 603.232 1 -1 1
575 5 1 696.046 1 -1 1
576 5 2 595.335 1 -1 1
577 5 1 573.109 1 -1 1
578 5 2 581.047 1 -1 1
579 5 1 638.794 1 -1 1
580 5 2 455.878 1 -1 1
581 5 1 708.193 1 -1 1
582 5 2 627.880 1 -1 1
583 5 1 502.825 1 -1 1
584 5 2 464.085 1 -1 1
585 5 1 632.633 1 -1 1
586 5 2 596.129 1 -1 1
587 5 1 683.382 1 -1 1
588 5 2 640.371 1 -1 1
589 5 1 684.812 1 -1 1
590 5 2 621.471 1 -1 1
591 5 1 738.161 1 -1 1
592 5 2 612.727 1 -1 1
593 5 1 671.492 1 -1 1
594 5 2 606.460 1 -1 1
595 5 1 709.771 1 -1 1
596 5 2 571.760 1 -1 1
597 5 1 685.199 1 -1 1
598 5 2 599.304 1 -1 1
599 5 1 624.973 1 -1 1
600 5 2 579.459 1 -1 1
601 6 1 757.363 1 1 1
602 6 2 761.511 1 1 1
603 6 1 633.417 1 1 1
604 6 2 566.969 1 1 1
605 6 1 658.754 1 1 1
606 6 2 654.397 1 1 1
607 6 1 664.666 1 1 1
608 6 2 611.719 1 1 1
609 6 1 663.009 1 1 1
610 6 2 577.409 1 1 1
611 6 1 773.226 1 1 1
612 6 2 576.731 1 1 1
613 6 1 708.261 1 1 1
614 6 2 617.441 1 1 1
615 6 1 739.086 1 1 1
616 6 2 577.409 1 1 1
617 6 1 667.786 1 1 1
618 6 2 548.957 1 1 1
619 6 1 674.481 1 1 1
620 6 2 623.315 1 1 1
621 6 1 695.688 1 1 1
622 6 2 621.761 1 1 1
623 6 1 588.288 1 1 1
624 6 2 553.978 1 1 1
625 6 1 545.610 1 1 1
626 6 2 657.157 1 1 1
627 6 1 752.305 1 1 1
628 6 2 610.882 1 1 1
629 6 1 684.523 1 1 1
630 6 2 552.304 1 1 1
631 6 1 717.159 -1 1 -1
632 6 2 545.303 -1 1 -1
633 6 1 721.343 -1 1 -1
634 6 2 651.934 -1 1 -1
635 6 1 750.623 -1 1 -1
636 6 2 635.240 -1 1 -1
637 6 1 776.488 -1 1 -1
638 6 2 641.083 -1 1 -1
639 6 1 750.623 -1 1 -1
640 6 2 645.321 -1 1 -1
641 6 1 600.840 -1 1 -1
642 6 2 566.127 -1 1 -1
643 6 1 686.196 -1 1 -1
644 6 2 647.844 -1 1 -1
645 6 1 687.870 -1 1 -1
646 6 2 554.815 -1 1 -1
647 6 1 725.527 -1 1 -1
648 6 2 620.087 -1 1 -1
649 6 1 658.796 -1 1 -1
650 6 2 711.301 -1 1 -1
651 6 1 690.380 -1 1 -1
652 6 2 644.355 -1 1 -1
653 6 1 737.144 -1 1 -1
654 6 2 713.812 -1 1 -1
655 6 1 663.851 -1 1 -1
656 6 2 696.707 -1 1 -1
657 6 1 766.630 -1 1 -1
658 6 2 589.453 -1 1 -1
659 6 1 625.922 -1 1 -1
660 6 2 634.468 -1 1 -1
721 7 1 694.430 1 1 -1
722 7 2 599.751 1 1 -1
723 7 1 730.217 1 1 -1
724 7 2 624.542 1 1 -1
725 7 1 700.770 1 1 -1
726 7 2 723.505 1 1 -1
727 7 1 722.242 1 1 -1
728 7 2 674.717 1 1 -1
729 7 1 763.828 1 1 -1
730 7 2 608.539 1 1 -1
731 7 1 695.668 1 1 -1
732 7 2 612.135 1 1 -1
733 7 1 688.887 1 1 -1
734 7 2 591.935 1 1 -1
735 7 1 531.021 1 1 -1
736 7 2 676.656 1 1 -1
737 7 1 698.915 1 1 -1
738 7 2 647.323 1 1 -1
739 7 1 735.905 1 1 -1
740 7 2 811.970 1 1 -1
741 7 1 732.039 1 1 -1
742 7 2 603.883 1 1 -1
743 7 1 751.832 1 1 -1
744 7 2 608.643 1 1 -1
745 7 1 618.663 1 1 -1
746 7 2 630.778 1 1 -1
747 7 1 744.845 1 1 -1
748 7 2 623.063 1 1 -1
749 7 1 690.826 1 1 -1
750 7 2 472.463 1 1 -1
811 7 1 666.893 -1 1 1
812 7 2 645.932 -1 1 1
813 7 1 759.860 -1 1 1
814 7 2 577.176 -1 1 1
815 7 1 683.752 -1 1 1
816 7 2 567.530 -1 1 1
817 7 1 729.591 -1 1 1
818 7 2 821.654 -1 1 1
819 7 1 730.706 -1 1 1
820 7 2 684.490 -1 1 1
821 7 1 763.124 -1 1 1
822 7 2 600.427 -1 1 1
823 7 1 724.193 -1 1 1
824 7 2 686.023 -1 1 1
825 7 1 630.352 -1 1 1
826 7 2 628.109 -1 1 1
827 7 1 750.338 -1 1 1
828 7 2 605.214 -1 1 1
829 7 1 752.417 -1 1 1
830 7 2 640.260 -1 1 1
831 7 1 707.899 -1 1 1
832 7 2 700.767 -1 1 1
833 7 1 715.582 -1 1 1
834 7 2 665.924 -1 1 1
835 7 1 728.746 -1 1 1
836 7 2 555.926 -1 1 1
837 7 1 591.193 -1 1 1
838 7 2 543.299 -1 1 1
839 7 1 592.252 -1 1 1
840 7 2 511.030 -1 1 1
901 8 1 740.833 -1 -1 1
902 8 2 583.994 -1 -1 1
903 8 1 786.367 -1 -1 1
904 8 2 611.048 -1 -1 1
905 8 1 712.386 -1 -1 1
906 8 2 623.338 -1 -1 1
907 8 1 738.333 -1 -1 1
908 8 2 679.585 -1 -1 1
909 8 1 741.480 -1 -1 1
910 8 2 665.004 -1 -1 1
911 8 1 729.167 -1 -1 1
912 8 2 655.860 -1 -1 1
913 8 1 795.833 -1 -1 1
914 8 2 715.711 -1 -1 1
915 8 1 723.502 -1 -1 1
916 8 2 611.999 -1 -1 1
917 8 1 718.333 -1 -1 1
918 8 2 577.722 -1 -1 1
919 8 1 768.080 -1 -1 1
920 8 2 615.129 -1 -1 1
921 8 1 747.500 -1 -1 1
922 8 2 540.316 -1 -1 1
923 8 1 775.000 -1 -1 1
924 8 2 711.667 -1 -1 1
925 8 1 760.599 -1 -1 1
926 8 2 639.167 -1 -1 1
927 8 1 758.333 -1 -1 1
928 8 2 549.491 -1 -1 1
929 8 1 682.500 -1 -1 1
930 8 2 684.167 -1 -1 1
931 8 1 658.116 1 -1 -1
932 8 2 672.153 1 -1 -1
933 8 1 738.213 1 -1 -1
934 8 2 594.534 1 -1 -1
935 8 1 681.236 1 -1 -1
936 8 2 627.650 1 -1 -1
937 8 1 704.904 1 -1 -1
938 8 2 551.870 1 -1 -1
939 8 1 693.623 1 -1 -1
940 8 2 594.534 1 -1 -1
941 8 1 624.993 1 -1 -1
942 8 2 602.660 1 -1 -1
943 8 1 700.228 1 -1 -1
944 8 2 585.450 1 -1 -1
945 8 1 611.874 1 -1 -1
946 8 2 555.724 1 -1 -1
947 8 1 579.167 1 -1 -1
948 8 2 574.934 1 -1 -1
949 8 1 720.872 1 -1 -1
950 8 2 584.625 1 -1 -1
951 8 1 690.320 1 -1 -1
952 8 2 555.724 1 -1 -1
953 8 1 677.933 1 -1 -1
954 8 2 611.874 1 -1 -1
955 8 1 674.600 1 -1 -1
956 8 2 698.254 1 -1 -1
957 8 1 611.999 1 -1 -1
958 8 2 748.130 1 -1 -1
959 8 1 530.680 1 -1 -1
960 8 2 689.942 1 -1 -1
SUMMARY
***********************************************************************
* LOCATION MEASURES * DISPERSION MEASURES *
***********************************************************************
* MIDRANGE = 0.5834740E+03 * RANGE = 0.4763600E+03 *
* MEAN = 0.6500773E+03 * STAND. DEV. = 0.7463826E+02 *
* MIDMEAN = 0.6426155E+03 * AV. AB. DEV. = 0.6184948E+02 *
* MEDIAN = 0.6466275E+03 * MINIMUM = 0.3452940E+03 *
* = * LOWER QUART. = 0.5960515E+03 *
* = * LOWER HINGE = 0.5959740E+03 *
* = * UPPER HINGE = 0.7084220E+03 *
* = * UPPER QUART. = 0.7083415E+03 *
* = * MAXIMUM = 0.8216540E+03 *
***********************************************************************
* RANDOMNESS MEASURES * DISTRIBUTIONAL MEASURES *
***********************************************************************
* AUTOCO COEF = -0.2290508E+00 * ST. 3RD MOM. = -0.3682922E+00 *
* = 0.0000000E+00 * ST. 4TH MOM. = 0.3220554E+01 *
* = 0.0000000E+00 * ST. WILK-SHA = 0.3877698E+01 *
* = * UNIFORM PPCC = 0.9756916E+00 *
* = * NORMAL PPCC = 0.9906310E+00 *
* = * TUK -.5 PPCC = 0.8357126E+00 *
* = * CAUCHY PPCC = 0.5063868E+00 *
***********************************************************************
From the above output, the mean strength is 650.08 and the standard deviation of the
strength is 74.64.
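For readers who want to reproduce these summary measures outside of Dataplot, a minimal Python sketch follows. The file name ceramic_strength.dat and the column order (run, lab, batch, strength, X1, X2, X3) are assumptions based on the data listing above, not part of the original case study files.

import pandas as pd

# Assumed file name and column order -- adjust to match the actual data file.
cols = ["run", "lab", "batch", "strength", "x1", "x2", "x3"]
df = pd.read_csv("ceramic_strength.dat", sep=r"\s+", names=cols)

y = df["strength"]
print("mean      :", round(y.mean(), 2))        # approx. 650.08
print("std. dev. :", round(y.std(ddof=1), 2))   # approx. 74.64
print("median    :", round(y.median(), 2))
print("quartiles :", y.quantile([0.25, 0.75]).round(2).tolist())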
Bihistogram [plot omitted]
Quantile-Quantile Plot [plot omitted]
Box Plot [plot omitted]
The box plot of strength by batch shows the following.
1. Batch 1 is centered well above batch 2; the location difference is on the order of 75 to 100 units.
2. The spread is reasonably similar for both batches, perhaps slightly larger for batch 1.
3. Both batches have a number of outliers on the low side, and batch 2 also has a few outliers on the high side. Box plots are a particularly effective method for identifying the presence of outliers (a small plotting sketch follows this list).
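A comparable batch-by-batch box plot can be drawn with matplotlib, continuing from the data frame df loaded in the earlier sketch (this is only an illustration, not the Dataplot commands used in the case study).

import matplotlib.pyplot as plt

# Side-by-side box plots of strength for the two batches.
groups = [df.loc[df["batch"] == b, "strength"] for b in (1, 2)]
plt.boxplot(groups)
plt.xticks([1, 2], ["Batch 1", "Batch 2"])
plt.ylabel("Strength")
plt.title("Ceramic strength by batch")
plt.show()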
Block Plots A block plot is generated for each of the eight labs, with "1" and "2" denoting the batch
numbers. In the first plot, we do not include any of the primary factors. The next 3
block plots include one of the primary factors. Note that each of the 3 primary factors
(table speed = X1, down feed rate = X2, wheel grit size = X3) has 2 levels. With 8 labs
and 2 levels for the primary factor, we would expect 16 separate blocks on these plots.
The fact that some of these blocks are missing indicates that some of the combinations
of lab and primary factor are empty.
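Dataplot's block plot has no direct one-line equivalent in most general-purpose packages. A simplified approximation, continuing from df, is to mark the two batch means within each lab with the characters "1" and "2" and join them with a line; this sketch captures the spirit of the first block plot (no primary factors included) rather than reproducing it exactly.

import matplotlib.pyplot as plt

# Simplified block-plot-style display: batch means within each lab.
means = df.groupby(["lab", "batch"])["strength"].mean().unstack()
for lab, row in means.iterrows():
    plt.plot([lab, lab], [row.loc[1], row.loc[2]], color="gray")
    plt.text(lab, row.loc[1], "1", ha="center", va="center")
    plt.text(lab, row.loc[2], "2", ha="center", va="center")
plt.xlabel("Lab")
plt.ylabel("Mean strength")
plt.title("Batch means within each lab (block-plot approximation)")
plt.show()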
Quantitative We can confirm some of the conclusions drawn from the above graphics by using
Techniques quantitative techniques. The two sample t-test can be used to test whether or not the
means from the two batches are equal and the F-test can be used to test whether or not
the standard deviations from the two batches are equal.
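The same two tests can be run outside Dataplot. The sketch below uses scipy and continues from the df data frame defined earlier; it is illustrative only, using a pooled-variance t-test and a variance-ratio F-test.

from scipy import stats

y1 = df.loc[df["batch"] == 1, "strength"]
y2 = df.loc[df["batch"] == 2, "strength"]

# Two-sample t-test of equal means (pooled variance).
t_stat, t_p = stats.ttest_ind(y1, y2, equal_var=True)
print("t =", round(t_stat, 3), " two-sided p =", t_p)

# F-test of equal standard deviations: ratio of sample variances referred
# to an F distribution with (n1 - 1, n2 - 1) degrees of freedom.
f_stat = y1.var(ddof=1) / y2.var(ddof=1)
cdf = stats.f.cdf(f_stat, len(y1) - 1, len(y2) - 1)
print("F =", round(f_stat, 3), " two-sided p =", round(2 * min(cdf, 1 - cdf), 3))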
Two Sample The following is the Dataplot output from the two sample t-test.
T-Test
T-TEST
(2-SAMPLE)
NULL HYPOTHESIS UNDER TEST--POPULATION MEANS MU1 = MU2
SAMPLE 1:
NUMBER OF OBSERVATIONS = 240
MEAN = 688.9987
STANDARD DEVIATION = 65.54909
STANDARD DEVIATION OF MEAN = 4.231175
SAMPLE 2:
NUMBER OF OBSERVATIONS = 240
MEAN = 611.1559
STANDARD DEVIATION = 61.85425
STANDARD DEVIATION OF MEAN = 3.992675
ALTERNATIVE HYPOTHESIS    ACCEPTANCE INTERVAL    CONCLUSION
MU1 <> MU2 (0,0.025) (0.975,1) ACCEPT
MU1 < MU2 (0,0.05) REJECT
MU1 > MU2 (0.95,1) ACCEPT
The t-test indicates that the mean for batch 1 is significantly larger than the mean for batch 2 (at the 5% significance level).
F-TEST
NULL HYPOTHESIS UNDER TEST--SIGMA1 = SIGMA2
ALTERNATIVE HYPOTHESIS UNDER TEST--SIGMA1 NOT EQUAL SIGMA2
SAMPLE 1:
NUMBER OF OBSERVATIONS = 240
MEAN = 688.9987
STANDARD DEVIATION = 65.54909
SAMPLE 2:
NUMBER OF OBSERVATIONS = 240
MEAN = 611.1559
STANDARD DEVIATION = 61.85425
TEST:
STANDARD DEV. (NUMERATOR) = 65.54909
STANDARD DEV. (DENOMINATOR) = 61.85425
F-TEST STATISTIC VALUE = 1.123037
DEG. OF FREEDOM (NUMER.) = 239.0000
DEG. OF FREEDOM (DENOM.) = 239.0000
F-TEST STATISTIC CDF VALUE = 0.814808
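Reading this output: a CDF value of 0.8148 corresponds to a two-sided p-value of roughly 2 × (1 − 0.8148) ≈ 0.37, well above 0.05, so the hypothesis of equal standard deviations is not rejected.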
Conclusions We can draw the following conclusions from the above analysis.
1. There is in fact a significant batch effect. This batch effect is consistent across
labs and primary factors.
2. The magnitude of the difference is on the order of 75 to 100 (with batch 2 being
smaller than batch 1). The standard deviations do not appear to be significantly
different.
3. There is some skewness in the batches.
This batch effect was completely unexpected by the scientific investigators in this
study.
Note that although the quantitative techniques support the conclusions of unequal
means and equal standard deviations, they do not show the more subtle features of the
data such as the presence of outliers and the skewness of the batch 2 data.
Box Plot for Given that the previous section showed a distinct batch effect, the next
Batch 1 step is to generate the box plots for the two batches separately.
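Continuing the earlier sketches (and reusing the df data frame), per-lab box plots for a single batch could be drawn as follows; the batch 2 version is the same with the filter changed.

import matplotlib.pyplot as plt

# Box plots of strength by lab, restricted to batch 1.
b1 = df[df["batch"] == 1]
labs = sorted(b1["lab"].unique())
data = [b1.loc[b1["lab"] == lab, "strength"] for lab in labs]
plt.boxplot(data)
plt.xticks(range(1, len(labs) + 1), [str(lab) for lab in labs])
plt.xlabel("Lab")
plt.ylabel("Strength")
plt.title("Batch 1 strength by lab")
plt.show()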
Conclusions We can draw the following conclusions about a possible lab effect from
the above box plots.
1. The batch effect (of approximately 75 to 100 units) on location
dominates any lab effects.
2. It is reasonable to treat the labs as homogeneous.
Dex Scatter Plot for Batch 1 [plot omitted]
Dex Mean Plot for Batch 1 [plot omitted]
Dex SD Plot for Batch 1
This dex standard deviation plot shows the following for batch 1.
1. The table speed factor (X1) has a significant difference in
variability between the levels of the factor. The difference is
approximately 20 units.
2. The wheel grit factor (X3) and the feed rate factor (X2) have
minimal differences in variability.
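The numbers behind a dex standard deviation plot are simply the response standard deviations at the two levels of each factor. A minimal sketch for batch 1, continuing from df:

# Standard deviation of strength at each level (-1, +1) of each factor.
b1 = df[df["batch"] == 1]
for factor in ["x1", "x2", "x3"]:
    sds = b1.groupby(factor)["strength"].std(ddof=1)
    print(factor, dict(sds.round(2)))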
Dex Scatter Plot for Batch 2 [plot omitted]
Dex Mean Plot for Batch 2 [plot omitted]
Dex SD Plot for Batch 2
This dex standard deviation plot shows the following for batch 2.
1. The level-to-level differences in the standard deviations are roughly comparable for the three factors (slightly smaller for the feed rate factor).
Interaction The above plots graphically show the main effects. An additional
Effects concern is whether or not there are any significant interaction effects.
Main effects and two-term interaction effects are discussed in the chapter
on Process Improvement.
In the following dex interaction plots, the labels on the plot give the
variables and the estimated effect. For example, factor 1 is TABLE
SPEED and it has an estimated effect of 30.77 (it is actually -30.77 if
the direction is taken into account).
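With the -1/+1 coding shown in the data listing, an estimated main effect is the mean response at the +1 level minus the mean at the -1 level, and a two-factor interaction effect applies the same rule to the product of the two codings. The sketch below, continuing from df, illustrates this for batch 1; exact values may differ slightly from Dataplot's if the design is unbalanced.

import itertools

b1 = df[df["batch"] == 1].copy()

def effect(frame, col):
    # Mean response at the +1 level minus mean response at the -1 level.
    return (frame.loc[frame[col] == 1, "strength"].mean()
            - frame.loc[frame[col] == -1, "strength"].mean())

# Main effects of the three primary factors.
for f in ["x1", "x2", "x3"]:
    print(f, round(effect(b1, f), 2))

# Two-factor interaction effects from products of the codings.
for f1, f2 in itertools.combinations(["x1", "x2", "x3"], 2):
    b1[f1 + f2] = b1[f1] * b1[f2]
    print(f1 + f2, round(effect(b1, f1 + f2), 2))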
DEX Interaction Plot for Batch 1 [plot omitted]
DEX Interaction Plot for Batch 2 [plot omitted]
Conclusions From the above plots, we can draw the following overall conclusions.
1. The batch effect (of approximately 75 units) is the dominant
primary factor.
2. The most important factors differ from batch to batch. See the
above text for the ranked list of factors with the estimated effects.
Click on the links below to start Dataplot and run this case study yourself. Each step may use results from previous steps, so please be patient. Wait until the software verifies that the current step is complete before clicking on the next step.
The links in the results column will connect you with more detailed information about each analysis step from the case study description.
1. Generate a box plot for the labs with the 2 batches combined.
   The box plot does not show a significant lab effect.
2. Generate a box plot for the labs for batch 1 only.
   The box plot does not show a significant lab effect for batch 1.
3. Generate a box plot for the labs for batch 2 only.
   The box plot does not show a significant lab effect for batch 2.
1. Generate a dex scatter plot for batch 1.
   The dex scatter plot shows the range of the points and the presence of outliers.
2. Generate a dex mean plot for batch 1.
   The dex mean plot shows that table speed is the most significant factor for batch 1.
3. Generate a dex sd plot for batch 1.
   The dex sd plot shows that table speed has the most variability for batch 1.
4. Generate a dex scatter plot for batch 2.
   The dex scatter plot shows the range of the points and the presence of outliers.
Bloomfield, Peter (1976), Fourier Analysis of Time Series, John Wiley and Sons.
Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978), Statistics for Experimenters: An
Introduction to Design, Data Analysis, and Model Building, John Wiley and Sons.
Box, G. E. P., and Jenkins, G. (1976), Time Series Analysis: Forecasting and Control,
Holden-Day.
Chakravarti, Laha, and Roy (1967), Handbook of Methods of Applied Statistics, Volume
I, John Wiley and Sons, pp. 392-394.
Chambers, John, William Cleveland, Beat Kleiner, and Paul Tukey, (1983), Graphical
Methods for Data Analysis, Wadsworth.
Cleveland, William and Marylyn McGill, Editors (1988), Dynamic Graphics for
Statistics, Wadsworth.
Draper and Smith (1981), Applied Regression Analysis, 2nd ed., John Wiley and Sons.
Evans, Hastings, and Peacock (2000), Statistical Distributions, 3rd. Ed., John Wiley and
Sons.
Efron and Gong (February 1983), A Leisurely Look at the Bootstrap, the Jackknife, and
Cross Validation, The American Statistician.
Filliben, J. J. (February 1975), The Probability Plot Correlation Coefficient Test for
Normality, Technometrics, pp. 111-117.
Gill, Lisa (April 1997), Summary Analysis: High Performance Ceramics Experiment to
Characterize the Effect of Grinding Parameters on Sintered Reaction Bonded Silicon
Nitride, Reaction Bonded Silicon Nitride, and Sintered Silicon Nitride, presented at the
NIST - Ceramic Machining Consortium, 10th Program Review Meeting, April 10, 1997.
Granger and Hatanaka (1964), Spectral Analysis of Economic Time Series, Princeton
University Press.
Jenkins and Watts, (1968), Spectral Analysis and Its Applications, Holden-Day.
Johnson, Kotz, and Kemp, (1992), Univariate Discrete Distributions, 2nd. Ed., John
Wiley and Sons.
Kuo, Way and Pierson, Marcia Martens, Eds. (1993), Quality Through Engineering
Design, specifically the article Filliben, Cetinkunt, Yu, and Dommenz (1993),
Exploratory Data Analysis Techniques as Applied to a High-Precision Turning Machine,
Elsevier, New York, pp. 199-223.
McNeil, Donald (1977), Interactive Data Analysis, John Wiley and Sons.
Mosteller, Frederick and Tukey, John (1977), Data Analysis and Regression,
Addison-Wesley.
Neter, Wasserman, and Kutner (1990), Applied Linear Statistical Models, 3rd ed., Irwin.
Nelson, Wayne and Doganaksoy, Necip (1992), A Computer Program POWNOR for
Fitting the Power-Normal and -Lognormal Models to Life or Strength Data from
Specimens of Various Sizes, NISTIR 4760, U.S. Department of Commerce, National
Institute of Standards and Technology.
The RAND Corporation (1955), A Million Random Digits with 100,000 Normal
Deviates, Free Press.
Stephens, M. A. (1974), EDF Statistics for Goodness of Fit and Some Comparisons,
Journal of the American Statistical Association, Vol. 69, pp. 730-737.
Stephens, M. A. (1979), Tests of Fit for the Logistic Distribution Based on the Empirical
Distribution Function, Biometrika, Vol. 66, pp. 591-595.
Tufte, Edward (1983), The Visual Display of Quantitative Information, Graphics Press.
Velleman, Paul and Hoaglin, David (1981), The ABC's of EDA: Applications, Basics,
and Computing of Exploratory Data Analysis, Duxbury.