Prob & Stats (Slides) PDF
Why does statistics matter?
Given a sufficiently precise instrument, no two
measurements (taken at different times or locations)
will be exactly the same.
Statistical Measurement Theory
A sample of data refers to a set of data obtained during
repeated measurements of a variable under fixed
operating conditions (measured variable = measurand).
Best Estimate
Uncertainty interval: the range within which the true value will lie with P% probability.
A finite-sized data set can only give an estimate of the true value.
Ex: measuring the diameters of a certain number of manufactured
bearings to estimate the diameter of the whole batch.
Random vs. Systematic Error
For now, we will estimate 𝒙′ and the random error in 𝒙 caused only
by the variation in the data set (the effect of random error, which is
called random uncertainty).
Why Statistical Analysis?
Statistical analysis characterizes experimental data by
determining parameters that specify the central tendency
and the dispersion (spread) of the data.
Common Terms
Random variable: the variables being measured in
an experiment are considered random variables. The
outcome of a measurement is not unique and is
influenced by many uncontrollable (generally
unavoidable) factors.
Random Variables
Population vs. Sample
Population - A total set of all process results
Sample - A subset of a population
Population: standard deviation $\sigma$
Sample: standard deviation $S$
Measures of Central Tendency
Mean – also known as average; sum of all values divided
by number of values
Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Sample variance: $S_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $x_i - \bar{x}$ is the deviation from the mean
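As a quick check of the definitions above, the following sketch computes the sample mean, the sample variance (with the n − 1 divisor), and the sample standard deviation. The function name `sample_stats` and the data values are hypothetical, chosen only for illustration.

```python
import math

def sample_stats(data):
    """Return (mean, variance, standard deviation) of a sample.

    Uses the n - 1 divisor, matching the sample variance definition.
    """
    n = len(data)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean = sum(data) / n
    # Sum of squared deviations from the mean
    ss = sum((x - mean) ** 2 for x in data)
    variance = ss / (n - 1)
    return mean, variance, math.sqrt(variance)

# Hypothetical repeated measurements of one variable
readings = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
x_bar, s2, s = sample_stats(readings)
print(f"mean = {x_bar:.3f}, variance = {s2:.3f}, std dev = {s:.3f}")
```

Dividing by n instead of n − 1 would give the population formula, which slightly underestimates the spread when applied to a small sample.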
[Figure: dot diagram]
Graphical Representation
Histograms visually represent data centering, variability, and
shape.
Histogram
Properties
• All data will fall into a class or bin
• No data will overlap
[Figure: histogram of the sum from rolling two six-sided dice; x-axis: sum of dice (2 to 12), y-axis: frequency]
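The two histogram properties can be illustrated with the dice example. The sketch below enumerates all 36 equally likely outcomes of two six-sided dice and prints the theoretical count for each sum; these are the exact counts, not the measured frequencies plotted in the slide's figure.

```python
from collections import Counter
from itertools import product

# Enumerate all 36 equally likely outcomes of two six-sided dice
sums = [a + b for a, b in product(range(1, 7), repeat=2)]
counts = Counter(sums)

# Each sum is its own bin, so every outcome falls in exactly one bin
for total in range(2, 13):
    print(f"{total:2d} | {'#' * counts[total]} ({counts[total]}/36)")
```

Because every outcome lands in exactly one bin, the bin counts add up to the total number of outcomes.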
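Histogram
[Figure: histogram for large N, with the minimum and maximum values marked]
Effect of data size
[Figure: histograms for data sets of size 5, 50, 100, and 1000]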
Effect of interval numbers
As the bin width tends to zero, the envelope of the
histogram becomes a continuous function that can be
evaluated for any value of x.
Probability Density Function
(𝑓(𝑥) or 𝑝(𝑥))
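A probability density function, f(x), defines the
probability of occurrence of the random variable in
an interval between $x_i$ and $x_i + dx$:
$P(x_i \le x \le x_i + dx) = f(x_i)\,dx$
$P(a \le x \le b) = \int_a^b f(x)\,dx$
$P(-\infty \le x \le \infty) = \int_{-\infty}^{\infty} f(x)\,dx = 1$
$\mu = \int_{-\infty}^{\infty} x\,f(x)\,dx$
$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$
[Figure: example f(x) plotted against x for x from 0 to 20, with f(x) ranging from 0 to 0.2]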
[Figure: distribution with the population mean marked]
[Figure: failure distribution showing infant mortality failures and end-of-life wear-out failures]
Experimentally determined histograms are used to infer guidance on the most likely model for p(x).
Gaussian (Normal) Distribution
Frequently, a stable, controlled process will produce a histogram
that resembles a bell-shaped curve, also known as the
Gaussian (normal) distribution.
[Figure: distribution of birth weight in 3,226 newborn babies (data from O'Cathain et al. 2002)]
Common applications:
Exam scores
Human body temperature
Human birth weight
Dimensional tolerances
Employee performance
Continuous Data
Typically 2 parameters
Location parameter = mean ($\mu$)
Scale parameter = standard deviation ($\sigma$)
PDF: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$
Maximum occurs at 𝑥 = 𝜇
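The Gaussian PDF can be checked numerically against the general PDF properties (unit area, mean, and variance). The sketch below is one way to do that with a simple trapezoidal rule; the parameter values and the helper names `normal_pdf` and `trapezoid` are assumptions made for this illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Gaussian probability density f(x) for mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def trapezoid(f, a, b, steps=20000):
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for k in range(1, steps):
        total += f(a + k * h)
    return total * h

mu, sigma = 10.0, 2.0  # hypothetical parameters
f = lambda x: normal_pdf(x, mu, sigma)
lo, hi = mu - 10 * sigma, mu + 10 * sigma  # wide enough to stand in for (-inf, inf)

area = trapezoid(f, lo, hi)
mean = trapezoid(lambda x: x * f(x), lo, hi)
var = trapezoid(lambda x: (x - mu) ** 2 * f(x), lo, hi)
print(f"area = {area:.6f}, mean = {mean:.6f}, variance = {var:.6f}")
```

The integrals should return values very close to 1, μ, and σ², and the density is largest at x = μ.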
Effect of Standard Deviation (Spread)
Distributions and Probability
Distributions can be linked to probability – making possible predictions
and evaluations of the likelihood of a particular occurrence
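$P(x_1 \le x \le x_2) = \int_{x_1}^{x_2} f(x)\,dx = \int_{x_1}^{x_2} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right] dx$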
Standard Normal (z) Distribution
$x \sim N(\mu, \sigma) \;\rightarrow\; z = \frac{x-\mu}{\sigma} \sim N(0, 1)$
• Probabilities of certain ranges of values and specific
percentiles of interest can be obtained through the
standard normal (z) distribution
Given p(𝑥), how can we predict the probability that any future
measurement will fall within some stated interval of 𝑥 values?
Standard Normal Distribution
Normal ($\mu = 0$, $\sigma = 1$)
Standard(ized) normal variate: $z = \frac{x-\mu}{\sigma}$
Example
For $x = x' \pm (1.0)\sigma$:
$z = \frac{(x' + (1.0)\sigma) - x'}{\sigma} = 1.0$
The corresponding probability is tabulated.
The probability that the ith measured value of x will have a
value between
Example - Heights of U.S. Adults
• Female and Male adult heights are well approximated by
normal distributions: xF~N(63.7,2.5) xM~N(69.1,2.6)
[Figure: histograms of female and male adult heights in inches]
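With the two height distributions stated above, probabilities follow from the standardized variate z. The sketch below uses Python's `statistics.NormalDist`; the 68-inch threshold is a hypothetical value chosen only to show the calculation.

```python
from statistics import NormalDist

# Distributions quoted on the slide: N(mean, standard deviation) in inches
female = NormalDist(mu=63.7, sigma=2.5)
male = NormalDist(mu=69.1, sigma=2.6)

threshold = 68.0  # hypothetical height of interest, in inches

for label, dist in (("female", female), ("male", male)):
    z = (threshold - dist.mean) / dist.stdev  # standardized variate
    p_taller = 1.0 - dist.cdf(threshold)
    print(f"{label}: z = {z:.2f}, P(height > {threshold} in) = {p_taller:.3f}")
```

The same height corresponds to very different z values in the two populations, which is why the resulting probabilities differ so much.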
Example
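[Figure: areas under the standard normal curve: $P(0 \le z \le 1) = 0.3413$, $P(0 \le z \le 2) = 0.4772$, and 0.5 for each half of the curve]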
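The tabulated one-sided areas quoted in this example can be reproduced from the standard normal cumulative distribution. This is a minimal sketch; the helper name `half_table_area` is an assumption for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def half_table_area(z1):
    """Area under the standard normal curve between 0 and z1 (one-sided table value)."""
    return std_normal.cdf(z1) - 0.5

for z1 in (1.0, 2.0, 3.0):
    one_sided = half_table_area(z1)
    print(f"z1 = {z1}: P(0 <= z <= z1) = {one_sided:.4f}, "
          f"P(-z1 <= z <= z1) = {2 * one_sided:.4f}")
```

Doubling the one-sided value gives the probability of falling within ±z1 of the mean, which for z1 = 3 is the 99.73% range used in the three-sigma test.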
Statistics of Finite Data Sets
Estimate the true mean and true variance of a population
from a small sample (inferential statistics).
Example: measuring some samples from a batch of manufactured goods.
Precision interval: $x_i = \bar{x} \pm t_{\nu,P} S_x$ (P%), where $S_x^2$ is the sample variance
The Standard Error
(Standard Deviation of the Means)
• With different samples, we would get different estimates of the
sample mean and sample variance
• Each mean value would be normally distributed about some
central value
The normal distribution tendency
of the sample means about a true
value in the absence of
systematic error.
[Figure, panels (a) to (d): The width of the histogram of the means decreases as the size of the sample used to calculate the
means increases; this is a consequence of averaging over statistical fluctuations.]
Interval Estimation of Population Mean
• The variance of the distribution of mean values
is estimated from a finite data set through the
standard deviation of the means:
$S_{\bar{x}} = \frac{S_x}{\sqrt{n}}$
• The estimate of the true mean value based on
the finite data set is the confidence interval
$\mu = \bar{x} \pm t_{\nu,P} S_{\bar{x}}$ (P%)
Compared to the precision interval: $x_i = \bar{x} \pm t_{\nu,P} S_x$ (P%)
Example
(a) Compute the sample statistics for this data set. (b) Estimate
the interval of values over which 95% of the measurements of x
should be expected to lie. (c) Estimate the true mean value of x
at 95% probability based on this finite data set.
The interval of values in which 95% of the measurements of x should lie is $x_i = \bar{x} \pm t_{19,95} S_x$ (95%).
There is a 95% probability that the value of the 21st data point would lie between 0.69 and 1.35.
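The difference between the precision interval (for individual measurements) and the confidence interval (for the true mean) can be sketched as follows. The sample statistics used here (n = 20, mean 1.02, standard deviation 0.16) are assumed values that are not shown in the surviving slide text; they were picked because they are consistent with the 0.69 to 1.35 interval quoted above. The t value is the standard table entry for ν = 19 at 95%.

```python
import math

def precision_interval(x_bar, s_x, t):
    """Interval x_bar +/- t * S_x expected to contain P% of individual measurements."""
    half = t * s_x
    return x_bar - half, x_bar + half

def confidence_interval(x_bar, s_x, n, t):
    """Interval x_bar +/- t * S_x / sqrt(n) expected to contain the true mean at P%."""
    half = t * s_x / math.sqrt(n)
    return x_bar - half, x_bar + half

# Assumed sample statistics (not shown in the surviving slide text)
n, x_bar, s_x = 20, 1.02, 0.16
t_19_95 = 2.093  # Student's t for nu = n - 1 = 19 at 95%, from a t-table

lo_i, hi_i = precision_interval(x_bar, s_x, t_19_95)
lo_m, hi_m = confidence_interval(x_bar, s_x, n, t_19_95)
print(f"95% of measurements: {lo_i:.2f} to {hi_i:.2f}")
print(f"true mean (95%):     {lo_m:.2f} to {hi_m:.2f}")
```

The interval for the true mean is narrower by a factor of √n, because it describes the scatter of sample means rather than of single measurements.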
Pooled Statistics
Hypothesis Testing
𝐻𝑜 : 𝑥′ = 𝑥𝑜 Null hypothesis
Alternative hypothesis (one of):
𝐻𝑎 : 𝑥′ ≠ 𝑥𝑜 Two-tailed test
𝐻𝑎 : 𝑥′ < 𝑥𝑜 Left-tailed test
𝐻𝑎 : 𝑥′ > 𝑥𝑜 Right-tailed test
𝛼: significance value (level of significance)
Critical values for a hypothesis test at level of significance 𝛼
Is the population 𝜎 known? If yes, use a z-test; if no, use a t-test. Either test needs a (measured) sample.
z-test: the test statistic is the z-variable, $z_0 = \frac{\bar{x} - x_o}{\sigma/\sqrt{n}}$
t-test: the test statistic is the t-variable, $t_0 = \frac{\bar{x} - x_o}{S_x/\sqrt{n}}$
[Figures: do-not-reject regions for the z-test and the t-test]
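A two-tailed z-test can be sketched in a few lines. The function name `z_test_two_tailed` and all numerical inputs below are hypothetical, chosen only to show the mechanics; the critical value comes from the inverse of the standard normal cumulative distribution, which plays the role of the z-table lookup.

```python
import math
from statistics import NormalDist

def z_test_two_tailed(x_bar, x_o, sigma, n, alpha=0.05):
    """Two-tailed z-test of H0: x' = x_o when the population sigma is known.

    Returns (z0, z_critical, reject_h0).
    """
    z0 = (x_bar - x_o) / (sigma / math.sqrt(n))
    # Critical value leaves alpha/2 in each tail
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z0, z_crit, abs(z0) > z_crit

# Hypothetical numbers for illustration only
z0, z_crit, reject = z_test_two_tailed(x_bar=2.03, x_o=2.0, sigma=0.06, n=25)
print(f"z0 = {z0:.3f}, critical z = +/-{z_crit:.3f}, reject H0: {reject}")
```

When σ is not known, the same structure applies with the sample standard deviation in place of σ and a t-table value (for ν = n − 1) in place of the z critical value.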
Hypothesis testing steps
1. Establish the null hypothesis 𝐻𝑜 : 𝑥′ = 𝑥𝑜 , where 𝑥𝑜 is the
population or target value,
Example:
Null Hypothesis
𝐻𝑜 : 𝑥 ′ = 𝑥𝑜 = 2.0 mm
Example:
Acceptance Region(s)
Two-tailed test with an area of 0.025 in each tail: look up the area 0.5 − 0.025 = 0.475 in the z-table, which gives critical values of $z = \pm 1.96$.
Example:
Observed Test Statistic
The observed test statistic is larger than the critical value.
Reject the null hypothesis (i.e., the difference between the means of the
population and sample is larger than it would be by just random chance).
Example:
Null Hypothesis
𝐻𝑜 : 𝑥′ = 𝑥𝑜 = 180.0 yards
Example (cont’d)
At the 0.95 probability level:
Do NOT reject the null hypothesis (i.e., the difference between the means of the
population and sample is NOT larger than it would be by just random chance).
The club meets the customer’s needs.
Chi-Square Testing
The test provides a measure of the discrepancy between
the measured variation in a quantity (characterized by 𝑆𝑥 )
and the variation predicted by the assumed probability
distribution function (characterized by 𝜎).
[Figure: chi-squared distribution and its dependency on the degrees of freedom (DOF); the area 𝜶 is measured from the right]
Inference of Population Variance
𝜶 is measured
from the right.
Confidence interval for population
variance
Example: A sample of 20 ball bearings is chosen and measured,
with sample mean 0.32500 in and S.D. = 0.00010 in. Determine
a 95% CI for the standard deviation of the production batch
(population; not known).
$\nu = n - 1 = 19$
$\alpha/2 = 0.025$, $1 - \alpha/2 = 0.975$
$\chi^2_{19,\,0.025} = 32.9$, $\chi^2_{19,\,0.975} = 8.91$
$19(0.00010^2)/32.9 \le \sigma^2 \le 19(0.00010^2)/8.91$
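The ball-bearing interval can be evaluated numerically. The sketch below takes the chi-squared values from the example as inputs (they are table lookups, not computed here); the function name `variance_interval` is an assumption for illustration.

```python
import math

def variance_interval(s_x, n, chi2_upper, chi2_lower):
    """Confidence interval for the population variance.

    chi2_upper is the table value for alpha/2 and chi2_lower the value for
    1 - alpha/2, both at nu = n - 1 degrees of freedom.
    """
    nu = n - 1
    var_lo = nu * s_x ** 2 / chi2_upper
    var_hi = nu * s_x ** 2 / chi2_lower
    return var_lo, var_hi

# Values from the ball-bearing example (chi-squared values from the table)
var_lo, var_hi = variance_interval(s_x=0.00010, n=20, chi2_upper=32.9, chi2_lower=8.91)
sd_lo, sd_hi = math.sqrt(var_lo), math.sqrt(var_hi)
print(f"{var_lo:.3e} <= sigma^2 <= {var_hi:.3e} (in^2)")
print(f"{sd_lo:.2e} <= sigma <= {sd_hi:.2e} (in)")
```

Taking square roots of the variance bounds gives the interval for the standard deviation, roughly 0.000076 in to 0.000146 in.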
Goodness-of-fit Test
The chi-squared test provides a measure of the discrepancy between the
measured variation of a data set and the variation predicted by the
assumed density function.
Regression Analysis
We can use regression analysis to establish a functional
relationship between the dependent variable and the
independent variable.
$D = \sum_{i=1}^{N} (y_i - y_{ci})^2$
Least Squares Regression
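There is some deviation of the data from the polynomial, $y_i - y_{ci}$.
Can calculate the standard error of the fit:
$S_{yx} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - y_{ci})^2}{\nu}}$, where $\nu = N - (m + 1)$ and $m$ is the order of the polynomial
Higher-order curves reduce $S_{yx}$, but they are …
$y_c \pm t_{\nu,P} \frac{S_{yx}}{\sqrt{N}}$ (P% confidence interval)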
Least-Square Linear Fit
$y_c = a_0 + a_1 x$
• Minimizing the sum of the squares of the error:
$D = \sum_{i=1}^{N} (y_i - y_{ci})^2$
$a_0 = \frac{\sum x_i \sum x_i y_i - \sum x_i^2 \sum y_i}{\left(\sum x_i\right)^2 - N \sum x_i^2}$
$a_1 = \frac{\sum x_i \sum y_i - N \sum x_i y_i}{\left(\sum x_i\right)^2 - N \sum x_i^2}$
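The closed-form coefficients translate directly into code. The sketch below implements the two sums-based formulas for the intercept and slope; the function name `linear_fit` and the data (points lying exactly on y = 1 + 2x) are hypothetical, chosen so the expected result is obvious.

```python
def linear_fit(xs, ys):
    """Least-squares line y_c = a0 + a1 * x from the closed-form sums."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = sx * sx - n * sxx
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")
    a0 = (sx * sxy - sxx * sy) / denom
    a1 = (sx * sy - n * sxy) / denom
    return a0, a1

# Hypothetical data lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a0, a1 = linear_fit(xs, ys)
print(f"y_c = {a0:.4f} + {a1:.4f} x")
```

The denominator is zero only when every x value is the same, in which case no unique line exists.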
Example
Regression analysis
• To evaluate how well the relationship between
independent and dependent variables can be described
by a linear relationship, we can determine the coefficient of
determination, $r^2$ ($r$ is the correlation coefficient)
• Indicative of how well the variance in y is accounted for
by the fit
$r^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - y_{ci})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$, with $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$
The numerator is the sum of squared data residuals; if it is equal to zero, the fit is perfect.
If $r^2$ is zero, there is no improvement over simply picking the mean.
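Both measures of fit quality, the coefficient of determination and the standard error of the fit, use the same sum of squared residuals. The sketch below computes them together; the function name `fit_quality`, the data, and the line y_c = 2x used to generate predicted values are all hypothetical (the line is a stand-in, not a least-squares result).

```python
import math

def fit_quality(ys, ycs, m=1):
    """Return (r_squared, s_yx) for measured ys and fitted values ycs.

    m is the order of the fitted polynomial, so nu = N - (m + 1).
    """
    n = len(ys)
    y_bar = sum(ys) / n
    ss_res = sum((y - yc) ** 2 for y, yc in zip(ys, ycs))  # squared residuals
    ss_tot = sum((y - y_bar) ** 2 for y in ys)             # spread about the mean
    nu = n - (m + 1)
    r_squared = 1.0 - ss_res / ss_tot
    s_yx = math.sqrt(ss_res / nu)
    return r_squared, s_yx

# Hypothetical measurements and the values predicted by a line y_c = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
ycs = [2.0 * x for x in xs]
r2, s_yx = fit_quality(ys, ycs)
print(f"r^2 = {r2:.4f}, S_yx = {s_yx:.4f}")
```

A value of r² close to 1 means the residuals are small compared with the spread of the data about its mean, while S_yx expresses the typical residual in the units of y.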
The coefficient of determination, $r^2$ ($r$ is the correlation coefficient),
only indicates an association between the dependent and
independent variables.
Example
Compute the coefficient of determination ($r^2$) and the standard
error of the fit for the data.
$r^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - y_{ci})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$, with $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$
99% of the variance in y is accounted for by the fit, whereas only 1% is unaccounted for.
Another Example
$x_i$    $x_i^2$    $y_i$    $x_i y_i$    $y_i^2$
0        0          0.05     0            0.0025
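$a_0 = \frac{\sum x_i \sum x_i y_i - \sum x_i^2 \sum y_i}{\left(\sum x_i\right)^2 - N \sum x_i^2} = 0.0295$
$a_1 = \frac{\sum x_i \sum y_i - N \sum x_i y_i}{\left(\sum x_i\right)^2 - N \sum x_i^2} = 0.9977$
$y_c = 0.0295 + 0.9977x$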
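Standard error of the fit ($S_{yx}$) and the coefficient
of determination ($r^2$):
$S_{yx} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - y_{ci})^2}{\nu}} = 0.0278$, with $\nu = N - (m + 1)$
$r^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - y_{ci})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} = 0.999286$, with $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$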
[Figure: standard normal curve with area $P(z_0)$ on each side of the mean out to $\pm z_0$; the remaining area is $1 - 2 P(z_0)$]
Data Outlier Detection: assumes normal distribution
Three-sigma Test
• For large data sets (n > 10), the three-sigma test identifies as
potential outliers those data points that lie outside the range of
99.73% probability of occurrence.
$z_0 = \frac{|x_i - \bar{x}|}{S_x}$
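A simple version of the outlier check can be sketched by computing z0 for every point and flagging those beyond a chosen limit. This is a simplified illustration that assumes normally distributed data and a fixed limit of 3 (the 99.73% range); the function name `flag_outliers` and the data set are hypothetical.

```python
import math

def flag_outliers(data, z_limit=3.0):
    """Return (z0 values, flagged points) using z0 = |x_i - mean| / S_x.

    Assumes the data are normally distributed; z_limit = 3 corresponds to
    the 99.73% range of a normal distribution.
    """
    n = len(data)
    mean = sum(data) / n
    s_x = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    z0 = [abs(x - mean) / s_x for x in data]
    flagged = [x for x, z in zip(data, z0) if z > z_limit]
    return z0, flagged

# Hypothetical data set with one suspicious reading (the last value)
data = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.1,
        9.9, 10.0, 10.2, 9.8, 10.0, 13.0]
z0, flagged = flag_outliers(data)
print("potential outliers:", flagged)
```

Note that the suspicious point itself inflates both the mean and the standard deviation, so a flagged value is only a candidate for closer inspection, not proof of a bad reading.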
Number of Measurements
For a desired confidence interval half-width $d$, the required number of
measurements is estimated by
$N \approx \left(\frac{t_{\nu,P}\, S_x}{d}\right)^2$ (P%)
[Worked examples using the t-table: $t = 2.086$; results 124 and 103]
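The sample-size estimate follows from setting the confidence interval half-width, t·S_x/√N, equal to the desired value d and solving for N. The sketch below applies that relation. The t value 2.086 is the table entry for ν = 20 at 95%; the standard deviation (160 units), the half-width (30 units), and the count of 21 preliminary measurements are assumed inputs that do not appear in the surviving slide text.

```python
import math

def required_measurements(t, s_x, d):
    """Estimate N from N ~ (t * S_x / d)^2, rounded up to a whole measurement.

    d is the desired half-width of the confidence interval on the mean.
    """
    return math.ceil((t * s_x / d) ** 2)

# t is the t-table value for nu = 20 at 95%;
# S_x = 160 and d = 30 are assumed inputs, not shown in the surviving text
n_preliminary = 21
n_total = required_measurements(t=2.086, s_x=160.0, d=30.0)
n_additional = n_total - n_preliminary
print(f"total = {n_total}, additional = {n_additional}")
```

Because t depends on the number of measurements through ν, the estimate is usually made from a small preliminary sample and then refined as more data are collected.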
Monte Carlo Simulation
After $N_{TOT}$ trials, let $N_{IN}$ denote the number of points inside the circle.
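Estimating 𝜋:
$\frac{A_c}{A_s} = \frac{\pi}{4}$, where $A_c$ is the area of the circle and $A_s$ is the area of the enclosing square, so $\pi \approx 4\, N_{IN} / N_{TOT}$
𝜋 = 3.14159265359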
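The area-ratio argument leads to a very short simulation. The following sketch draws random points in a square and counts how many fall inside the inscribed circle; the function name `estimate_pi` and the fixed seed are assumptions for illustration, and the seed is set only so that repeated runs give the same number.

```python
import random

def estimate_pi(n_tot, seed=None):
    """Estimate pi from the fraction of random points that land inside a circle."""
    rng = random.Random(seed)
    n_in = 0
    for _ in range(n_tot):
        # Random point in the square [-1, 1] x [-1, 1]
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # inside the inscribed unit circle
            n_in += 1
    # A_c / A_s = pi / 4, so pi ~ 4 * N_IN / N_TOT
    return 4.0 * n_in / n_tot

pi_hat = estimate_pi(100_000, seed=1)  # seed fixed only for repeatability
print(f"pi estimate after 100K trials: {pi_hat:.5f}")
```

The statistical error of the estimate shrinks in proportion to 1/√N_TOT, so each additional correct digit costs roughly 100 times more trials.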
Example
[Figure: MS Excel implementation]
[Figure: sample results (100K iterations)]
clc; clear;
% Known distributions of the input(s)
% Resistance in Ohm (normal distribution)
R_mean = 1000; R_std = 100;
% Current in Ampere (uniform distribution)
I_min = 0.095; I_max = 0.105;
% Monte Carlo stopping criterion (tolerance)
TOL = input('Enter the tolerance for standard deviation check: ');
max_iter = input('Enter the maximum number of iterations: ');
% No input from the user? Use default values.
if isempty(TOL), TOL = 1e-15; end
if isempty(max_iter), max_iter = 1e5; end
% Preallocate the result arrays instead of growing them inside the loop
I = zeros(1, max_iter); R = zeros(1, max_iter);
MCS = zeros(1, max_iter); MCS_std = zeros(1, max_iter);
for i = 1:max_iter
    I(i) = I_min + rand()*(I_max - I_min);   % current (uniform)
    R(i) = R_mean + R_std*randn();           % resistance (normal); equivalent to normrnd, no toolbox needed
    MCS(i) = I(i) * R(i);                    % voltage, V = I*R
    MCS_std(i) = std(MCS(1:i));              % running standard deviation
    % Stop once the relative change in the standard deviation is below TOL
    % (MCS_std(1) is 0, so the check starts at i = 3)
    if i > 2 && abs(MCS_std(i) - MCS_std(i-1))/MCS_std(i-1) < TOL
        break;
    end
end
% Keep only the samples that were actually generated
I = I(1:i); R = R(1:i); MCS = MCS(1:i);
% Total number of iterations
if i == max_iter
    fprintf(1, '\nUser-specified maximum number of iterations (%d) has been reached!\n', i);
else
    fprintf(1, '\nTotal number of iterations: %d\n', i);
end
fprintf(1, '\nTolerance value (TOL): %g\n', TOL);
figure();
subplot(3,1,1);
histogram(R);
title(sprintf('Number of iterations: %d', i));
xlabel(sprintf('Resistance (Ohm)\n Mean = %f Ohm; Stdev = %f Ohm', mean(R), std(R)));
subplot(3,1,2);
histogram(I);
xlabel(sprintf('Current (A)\n Mean = %f A; Stdev = %f A', mean(I), std(I)));
subplot(3,1,3);
histogram(MCS);
xlabel(sprintf('Voltage (V)\n Mean = %f V; Stdev = %f V', mean(MCS), std(MCS)));
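The simulated voltage statistics can be compared against values computed directly from the input distributions. The Python sketch below is an independent cross-check, not part of the MATLAB script; it assumes the current and resistance are independent, which is how the script samples them.

```python
import math

# Input distributions used in the MATLAB script
r_mean, r_std = 1000.0, 100.0   # resistance, normal (Ohm)
i_min, i_max = 0.095, 0.105     # current, uniform (A)

# Mean and variance of a uniform distribution on [i_min, i_max]
i_mean = 0.5 * (i_min + i_max)
i_var = (i_max - i_min) ** 2 / 12.0

# For independent I and R:
#   E[V]   = E[I] * E[R]
#   Var[V] = E[I]^2 Var[R] + E[R]^2 Var[I] + Var[I] Var[R]
v_mean = i_mean * r_mean
v_var = i_mean ** 2 * r_std ** 2 + r_mean ** 2 * i_var + i_var * r_std ** 2
v_std = math.sqrt(v_var)
print(f"expected voltage: mean = {v_mean:.3f} V, std dev = {v_std:.3f} V")
```

A long Monte Carlo run should report a voltage mean near 100 V and a standard deviation near 10.4 V; most of the spread comes from the resistance, since its relative uncertainty is much larger than that of the current.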
[Figure: another run, 100K iterations]
[Figure: yet another run, 100K iterations]