Lecture 1
By
Dr. Abhijat Arun Abhyankar
Associate Professor
Course Contents
Sampling and data collection; Design of experiments: completely
randomized design, randomized block design and Latin square
design. Probability distributions: Bayes' rule, binomial, Poisson,
uniform, hypergeometric, multinormal, gamma, beta, chi-square
and F distributions; statistical estimation; hypothesis tests: z, t,
chi-square and ANOVA; Multivariate analysis: multiple regression,
discriminant analysis, multidimensional scaling and factor
analysis; Research process: identification of problem, formulation
of objectives, hypotheses; Research design: characteristics of good
research design; measurement scales: Likert, semantic differential
and Stapel scales; Surveys: types and techniques, questionnaire
design, data analysis; writing and presentation of research
reports; ethics.
Session Plan
Lecture No. Topic to be covered
1 Introductory lecture
2 Normal distribution
3 Statistical estimation
4 Hypothesis testing
outliers
Degree of freedom
Correlation coefficient, r
The quantity r, called the linear correlation coefficient, is a
descriptive measure of the strength of the linear association between
two variables. The linear correlation coefficient is sometimes
referred to as the Pearson product moment correlation coefficient
in honor of its developer, Karl Pearson. For n paired observations
(x, y), the mathematical formula for computing r is:
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$
Correlation Coefficient (contd.)
The value of r is such that -1 ≤ r ≤ +1.
The + and - signs are used for positive linear correlations and negative linear correlations, respectively.
Positive correlation: If x and y have a strong positive linear correlation, r is close to +1. An r value of exactly +1
indicates a perfect positive fit. Positive values indicate a relationship between the x and y variables such that as
values for x increase, values for y also increase.
Negative correlation: If x and y have a strong negative linear correlation, r is close to -1. An r value of exactly -1
indicates a perfect negative fit. Negative values indicate a relationship between x and y such that as values for x
increase, values for y decrease.
No correlation: If there is no linear correlation or a weak linear correlation, r is close to 0. A value near zero
means that there is no linear relationship between the two variables; the points may be randomly scattered or may follow a nonlinear pattern.
Note that r is a dimensionless quantity; that is, it does not depend on the units employed.
A perfect correlation of ±1 occurs only when the data points all lie exactly on a straight line. If r = +1, the slope of
this line is positive. If r = -1, the slope of this line is negative.
A correlation greater than 0.8 in absolute value is generally described as strong, whereas a correlation less than 0.5
in absolute value is generally described as weak.
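The properties listed above can be checked numerically. The following is a minimal Python sketch that uses the standard computational form of Pearson's r; the function name `pearson_r` and the data points are made up for illustration and are not course data.

```python
from math import sqrt

def pearson_r(x, y):
    """Linear correlation coefficient r for paired observations (x, y)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Made-up illustrative points (an assumption, not course data).
x = [1, 2, 3, 4, 5]
rising = [3, 5, 7, 9, 11]    # lies exactly on y = 1 + 2x
falling = [9, 7, 5, 3, 1]    # lies exactly on y = 11 - 2x

r_up = pearson_r(x, rising)
r_down = pearson_r(x, falling)
print(r_up, r_down)
```

Because both made-up data sets lie exactly on a straight line, the sketch should give r = +1 for the rising line and r = -1 for the falling line.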
Coefficient of Determination, r², R²
The coefficient of determination, r², gives the proportion of the variance
(fluctuation) of one variable that is predictable from the other variable.
It measures how closely the data points cluster around the line of best fit.
For example, if r = 0.922, then r² = 0.850, which means that
85% of the total variation in y can be explained by the linear relationship between x
and y (as described by the regression equation). The other 15% of the total
variation in y remains unexplained.
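The regression equation (line of best fit) is
$$ Y = a + bX $$
where a is the Y-intercept and b is the slope:
$$ b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}, \qquad a = \bar{Y} - b\bar{X} $$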
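As a rough illustration of the slope and intercept formulas, the Python sketch below fits a line and then computes r² as the share of the variation in Y explained by the line. The helper names `fit_line` and `r_squared` and the five data points are assumptions made for this example only.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum(u * v for u, v in zip(x, y))
    sum_x2 = sum(u * u for u in x)
    b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
    a = mean_y - b * mean_x
    return a, b

def r_squared(x, y, a, b):
    """Proportion of the total variation in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_total = sum((v - mean_y) ** 2 for v in y)
    ss_unexplained = sum((v - (a + b * u)) ** 2 for u, v in zip(x, y))
    return 1 - ss_unexplained / ss_total

# Made-up illustrative points (an assumption, not course data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
r2 = r_squared(x, y, a, b)
print(round(a, 3), round(b, 3), round(r2, 3))  # 2.2 0.6 0.6
```

For these made-up points the fitted line is Y = 2.2 + 0.6X and r² = 0.6, so 60% of the variation in Y is explained and 40% remains unexplained.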
Rank correlation
Rank correlation is a measure of the correlation that exists between two sets of
ranks. It gives a measure of the degree of association between variables that we
would not have been able to calculate otherwise, because it is based on the ranks
of the observations, not the numerical values of the data.
This measure is called the Spearman rank correlation coefficient, in honor of the
statistician who developed it in the early 1900s.
Coefficient of rank correlation:
$$ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$
where:
r_s = coefficient of rank correlation (notice that the subscript s, from
Spearman, distinguishes this r from the one we calculated earlier)
n = number of paired observations
Σ = notation meaning "the sum of"
d = difference between the ranks for each pair of observations
Problem
Ranking of eleven cities
CITY   AIR-QUALITY RANK (1)   PULMONARY-DISEASE RANK (2)
A       4                      5
B       7                      4
C       9                      7
D       1                      3
E       2                      1
F      10                     11
G       3                      2
H       5                     10
I       6                      8
J       8                      6
K      11                      9
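One way to work this problem is sketched below in Python, applying the rank correlation formula r_s = 1 - 6Σd² / (n(n² - 1)) to the two rank columns. The variable names are illustrative choices only.

```python
# Ranks for cities A..K, copied from the table above.
air_quality_rank = [4, 7, 9, 1, 2, 10, 3, 5, 6, 8, 11]
pulmonary_disease_rank = [5, 4, 7, 3, 1, 11, 2, 10, 8, 6, 9]

n = len(air_quality_rank)

# d = difference between the two ranks for each city.
d = [r1 - r2 for r1, r2 in zip(air_quality_rank, pulmonary_disease_rank)]
sum_d2 = sum(diff ** 2 for diff in d)

# Spearman rank correlation coefficient.
r_s = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

print(sum_d2)          # 58
print(round(r_s, 3))   # 0.736
```

With Σd² = 58 and n = 11, the sketch gives r_s ≈ 0.736, a fairly strong positive association between the two rankings.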
Interpretation
    Y    X1    X2      X1Y      X2Y     X1X2      X1²    X2²     Y²
  (1)   (2)   (3)  (2)x(1)  (3)x(1)  (2)x(3)     (2)²   (3)²   (1)²
   29    45    16    1,305      464      720    2,025    256    841
   24    42    14    1,008      336      588    1,764    196    576
   27    44    15    1,188      405      660    1,936    225    729
   25    45    13    1,125      325      585    2,025    169    625
   26    43    13    1,118      338      559    1,849    169    676
   28    46    14    1,288      392      644    2,116    196    784
   30    44    16    1,320      480      704    1,936    256    900
   28    45    16    1,260      448      720    2,025    256    784
   28    44    15    1,232      420      660    1,936    225    784
   27    43    15    1,161      405      645    1,849    225    729
  272   441   147   12,005    4,013    6,485   19,461  2,173  7,428   (column totals)
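The column totals (with n = 10) give the three normal equations:
(1): 272 = 10a + 441b1 + 147b2
(2): 12,005 = 441a + 19,461b1 + 6,485b2
(3): 4,013 = 147a + 6,485b1 + 2,173b2
Step 1. Multiply Equation (1) by -441 and Equation (2) by 10, then add the
results. This eliminates a and produces Equation (4).
(1) x (-441): -119,952 = -4,410a - 194,481b1 - 64,827b2
(2) x (10)  :  120,050 =  4,410a + 194,610b1 + 64,850b2
(4)         :       98 = 129b1 + 23b2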
Step 2. Multiply Equation (1) by -147 and Equation (3) by 10, then add the results.
This eliminates a and produces Equation (5).
(1) x (-147): -39,984 = -1,470a - 64,827b1 - 21,609b2
(3) x (10)  :  40,130 =  1,470a + 64,850b1 + 21,730b2
(5)         :     146 = 23b1 + 121b2
Step 3. Multiply Equation (4) by -23 and Equation (5) by 129, then add the results
to eliminate b1. This produces Equation (6), which can be solved for b2.
(4) x (-23): -2,254 = -2,967b1 - 529b2
(5) x (129): 18,834 = 2,967b1 + 15,609b2
(6)        : 16,580 = 15,080b2
b2 = 1.099
Step 4. Find the value of b1 by substituting the value for b2 into Equation (4):
(4): 98 = 129b1 + 23b2
98 = 129b1 + (23)(1.099)
98 = 129b1 + 25.277
72.723 = 129b1
b1 = 0.564
Step 5. Substitute the values of b1 and b2 into Equation (1) to determine the
value of a:
(1): 272 = 10a + 441b1 + 147b2
272 = 10a + (441)(0.564) + (147)(1.099)
272 = 10a + 248.724 + 161.553
- 138.277 = 10a
a= - 13.828
Step 6. Substitute the values of a, b1, and b2 into the general two-variable
regression equation (Equation A). The resulting Equation (7) describes the
relationship among the number of field-audit labor hours, the number of
computer hours, and the unpaid taxes discovered by the auditing division.
(A): Ŷ = a + b1X1 + b2X2
(7): Ŷ = -13.828 + 0.564X1 + 1.099X2
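The hand elimination in Steps 1 to 5 can be cross-checked by solving the three normal equations directly. The Python sketch below builds them from the column totals of the table (n = 10) and solves them with a small Gauss-Jordan routine; the helper `solve_linear_system` is a hypothetical name written for this example, not part of the course material.

```python
from fractions import Fraction

def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gauss-Jordan elimination (exact fractions)."""
    n = len(rhs)
    rows = [[Fraction(v) for v in row] + [Fraction(r)]
            for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Move a row with a non-zero entry in this column into pivot position.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [v / pivot_value for v in rows[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

# Normal equations built from the column totals:
#   272    = 10a  + 441b1    + 147b2
#   12,005 = 441a + 19,461b1 + 6,485b2
#   4,013  = 147a + 6,485b1  + 2,173b2
coefficients = [
    [10, 441, 147],
    [441, 19461, 6485],
    [147, 6485, 2173],
]
totals = [272, 12005, 4013]

a, b1, b2 = (float(v) for v in solve_linear_system(coefficients, totals))
print(round(a, 3), round(b1, 3), round(b2, 3))
```

The slopes agree with the hand calculation (b1 ≈ 0.564, b2 ≈ 1.099). The intercept comes out near -13.820 rather than -13.828 because the hand calculation substitutes the rounded values of b1 and b2.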
Y X1 X2
2 1 0
8 3 4
5 2 1
6 3 3
12 5 3
19 8 8
Radio station WILD is contemplating a new contest that will require listeners
to call the station and guess the identity of a secret spy. WILD hopes that the
contest will capture a larger share of the listening market. Prizes and the
number of times per day that calls will be accepted have yet to be determined.
The past 5 contests that WILD has run have yielded the following data:
X1 = NUMBER OF CALLS PER DAY
X2 = TOTAL PRIZES ($)
Y = % OF LISTENING MARKET DURING CONTEST
X1     X2     Y
15     15     39
 8     3.5    23
19     5      28
24     10     35
10     1.5    23
a) Calculate the least squares equation that best relates these 3 variables.
b) If WILD takes 13 calls per day and each prize is worth $7.50, what market
share should be expected during the contest?
Coefficient of Multiple Regression
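In this graph of the standard normal curve, we have indicated the areas as follows:
-1 ≤ Z ≤ 1: 68.27%
-2 ≤ Z ≤ 2: 95.45%
-3 ≤ Z ≤ 3: 99.73%
This means that 68.27% of the scores lie within 1 standard deviation of the mean,
95.45% within 2 standard deviations, and 99.73% within 3 standard deviations.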
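These percentages can be reproduced from the standard normal distribution. A minimal Python sketch, assuming Z is a standard normal variable and using the identity P(-k ≤ Z ≤ k) = erf(k/√2); the function name `area_within` is an illustrative choice:

```python
from math import erf, sqrt

def area_within(k):
    """P(-k <= Z <= k) for a standard normal variable Z."""
    return erf(k / sqrt(2))

# Percentage of scores within 1, 2 and 3 standard deviations of the mean.
areas = {k: round(100 * area_within(k), 2) for k in (1, 2, 3)}
print(areas)  # {1: 68.27, 2: 95.45, 3: 99.73}
```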
Example: Compute the mean, median and mode of the
starting salaries of the business college graduates.
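A minimal sketch of this example in Python, using the twelve starting salaries that appear in the percentile and standard deviation problems of this lecture and the standard-library `statistics` module:

```python
from statistics import mean, median, mode

# The twelve monthly starting salaries used in this lecture.
salaries = [2850, 2950, 3050, 2880, 2755, 2710,
            2890, 3130, 2940, 3325, 2920, 2880]

salary_mean = mean(salaries)      # sum of the values divided by n
salary_median = median(salaries)  # average of the two middle values (n is even)
salary_mode = mode(salaries)      # most frequently occurring value

print(salary_mean, salary_median, salary_mode)
```

The mean works out to 2940, the median to 2905 (the average of 2890 and 2920), and the mode to 2880, the only salary that occurs twice.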
Percentile
The pth percentile is a value such that at least p percent of
the observations are less than or equal to this value and at
least (100-p) percent of the observations are greater than or
equal to this value
Calculating the pth percentile
Step 1: Arrange the data in ascending order (smallest to largest
value).
Step 2: Compute the index i = (p/100)n, where p is the percentile of
interest and n is the number of observations.
Step 3: (a) If i is not an integer, round up. The next integer greater
than i denotes the position of the pth percentile.
(b) If i is an integer, the pth percentile is the average of the values
in positions i and i+1.
Problem: Calculate the 85th and 50th
percentiles for the starting salary data
Step 1: Arrange the data in ascending order
2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325
85th percentile:
Step 2: i = (p/100)n = (85/100)12 = 10.2
Step 3: Because i is not an integer, round up. The position of the 85th percentile is the
next integer greater than 10.2, the 11th position, so the 85th percentile is 3130.
50th percentile:
Step 2: i = (p/100)n = (50/100)12 = 6
Step 3: Because i is an integer, the 50th percentile is the average of the sixth and seventh values;
thus the 50th percentile is (2890 + 2920)/2 = 2905. Note that the 50th percentile is also the median.
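The three-step procedure can be sketched as a small Python function. The name `percentile` is an illustrative choice, and the sketch assumes 0 < p < 100 so that position i + 1 always exists.

```python
from math import ceil

def percentile(data, p):
    """pth percentile using the three-step rule (assumes 0 < p < 100)."""
    ordered = sorted(data)                    # Step 1: ascending order
    n = len(ordered)
    i = p * n / 100                           # Step 2: index i = (p/100)n
    if i != int(i):                           # Step 3(a): round up to the next position
        return ordered[ceil(i) - 1]           # positions are 1-based, lists are 0-based
    i = int(i)                                # Step 3(b): average positions i and i+1
    return (ordered[i - 1] + ordered[i]) / 2

salaries = [2710, 2755, 2850, 2880, 2880, 2890,
            2920, 2940, 2950, 3050, 3130, 3325]

p85 = percentile(salaries, 85)
p50 = percentile(salaries, 50)
print(p85, p50)  # 3130 2905.0
```

Note that library routines often interpolate between positions, so their percentiles can differ slightly from the values produced by this rule.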
Quartiles
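Quartiles are just specific percentiles, so the steps for
computing percentiles can be applied directly to the
computation of quartiles. The quartiles divide the values
into four parts: Q1 is the 25th percentile, Q2 is the 50th
percentile (the median), and Q3 is the 75th percentile.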
2710
2755
2850 Q1= 2865
2880
2880
2890 Q2= 2905 (Median)
2920
2940
2950 Q3= 3000
3050
3130
3325
Box plot
Draw a box with its ends at the 25th and 75th percentiles (Q1 and Q3).
Draw a thin line (whisker) from the 75th percentile up to the maximum
value.
Draw another thin line from the 25th percentile down to the minimum
value.
The length of the box in a box plot, i.e., the distance between the 25th and
75th percentiles, is known as the interquartile range (IQR). You can use this
box length to detect outliers.
If any value in the series lies more than 1.5 times the length of the box
below Q1 or above Q3, then we have evidence of outliers.
Box plot of the starting salary data with lines showing the
lower and upper limits
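The lower and upper limits for the starting salary data can be computed from the quartiles given earlier (Q1 = 2865, Q3 = 3000). The Python sketch below assumes the usual convention that the limits lie 1.5 box lengths (IQR) below Q1 and above Q3; variable names are illustrative.

```python
salaries = [2710, 2755, 2850, 2880, 2880, 2890,
            2920, 2940, 2950, 3050, 3130, 3325]

q1, q3 = 2865, 3000            # quartiles from the quartile calculation above
iqr = q3 - q1                  # length of the box (interquartile range)

lower_limit = q1 - 1.5 * iqr   # values below this are flagged as outliers
upper_limit = q3 + 1.5 * iqr   # values above this are flagged as outliers

outliers = [s for s in salaries if s < lower_limit or s > upper_limit]
print(iqr, lower_limit, upper_limit, outliers)  # 135 2662.5 3202.5 [3325]
```

Under this convention, only the salary of 3325 falls outside the limits.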
e.g. starting salary data
Problem: Compute the sample standard deviation for the starting salary data
Columns: monthly salary (xi); sample mean (x̄); deviation about the mean (xi - x̄); squared deviation about the mean (xi - x̄)²
xi      x̄       xi - x̄    (xi - x̄)²
2850    2940     -90        8,100
2950    2940      10          100
3050    2940     110       12,100
2880    2940     -60        3,600
2755    2940    -185       34,225
2710    2940    -230       52,900
2890    2940     -50        2,500
3130    2940     190       36,100
2940    2940       0            0
3325    2940     385      148,225
2920    2940     -20          400
2880    2940     -60        3,600
Σ(xi - x̄)² = 301,850
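To finish the problem, the sum of squared deviations is divided by n - 1 to give the sample variance, and the square root of that is the sample standard deviation. A minimal Python sketch of the calculation (variable names are illustrative):

```python
from math import sqrt

salaries = [2850, 2950, 3050, 2880, 2755, 2710,
            2890, 3130, 2940, 3325, 2920, 2880]

n = len(salaries)
sample_mean = sum(salaries) / n

# Squared deviation about the mean for each salary, then their total.
sum_sq_dev = sum((x - sample_mean) ** 2 for x in salaries)

sample_variance = sum_sq_dev / (n - 1)   # divide by n - 1 for a sample
sample_std_dev = sqrt(sample_variance)

print(sum_sq_dev, round(sample_variance, 2), round(sample_std_dev, 2))
```

This gives a sample variance of about 27,440.91 and a sample standard deviation of about 165.65.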
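Z score
$$ z_i = \frac{x_i - \bar{x}}{s} $$
where x̄ is the sample mean and s is the sample standard deviation.
In the rows below, x̄ = 44 and s = 8:
xi     xi - x̄    zi = (xi - x̄)/s
54      10        10/8 = 1.25
42      -2        -2/8 = -0.25
46       2         2/8 = 0.25
32     -12       -12/8 = -1.50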
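The rows above can be reproduced with a one-line function. In this Python sketch the mean of 44 and the standard deviation of 8 are taken as implied by the table (each deviation is xi - 44 and each is divided by 8); the function name `z_score` is illustrative.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

mean, std_dev = 44, 8          # implied by the deviations and divisors in the table
values = [54, 42, 46, 32]

z_values = [z_score(x, mean, std_dev) for x in values]
print(z_values)  # [1.25, -0.25, 0.25, -1.5]
```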
Null and alternative hypothesis
Given the test scores of two random samples of men and women, does one group
differ from the other?
A possible null hypothesis is that the mean male score is the same as the mean
female score:
H0: μ1 = μ2
where:
H0 = the null hypothesis
μ1 = the mean male score
μ2 = the mean female score
A stronger null hypothesis is that the two samples are drawn from the same
population, such that the variance and shape of the distributions are also equal.
Types of Sampling
1 Random sampling
2 Systematic sampling
3 Stratified Sampling
4 Cluster Sampling
Simple Random Sampling
Systematic Sampling
Systematic sampling assumes that the list is well shuffled, which means there are
no clear trends, such as the data being ordered by age, income, or education, or a
characteristic recurring at regular intervals. If there is an ordering, there is the danger of
selecting a biased sample.
Example:
If the population consists of 1,000 people and we want to select a sample of 250,
we will take every fourth person on the list after selecting a random starting point
between 1 and 4.
This skip pattern is found by dividing the population size by the sample size
(1,000/250 = 4).
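The example above can be sketched in Python as follows. The function name `systematic_sample` and the numbered list standing in for the 1,000 people are assumptions for illustration; the sketch also assumes the population size is a multiple of the sample size.

```python
import random

def systematic_sample(population, sample_size, rng=None):
    """Take every k-th item after a random start, where k = N / n."""
    rng = rng or random.Random()
    skip = len(population) // sample_size   # skip pattern, e.g. 1,000 / 250 = 4
    start = rng.randint(1, skip)            # random starting point between 1 and k
    return population[start - 1::skip][:sample_size]

# Hypothetical list of people numbered 1..1000 (stand-in for a real sampling frame).
people = list(range(1, 1001))
sample = systematic_sample(people, 250, random.Random(0))

print(len(sample), sample[:5])
```

Passing a seeded `random.Random` only makes the illustration repeatable; in practice the starting point would be left random.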
Stratified Sampling
Stratified sampling is a sampling technique in which the researcher
divides the entire target population into different subgroups, or strata,
and then randomly selects the final subjects proportionally from the
different strata.
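A minimal Python sketch of proportional stratified selection is shown below. The three strata, their sizes, and the 10% sampling fraction are hypothetical values chosen only to illustrate the idea, and `stratified_sample` is an illustrative name.

```python
import random

def stratified_sample(strata, fraction, rng=None):
    """Randomly select the same fraction of subjects from every stratum."""
    rng = rng or random.Random()
    sample = {}
    for name, members in strata.items():
        k = round(len(members) * fraction)      # proportional allocation
        sample[name] = rng.sample(members, k)   # simple random sample within the stratum
    return sample

# Hypothetical population of 1,000 subjects split into three strata.
strata = {
    "stratum_a": list(range(0, 600)),
    "stratum_b": list(range(600, 900)),
    "stratum_c": list(range(900, 1000)),
}

sample = stratified_sample(strata, 0.10, random.Random(0))
sizes = {name: len(chosen) for name, chosen in sample.items()}
print(sizes)  # {'stratum_a': 60, 'stratum_b': 30, 'stratum_c': 10}
```

Because each stratum contributes in proportion to its size, the sample keeps the 60/30/10 split of the hypothetical population.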
Cluster Sampling
For example, let's say the target population in a study was church
members in the United States. There is no list of all church members
in the country. The researcher could, however, create a list of
churches in the United States, choose a sample of churches, and then
obtain lists of members from those churches.