Rm Unit 4 - Overview
UNIT IV
Syllabus
• Summarizing the Data: Mean, Median,
Mode and Standard Deviation
• Data Analysis Techniques: Univariate and
Bivariate Analysis (Chi Square, ANOVA, Sign
test); Multivariate Analysis (Discriminant
Analysis, Cluster Analysis, Factor Analysis,
Multiple Linear Regression).
Central Tendency
• Mean
• Median
• Mode
Deviation
• Mean Deviation
• Standard Deviation
Univariate and Bivariate Analysis
Univariate Analysis
Univariate Analysis refers to the analysis of a single variable, such as the average
weight of employees in an organisation. Here, the variable is not related to any
other variable.
Bivariate Analysis
Bivariate Analysis refers to the analysis of two variables, such as the age and weight
of employees. Here, the correlation between the two variables can be determined.
Data Analysis and Data Type
• Nominal/Categorical data (attributes) → Chi-Square Test (non-parametric)
– Goodness of fit
– Independence of attributes
• Ordinal/Interval/Ratio scaled data, small sample → Sign Test (non-parametric)
– Single sample
– Paired samples
• Ordinal/Interval/Ratio scaled data, large sample → Sign Test (normal approximation)
– Single sample
– Paired samples
Chi-Square: General Structure
• For analysis of categorical data
– Test for equality of percentages (goodness of fit)
– Test for independence
• The chi-square statistic measures the difference between
the actual counts and the expected counts (assuming
validity of the null hypothesis) as follows:

χ²stat = Σ i=1..n (Oi – Ei)² / Ei

• It is used when we need to find out whether two or more qualitative
attributes are independent.
Test for Goodness of Fit
Example: Coin
A coin is tossed 50 times and heads appears 30 times. Test at the five
percent level of significance whether the coin is unbiased.
Solution
H0: The coin is unbiased
H1: The coin is biased

χ²stat = Σ (Oi – Ei)² / Ei,  df = n – 1

Outcome   O    E    (O – E)²/E
Head      30   25   1
Tail      20   25   1
               χ² =  2

Vc = 2, Vt = 3.841 (5%, df = 1)
Vc < Vt
H0 is accepted
The coin is unbiased.
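This worked example can be checked with SciPy's goodness-of-fit test (a sketch, assuming SciPy is available):

```python
from scipy.stats import chisquare

# Observed: 30 heads, 20 tails in 50 tosses; expected 25/25 for an unbiased coin
stat, p = chisquare(f_obs=[30, 20], f_exp=[25, 25])
print(stat, p)  # statistic = 2.0; p ≈ 0.157 > 0.05, so H0 (unbiased) is accepted
```

Note that the library reports a p value directly, so no chi-square table lookup is needed.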
Chi-Square: Independence of Attributes
Synergy Ltd. organises a training programme for its 500 employees to improve
performance. Some attend the programme while others do not. The observations are as
follows:
H0: Training is not effective (attendance and improvement are independent).
H1: Training is effective.

View           Improve   Not Improve   Row Total
Attend         132       91            223
Not Attend     140       137           277
Column Total   272       228           Grand Total = 500

Expected Frequency = (Row Total × Column Total) / Grand Total
df = (c – 1)(r – 1); where c = Columns, r = Rows

Calculation of Expected Frequencies
a11 = (223 × 272)/500 = 121.31
a12 = (223 × 228)/500 = 101.69
a21 = (277 × 272)/500 = 150.69
a22 = (277 × 228)/500 = 126.31
O     E        (O – E)²   (O – E)²/E
132   121.31   114.23     0.94
91    101.69   114.23     1.12
140   150.69   114.23     0.76
137   126.31   114.23     0.90
                  χ² =    3.73

df = (c – 1)(r – 1) = (2 – 1)(2 – 1) = 1
Table value of χ² at 5% for 1 df = 3.841
Vc = 3.73, Vt = 3.841
Vc < Vt
H0 is accepted.
The training is not effective.
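The same contingency-table test in SciPy (a sketch; `correction=False` matches the hand calculation, which applies no Yates continuity correction):

```python
from scipy.stats import chi2_contingency

observed = [[132, 91],    # Attend:     Improve, Not Improve
            [140, 137]]   # Not Attend: Improve, Not Improve
stat, p, df, expected = chi2_contingency(observed, correction=False)
print(round(stat, 2), df)  # 3.73 with df = 1; p ≈ 0.053 > 0.05, keep H0
```

The `expected` array it returns reproduces the a11–a22 values computed above.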
ANOVA: General Structure
• One-Way ANOVA

Variance          SS    DF    MS                F-ratio   F-table
Between Sample    SSB   k-1   MSB = SSB/(k-1)   MSB/MSW
Within Sample     SSW   n-k   MSW = SSW/(n-k)
Total             SST   n-1

• Two-Way ANOVA

Variance          SS    DF           MS                         F-ratio   F-table
Between Columns   SSC   c-1          MSC = SSC/(c-1)            MSC/MSE
Between Rows      SSR   r-1          MSR = SSR/(r-1)            MSR/MSE
Residual Error    SSE   (c-1)(r-1)   MSE = SSE/[(c-1)(r-1)]
Total             SST   n-1
One-Way ANOVA

Variance          SS    DF           MS             F-ratio
Between Sample    SSB   V1 = k – 1   MSB = SSB/V1   MSB/MSW
Within Sample     SSW   V2 = N – k   MSW = SSW/V2
Total             SST   N – 1

T = Sum Total
Correction Factor CF = T²/N
SST = SSB + SSW
SSB = (ΣX1)²/n1 + (ΣX2)²/n2 + (ΣX3)²/n3 + … + (ΣXn)²/nn – T²/N
SST = Σi Xi² – T²/N
df = (v1, v2); v1 = k – 1; v2 = N – k
Example
Healthy Agro Ltd. sows three samples each of three kinds of seeds in a farm. The
productivity in tons is observed as follows:
H0: There is no difference in the seeds.
H1: There is a difference in the seeds.

      Seed 1   Seed 2   Seed 3   S1²   S2²   S3²
      5        7        18       25    49    324
      5        7        18       25    49    324
      7        12       18       49    144   324
SUM   17       26       54       99    242   972
      Total T = 97               Total ΣX² = 1313

CF = T²/N = 97²/9 = 1045.44
SST = ΣXi² – CF = 1313 – 1045.44 = 267.56
SSB = Σ(Ti²/ni) – CF = (17² + 26² + 54²)/3 – 1045.44 = 248.22
SSW = SST – SSB = 19.33
F = MSB/MSW = (248.22/2)/(19.33/6) = 124.11/3.22 = 38.52
F-table (v1 = 2, v2 = 6) at 5% = 5.14
Fc > Ft, so H0 is rejected: the seeds differ in productivity.
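The same one-way ANOVA in SciPy (a sketch, assuming SciPy is available):

```python
from scipy.stats import f_oneway

# productivity in tons for the three kinds of seeds
seed1, seed2, seed3 = [5, 5, 7], [7, 7, 12], [18, 18, 18]
F, p = f_oneway(seed1, seed2, seed3)
print(round(F, 2))  # 38.52; p < 0.05, so H0 is rejected
```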
Two-Way ANOVA

Variance          SS    DF                    MS                           F-ratio
Between Columns   SSC   V1 = c – 1            MSC = SSC/(c – 1)            MSC/MSE
Between Rows      SSR   V1 = r – 1            MSR = SSR/(r – 1)            MSR/MSE
Residual Error    SSE   V2 = (c – 1)(r – 1)   MSE = SSE/[(c – 1)(r – 1)]
Total             SST   N – 1

T = Sum Total
Correction Factor CF = T²/N
SST = SSC + SSR + SSE
SST = Σi Xi² – CF
SSC = (ΣX1)²/n1 + (ΣX2)²/n2 + (ΣX3)²/n3 + … + (ΣXn)²/nn – CF   (column totals)
SSR = (ΣY1)²/n1 + (ΣY2)²/n2 + (ΣY3)²/n3 + … + (ΣYn)²/nn – CF   (row totals)
df: v1 (for columns) = c – 1; v1 (for rows) = r – 1; v2 = (c – 1)(r – 1)
Example
Three brands of detergent have been used in three water
temperatures to wash similar kinds of cloth. The cleanliness is observed
as follows:

         Surf Excel   Tide   Wheel
Cold     5            7      18
Normal   7            12     21
Warm     10           14     25
H01: There is no significant difference in cleanliness due to the water temperatures.
H02: There is no significant difference in cleanliness due to the detergents.
              Surf Excel (X1)   Tide (X2)   Wheel (X3)   Total     X1²   X2²   X3²   Total
Cold (Y1)     5                 7           18           30        25    49    324   398
Normal (Y2)   7                 12          21           40        49    144   441   634
Warm (Y3)     10                14          25           49        100   196   625   921
Total         22                33          64           T = 119                     1953

CF = T²/N = 119²/9 = 1573.44
SST = ΣXi² – CF = 1953 – 1573.44 = 379.56
SSC = (22² + 33² + 64²)/3 – CF = 1889.67 – 1573.44 = 316.22
SSR = (30² + 40² + 49²)/3 – CF = 1633.67 – 1573.44 = 60.22
SSE = SST – (SSC + SSR) = 379.56 – 376.44 = 3.12
Source of Variation   SS             df          MS                        F                                       F crit
Columns               SSC = 316.22   V1(c) = 2   MSC = 316.22/2 = 158.11   Fc = MSC/MSE = 158.11/0.78 = 203.29     6.94
Rows                  SSR = 60.22    V1(r) = 2   MSR = 60.22/2 = 30.11     Fr = MSR/MSE = 30.11/0.78 = 38.71       6.94
Error                 SSE = 3.12     V2 = 4      MSE = 3.12/4 = 0.78

Both F-ratios exceed the table value of 6.94, so the null hypotheses are rejected:
cleanliness differs significantly across both the detergents and the water temperatures.
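The hand computation above can be reproduced in a few lines of NumPy (a sketch of the classical sums-of-squares arithmetic, not a library ANOVA call; results match up to rounding):

```python
import numpy as np

# rows = water temperatures (Cold, Normal, Warm); columns = detergents
X = np.array([[5, 7, 18],
              [7, 12, 21],
              [10, 14, 25]], dtype=float)
N = X.size
CF = X.sum() ** 2 / N                                # correction factor T²/N
SST = (X ** 2).sum() - CF
SSC = (X.sum(axis=0) ** 2 / X.shape[0]).sum() - CF   # between columns
SSR = (X.sum(axis=1) ** 2 / X.shape[1]).sum() - CF   # between rows
SSE = SST - SSC - SSR                                # residual error
MSE = SSE / ((3 - 1) * (3 - 1))
Fc, Fr = (SSC / 2) / MSE, (SSR / 2) / MSE
print(round(Fc, 1), round(Fr, 1))  # ≈ 203.3 and 38.7, both above F crit = 6.94
```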
Sign Test
Small Samples
Single Sample
Binomial probability: P(r) = nCr pʳ qⁿ⁻ʳ

Solution
H0: Median = 15
H1: Median ≠ 15

Observations (sign of deviation from 15):
1.00 –    8.90 –
2.00 –    9.00 –
3.00 –    9.30 –
4.00 –    9.70 –
5.00 –    12.00 –
6.00 –    12.25 –
6.70 –    14.25 –
7.00 –    14.45 –
7.10 –    18.00 +
7.25 –    19.00 +

+ = 2, – = 18, n = 20

nCr = n!/(r!(n – r)!), so 20C2 = (20 × 19)/2 = 190

P = 2(20C2 p²q¹⁸ + 20C1 p¹q¹⁹ + 20C0 p⁰q²⁰)   with p = q = 0.5
  = 2((190 × (0.5)²⁰) + (20 × (0.5)²⁰) + (1 × (0.5)²⁰))
  = 0.00040245
P < 0.05
H0 is rejected
Median ≠ 15
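SciPy's exact binomial test gives the same two-sided P value (a sketch, assuming SciPy ≥ 1.7 for `binomtest`):

```python
from scipy.stats import binomtest

# 2 plus signs out of n = 20 observations, under H0 p = 0.5
result = binomtest(k=2, n=20, p=0.5, alternative='two-sided')
print(round(result.pvalue, 8))  # 0.00040245 < 0.05, so reject H0
```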
Sign Test
Small Sample, Paired Samples
In an institute, two scientists, Mr. Goldsworthy and Mr. Sheraton, develop
two methods of training new employees. Since the data come from a very
small group of employees, they cannot be assumed to be normally
distributed. The observations are as follows:

Sr. No.   1  2  3  4  5  6  7  8  9  10 11 12 13 14
Method1   20 24 28 24 20 29 19 27 20 30 18 28 26 24
Method2   16 26 18 17 20 21 23 22 23 20 18 21 17 26
Solution
H0: Median 1 = Median 2
H1: Median 1 ≠ Median 2

Sr. No.   Method1   Method2   d
1         20        16        –
2         24        26        +
3         28        18        –
4         24        17        –
5         20        20        =
6         29        21        –
7         19        23        +
8         27        22        –
9         20        23        +
10        30        20        –
11        18        18        =
12        28        21        –
13        26        17        –
14        24        26        +

+ = 4, – = 8, = (ties) = 2, so n = 14 – 2 = 12

P = 2(12C4 p⁴q⁸ + 12C3 p³q⁹ + 12C2 p²q¹⁰ + 12C1 p¹q¹¹ + 12C0 p⁰q¹²)   with p = q = 0.5
  = 0.3876953125
P > 0.05
H0 is accepted
Median 1 = Median 2
There is no significant difference between the two methods.
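The same paired sign test via SciPy's exact binomial test (a sketch, assuming SciPy ≥ 1.7):

```python
from scipy.stats import binomtest

# 4 plus signs out of n = 12 untied pairs, under H0 p = 0.5
result = binomtest(k=4, n=12, p=0.5, alternative='two-sided')
print(result.pvalue)  # 0.3876953125 > 0.05, so H0 is accepted
```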
Sign Test
For Large Samples

Z = (x – np0) / √(np0(1 – p0))    or    Z = (x – 0.5n) / (0.5√n)
Sign Test
For Large Samples, Single Sample
Gannet India Ltd. estimates the median age of its employees to be 45 years. A sample
of 100 employees is taken, out of which 60 are above 45 years, 8 are exactly 45 years and
32 are below 45 years of age. Test whether the median age of the employees of the
company is 45 years.

Solution
H0: Median = 45
H1: Median ≠ 45

Z = ((X + 0.5) – 0.5n) / (0.5√n)    (with continuity correction)

Here n = 100 – 8 = 92 (ties excluded), X = 32, p = 0.5
Z = ((32 + 0.5) – 0.5 × 92) / (0.5√92) = –13.5 / 4.795 = –2.81

|Vc| = 2.81, Vt = 1.96
Vc > Vt
H0 is rejected
The median age is not 45 years.
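The Z computation is short enough to verify directly:

```python
import math

n = 100 - 8            # drop the 8 ties
x = 32                 # employees below 45 (the smaller sign count)
z = ((x + 0.5) - 0.5 * n) / (0.5 * math.sqrt(n))
print(round(abs(z), 2))  # 2.81 > 1.96, so reject H0
```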
Sign Test
For Large Samples, Paired Samples
Gannet India Ltd. provides training to its 100 employees. The findings are as follows:
Improved = 70
Worse than Previous = 25
No Change = 5
Test whether the training programme is effective.

Solution
H0: The training programme is not effective.
H1: The training programme is effective.

Z = ((X + 0.5) – 0.5n) / (0.5√n)

Here n = 100 – 5 = 95 (ties excluded), X = 25, p = 0.5
Z = ((25 + 0.5) – 0.5 × 95) / (0.5√95) = –22 / 4.87 = –4.51

|Vc| = 4.51, Vt = 1.645 (one-tailed)
Vc > Vt
H0 is rejected
The training programme is effective.
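And the paired large-sample version, computed the same way:

```python
import math

n = 100 - 5            # drop the 5 "no change" cases
x = 25                 # the smaller sign count (worse than previous)
z = ((x + 0.5) - 0.5 * n) / (0.5 * math.sqrt(n))
print(round(abs(z), 2))  # 4.51 > 1.645, so reject H0
```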
Factor Analysis
It is used
• Checking Reloads
Factor Analysis
It discovers how the original variables are organised in a particular way,
reflecting another 'latent variable'.
• Most often it is used in multivariate technique of research
studies, particularly in social and behavioural sciences
Construct = Latent Variable = Factor
(Path diagram: a latent factor, e.g. Intelligence, with loadings on observed
variables X1–X7; the loading is the degree of correlation. Each Xi carries an
error term ei. ei (i = 1, 2, 3, …) is the unique contribution of the variable
that cannot be predicted from the remaining variables; it is equal to 1 – R².)
Factor Analysis
Why do we look at “dimensions”?
• We study phenomena that cannot be directly observed
– (ego, personality, intelligence)
• We have too many observations
– need to “reduce” them to a smaller set of factors
• Items are representations of underlying or latent factors.
– We want to know what these factors are.
– We have an idea of the phenomena that a set of items represent (construct
validity).
• To find underlying latent constructs
– As manifested in individual items
• To assess the association between these factors
• To produce usable scores that reflect critical aspects of any complex
phenomenon
– (e.g. personality, intelligence, values, air)
• An end in itself and a major step toward creating error-free measures
Factor Analysis
Basic Concept
• But suppose one item is just a little better than another at representing the
underlying phenomenon?
• FACTOR ANALYSIS looks for the phenomena underlying the observed variance
and covariance in a set of variables.
• These phenomena are called "factors" or "principal components."
Condition
• For interval or ratio scaled data only
• Usually the sample size should be at least five times the total
number of variables.
Adequacy
• Bartlett's chi-square p value should be less than 0.05
• KMO statistic should be greater than 0.5
• Determinant of correlation matrix |R| > 0.00001 (Field 2012, p. 771)

KMO           Interpretation
> 0.90        Marvellous
0.80 – 0.90   Meritorious
0.70 – 0.79   Middling
0.60 – 0.69   Mediocre
0.50 – 0.59   Miserable
< 0.50        Unacceptable
Statistics Associated with Factor Analysis
• Bartlett's test of sphericity. Bartlett's test of sphericity is a test statistic used
to examine the hypothesis that the variables are uncorrelated in the
population. In other words, the population correlation matrix is an identity
matrix; each variable correlates perfectly with itself (r = 1) but has no
correlation with the other variables (r = 0).
• The Bartlett Test of Sphericity compares the correlation matrix with a matrix of
zero correlations (technically called the identity matrix, which consists of all
zeros except the 1’s along the diagonal).
χ² = –[(n – 1) – (2p + 5)/6] × ln|R|
df = p(p – 1)/2
where
n = sample size
p = number of variables
ln = natural log
|R| = determinant of the correlation matrix
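Bartlett's statistic is easy to compute directly with NumPy (a sketch; the 3-variable correlation matrix and sample size here are invented for illustration):

```python
import math
import numpy as np

def bartlett_sphericity(R, n):
    """Chi-square statistic and df for Bartlett's test of sphericity."""
    p = R.shape[0]
    chi2 = -(n - 1 - (2 * p + 5) / 6) * math.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return chi2, df

# toy correlation matrix for p = 3 variables, sample size n = 50
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
chi2, df = bartlett_sphericity(R, n=50)
print(round(chi2, 2), df)  # a large chi-square with df = 3 rejects sphericity
```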
Kaiser-Meyer-Olkin Measure of Sampling Adequacy

KMO = Σi Σj≠i rij² / (Σi Σj≠i rij² + Σi Σj≠i aij²)

where
a = partial correlation
aij = Vij / √(Vii × Vjj)
V = inverse of the correlation matrix r
Vii, Vjj = diagonal elements of V
Role of Correlation in Factor Analysis
• Holzinger and Swineford (1939) – variable sets must be clustered with high correlation
• Tabachnick and Fidell (2001) – the maximum correlation must be greater than 0.3
• Correlation matrix. A correlation matrix is a lower triangle matrix showing the simple
correlations, r, between all possible pairs of variables included in the analysis. The
diagonal elements, which are all 1, are usually omitted.

Name of Matrix        Elements                        Good Signs                               Bad Signs
Correlation R         Correlations                    Many above 0.3 and possible clustering   Few above 0.3
Partial correlation   Partial correlations            Few above 0.3                            Many above 0.3 and possible clustering
Anti-image            Partial correlations reversed   Few above 0.3                            Many above 0.3 and possible clustering
X1 X2 X3 X4 X5 X6 X7
X1 1.000 0.770 0.810 0.210 0.180 0.190 0.210
X2 0.770 1.000 0.870 0.250 0.170 0.210 0.220
X3 0.810 0.870 1.000 0.180 0.210 0.240 0.410
X4 0.210 0.250 0.180 1.000 0.270 0.240 0.210
X5 0.180 0.170 0.210 0.270 1.000 0.870 0.900
X6 0.190 0.210 0.240 0.240 0.870 1.000 0.870
X7 0.210 0.220 0.410 0.210 0.900 0.870 1.000
Elements in Principal Component Method
The proportion of total variance explained by a factor = Ei / n,
where Ei is the factor's eigenvalue and n is the number of variables
(for a correlation matrix the eigenvalues sum to n, so the proportions sum to 1).
• Factor loadings. Factor loadings are simple correlations between the variables and the factors, i.e. the
correlation between a specific observed variable and a specific factor. Higher values mean a closer
relationship. They are equivalent to standardised regression coefficients (β weights) in multiple regression.
The higher the value, the better.
• Communality. Communality is the amount of variance a variable shares with all the other variables being
considered. This is also the proportion of variance explained by the common factors. It is the total influence
on a single observed variable from all the factors associated with it. It equals the sum of the squared factor
loadings, over all retained factors (eigenvalue greater than 1), for that observed variable, and is the same as
R² in multiple regression. The higher the value, the better.
(1 – communality) is the proportion of a variable's variance not explained or predicted by the model.

hi² = Σk Xik²   (sum of squared loadings of variable i across factors F1 … Fk)

• Factor loading plot. A factor loading plot is a plot of the original variables using the factor loadings as
coordinates.
• Factor matrix. A factor matrix contains the factor loadings of all the variables on all the factors extracted.
• Rotation
Rotation is selected as per the nature of the interrelations of the variables. It is of two types:
• Oblique; and
• Orthogonal
• Oblique
If the variables are assumed to be dependent or related, Oblique Rotation is
selected. It consists of Direct Oblimin or Promax.
(Diagram: in oblique rotation the angle θ between the factor axes is not fixed at 90°.)
• Orthogonal
If the variables are assumed to be independent, Orthogonal Rotation is selected.
It consists of Varimax, Quartimax and Equimax.
(Diagram: in orthogonal rotation the factor axes stay at 90° to one another.)
Rotated Component Matrix
The Rotated Component Matrix identifies the group of similar variables
with the latent factor. In the initial eigenvalues, the difference in eigenvalues
between the factors is higher than in the rotated loadings.
Transformation Matrix
It explains how much a particular factor is rotated. For example, if the
value is 0.707, the rotation is 45° because cos 45° = 0.707.
In Excel: cos x (in degrees) = COS(RADIANS(x)); cos⁻¹ x in degrees = ACOS(x) * 180 / PI().
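The same angle recovery can be checked in Python:

```python
import math

# a transformation-matrix entry of 0.707 implies a rotation of about 45 degrees
angle = math.degrees(math.acos(0.707))
print(round(angle))  # 45
```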
Eigenvalue in Matrix
The eigenvalues λ of a square matrix A with eigenvectors V satisfy
(A – λI)V = 0, i.e. det(A – λI) = 0
where
A = square matrix
I = identity matrix
V = eigenvectors

In Factor Analysis, the eigenvalue of a factor is the sum of the squares of the loadings on that factor.

      F1    F2    F3    ---   Fi
X1    X11   X12   X13   ---   X1i
X2    X21   X22   X23   ---   X2i
X3    X31   X32   X33   ---   X3i
X4    X41   X42   X43   ---   X4i
X5    X51   X52   X53   ---   X5i
X6    X61   X62   X63   ---   X6i
X7    X71   X72   X73   ---   X7i

Eigenvalue for Factor 1 = Σ i=1..7 Xi1²
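Both readings of "eigenvalue" can be checked with NumPy (an illustrative sketch; the matrices are invented):

```python
import numpy as np

# matrix eigenvalues: solutions of det(A - lambda*I) = 0
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigenvalues = sorted(np.linalg.eigvals(A))
print(eigenvalues)  # [2.0, 3.0]

# factor-analysis eigenvalue: column sums of squared loadings
loadings = np.array([[0.9, 0.1],
                     [0.8, 0.2],
                     [0.1, 0.7]])
print((loadings ** 2).sum(axis=0))  # one eigenvalue per factor
```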
Multiple Regression Analysis
• A procedure for analyzing associative relationships between a metric
dependent variable and one or more independent variables
– Existence of a relationship
– Strength of the relationship
– Predict the values of the dependent variable
– Control for other independent variables when evaluating the contributions
of a specific variable or a set of variables
Multiple Regression
Multiple Regression allows us to:
Use several variables at once to explain the variation in a continuous
dependent variable.
Isolate the unique effect of one variable on the continuous dependent
variable while taking into consideration that other variables are affecting it
too.
Write a mathematical equation that tells us the overall effects of several
variables together and the unique effects of each on a continuous
dependent variable.
Control for other variables to demonstrate whether bivariate relationships
are spurious
Regression Analysis
Deterministic Model
Yi = β0 + β1X1 + β2X2 + β3X3 + … + βiXi
Probabilistic Model
Yi = β0 + β1X1 + β2X2 + β3X3 + … + βiXi + μ,  where μ is the random error term
Multiple Regression Analysis
The general form of the multiple regression model is as follows:
Y = a + b1X1 + b2X2 + b3X3 + … + bkXk
The coefficient a represents the intercept, and the b's are the partial regression
coefficients, i.e. slopes.
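A minimal sketch of fitting such a model by ordinary least squares with NumPy (the data are made up for illustration and generated without noise, so the coefficients are recovered exactly):

```python
import numpy as np

# illustrative data generated from Y = 1 + 2*X1 + 3*X2
X1 = np.array([0., 1., 2., 3., 4.])
X2 = np.array([1., 0., 2., 1., 3.])
Y = 1 + 2 * X1 + 3 * X2

A = np.column_stack([np.ones_like(X1), X1, X2])   # design matrix [1, X1, X2]
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(coeffs)  # intercept a and partial slopes b1, b2: [1. 2. 3.]
```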
Cluster Analysis
(Illustration: cases plotted on variables such as Age, Income and Gender form
natural groups.)
Cluster Analysis
Introduction
• A technique used to classify objects or cases into relatively homogeneous
groups called clusters
Cluster Analysis
Application
• Market segmentation based on benefits sought by the customers
Statistics Associated with Cluster Analysis
• Agglomeration schedule. An agglomeration schedule gives information on
the objects or cases being combined at each stage of a hierarchical clustering
process.
• Cluster centroid. The cluster centroid is the mean values of the variables for
all the cases or objects in a particular cluster.
• Cluster centers. The cluster centers are the initial starting points in
nonhierarchical clustering. Clusters are built around these centers, or seeds.
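The centroid-and-center idea above can be sketched with a tiny k-means loop in NumPy (the customer data and the two-cluster choice are invented for illustration):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: assign cases to the nearest center, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial cluster centers
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dist.argmin(axis=1)                        # nearest cluster center
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

# hypothetical customers described by (age, income in thousands)
X = np.array([[25, 20], [27, 22], [26, 21],      # younger, lower income
              [55, 80], [58, 85], [60, 82]], float)
labels, centers = kmeans(X, k=2)
print(labels)  # the first three cases share one label, the last three the other
```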
Discriminant Analysis
Introduction
• Need to classify people into two or more groups
– Buyers / Non-Buyers
– Good / Bad credit risk
– Superior / Average / Poor Products
• Goal
– To establish a procedure to find the predictors that best classify subjects
• Uses
– Market segmentation research
Discriminant Analysis
Introduction
• Dependent variable is categorical (nominal or non-metric)
– Nominal: Gender, Religion
– Promotion: Low, Medium, High
• Predictor variables are interval in nature
• Involves deriving a variate, the linear combination of the two (or more)
independent variables that will discriminate best between a priori defined
groups
• Hypothesis: the group means of a set of independent variables for two or more
groups are equal
Discriminant Analysis
Objectives
• Development of discriminant functions, or linear combinations of the predictor or
independent variables, which will best discriminate between the categories of the
criterion or dependent variable (groups).
• Examination of whether significant differences exist among the groups, in terms of the
predictor variables.
• Classification of cases to one of the groups based on the values of the predictor
variables.
Discriminant Analysis
Discriminant Function
• Discriminant analysis is done by calculating a linear function of the form
Di = d0 + d1X1 + d2X2 + d3X3 + . . . + dpXp
Where
– Di is the score on discriminant function i.
– The di's are weighting coefficients; d0 is a constant.
– The X's are the values of the discriminating variables used in the analysis.
• Number of discriminant equations required
– Two groups – one; three groups – two; N groups – N – 1 equations
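A two-group discriminant function of this form can be sketched with Fisher's approach in NumPy (the buyer/non-buyer data are invented for illustration):

```python
import numpy as np

# two a priori groups measured on two interval-scaled predictors
buyers     = np.array([[8., 7.], [9., 6.], [7., 8.], [8., 6.]])
non_buyers = np.array([[2., 3.], [3., 2.], [1., 3.], [2., 2.]])

m1, m2 = buyers.mean(axis=0), non_buyers.mean(axis=0)
# pooled within-group scatter matrix
Sw = np.cov(buyers.T) * (len(buyers) - 1) + np.cov(non_buyers.T) * (len(non_buyers) - 1)
w = np.linalg.solve(Sw, m1 - m2)      # weighting coefficients d1, d2
cut = w @ (m1 + m2) / 2               # cut-off midway between projected group means

def classify(x):
    return "buyer" if w @ np.asarray(x, float) > cut else "non-buyer"

print(classify([9, 7]), classify([1, 2]))  # buyer non-buyer
```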
Discriminant Analysis
Good     5   5   5
Bad      7   5   7
Normal   3   5   7