Assignment #1
Leora Fisher
1. Why would someone use multivariate statistics instead of several univariate
analyses? What are the benefits and drawbacks of a multivariate analysis?
The choice between multivariate and univariate statistics is based upon the
researcher's question. Univariate statistics examines categorical and/or
quantitative variables separately, whereas multivariate statistics allows
researchers to compare multiple variables within a single treatment study.
Multivariate statistics allows us to examine the relationships among many
variables, and can include a mixture of quantitative and qualitative/categorical
variables. Although we can investigate multiple relationships by running several
univariate analyses, it is much more efficient to use multivariate statistics.
Multiple univariate analyses address questions about individual outcome
variables, but they do not reveal the interrelationships among the outcome
variables. Multivariate analyses address questions about overall effects and
about multiple systems/variables at once (effects, comparisons, order,
differences, contributions, and identifying underlying constructs).
The benefits of multivariate statistics are numerous and varied:
- Multivariate statistics gives a more complete and detailed description of the
data, as it allows the researcher to investigate (compare) multiple criterion
measures.
- Because subjects respond differently to treatments, and studies are often
associated with a high cost (financial, time, effort, etc.), multivariate
statistics is an efficient and effective way of gaining information from one
study rather than conducting multiple univariate analyses.
- Multivariate methods allow researchers to examine a much more complex array of
variables simultaneously, with results providing less error and more validity
than examining each variable in isolation.
Additional benefits include: greater flexibility and options for analyses, better
understanding of published research, the ability to examine large sets of data
while controlling for error and considering correlations, specific assessment
measures targeted at determining whether analyses are behaving as expected,
and more positive attitudes towards statistics based on a better understanding.
Furthermore, the benefits of using multivariate analysis over repeated univariate
tests are:
- reduced Type I error,
- the ability to examine correlations between dependent variables,
- better detection of variable differences (within-group and between-group), and
- the amount each variable contributes to the overall significance.
Drawbacks of using multivariate statistics are typically related to the size and
complexity of multivariate designs. Three specific problematic issues are:
- the statistical assumptions of the linear model must be met in the multivariate
method,
- larger numbers of participants are needed, and
- the interpretation of results may be difficult given the complexity of the
output.
Other drawbacks described in the various readings include the fact that
multivariate methods may be challenging to learn given their complexity, and that
their broader focus requires more expansive thinking.
2. Why are assumptions so important in statistics? What do assumptions give us?
For example, when conducting an ANOVA what assumptions must we make and
how do they affect the conclusions that we make? What makes the assumptions of
multivariate statistics so challenging to ascertain?
Assumptions provide the basis for performing any statistical measure. For
example, when conducting an ANOVA (or any linear model), the assumptions are
that the independent variables are at the nominal (categorical) level, that the
groups are mutually exclusive, and that the variances across groups are equal
(homogeneity of variance). The dependent variable is continuous and normally
distributed within groups.
Assumptions are also critical when considering errors. In linear regression, it is
assumed that errors:
- are independent (one error does not influence another),
- follow a normal distribution, and
- have a constant variance (they do not vary more widely at any point in the
distribution).
It is necessary to assess these assumptions about the error distribution.
Researchers must first examine a plot of the residuals against the fitted values.
Secondly, the assumption of normally distributed errors is checked by examining
how closely the data follow the 45-degree line on a P-P plot. As well, checking
the assumption of equal variances is necessary: within-group, between-group, and
total variance affect our conclusions and must be accounted for. If this
assumption is violated, a more conservative alpha level, such as .025 or .01
rather than the typical .05, should be used when determining significance for
that variable.
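As an illustrative sketch only (not part of the assignment, whose data questions used SPSS), these error checks can be written in Python with simulated, hypothetical data:

import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Hypothetical data: one predictor and one continuous outcome.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)

# Fit the linear model and pull out the residuals.
model = sm.OLS(y, sm.add_constant(x)).fit()
resid = model.resid

# Normality of errors: Shapiro-Wilk (p > .05 is consistent with normality).
print("Shapiro-Wilk:", stats.shapiro(resid))

# Constant variance: Breusch-Pagan (p > .05 is consistent with homoscedasticity).
print("Breusch-Pagan p:", het_breuschpagan(resid, model.model.exog)[1])

# Independence of errors: a Durbin-Watson statistic near 2 suggests independence.
print("Durbin-Watson:", durbin_watson(resid))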
3. Matrix operations, given:

A = \begin{bmatrix} 2 & 1 \\ 3 & 0 \\ 7 & 5 \end{bmatrix}, \quad B = \begin{bmatrix} 4 & 2 \\ 4 & 2 \\ 4 & 2 \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}

a) A - B

\begin{bmatrix} 2 & 1 \\ 3 & 0 \\ 7 & 5 \end{bmatrix} - \begin{bmatrix} 4 & 2 \\ 4 & 2 \\ 4 & 2 \end{bmatrix} = \begin{bmatrix} -2 & -1 \\ -1 & -2 \\ 3 & 3 \end{bmatrix}
b) B - A

\begin{bmatrix} 4 & 2 \\ 4 & 2 \\ 4 & 2 \end{bmatrix} - \begin{bmatrix} 2 & 1 \\ 3 & 0 \\ 7 & 5 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \\ -3 & -3 \end{bmatrix}
c) A + B

\begin{bmatrix} 2+4 & 1+2 \\ 3+4 & 0+2 \\ 7+4 & 5+2 \end{bmatrix} = \begin{bmatrix} 6 & 3 \\ 7 & 2 \\ 11 & 7 \end{bmatrix}
D = \begin{bmatrix} 1.0 & .9 \\ .9 & 1.0 \end{bmatrix}, \quad E = \begin{bmatrix} 4 & 2 & 1 \\ 3 & 3 & 2 \\ 2 & 4 & 5 \end{bmatrix}
d) A + C

\begin{bmatrix} 2 & 1 \\ 3 & 0 \\ 7 & 5 \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = not defined (the matrices have different dimensions)
e) CD

C = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad D = \begin{bmatrix} 1.0 & .9 \\ .9 & 1.0 \end{bmatrix}

(CD)_{11} = (1 \;\; 0) \begin{bmatrix} 1.0 \\ .9 \end{bmatrix} = 1(1.0) + 0(.9) = 1.0

(CD)_{12} = (1 \;\; 0) \begin{bmatrix} .9 \\ 1.0 \end{bmatrix} = 1(.9) + 0(1.0) = .9

(CD)_{21} = (0 \;\; 1) \begin{bmatrix} 1.0 \\ .9 \end{bmatrix} = 0(1.0) + 1(.9) = .9

(CD)_{22} = (0 \;\; 1) \begin{bmatrix} .9 \\ 1.0 \end{bmatrix} = 0(.9) + 1(1.0) = 1.0

CD = \begin{bmatrix} 1.0 & .9 \\ .9 & 1.0 \end{bmatrix}
f) C'D

Matrices cannot be divided; an expression such as C/D would instead be computed
by multiplying C by the inverse of D. Here, however, C' is simply the transpose
of C, and because C is the identity matrix, C' = C:

C' = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}

C'D = D = \begin{bmatrix} 1.0 & .9 \\ .9 & 1.0 \end{bmatrix}
g) Determinant of E

E = \begin{bmatrix} 4 & 2 & 1 \\ 3 & 3 & 2 \\ 2 & 4 & 5 \end{bmatrix}

Expanding along the first row:

Element a11 = 4: minor = \begin{vmatrix} 3 & 2 \\ 4 & 5 \end{vmatrix} = 15 - 8 = 7; cofactor = +7; element x cofactor = 4 x 7 = 28

Element a12 = 2: minor = \begin{vmatrix} 3 & 2 \\ 2 & 5 \end{vmatrix} = 15 - 4 = 11; cofactor = -11; element x cofactor = 2 x (-11) = -22

Element a13 = 1: minor = \begin{vmatrix} 3 & 3 \\ 2 & 4 \end{vmatrix} = 12 - 6 = 6; cofactor = +6; element x cofactor = 1 x 6 = 6

Therefore |E| = 28 + (-22) + 6 = 12.
h) Inverse of E

As above, plus the remaining minors and cofactors:

Element a22 = 3: minor = \begin{vmatrix} 4 & 1 \\ 2 & 5 \end{vmatrix} = 20 - 2 = 18; cofactor = +18

Element a23 = 2: minor = \begin{vmatrix} 4 & 2 \\ 2 & 4 \end{vmatrix} = 16 - 4 = 12; cofactor = -12

Element a33 = 5: minor = \begin{vmatrix} 4 & 2 \\ 3 & 3 \end{vmatrix} = 12 - 6 = 6; cofactor = +6

The cofactors for a21, a31, and a32 are computed in the same way (-6, +1, and -5
respectively). The inverse is the transpose of the cofactor matrix divided by the
determinant:

E^{-1} = \frac{1}{12} \begin{bmatrix} 7 & -6 & 1 \\ -11 & 18 & -5 \\ 6 & -12 & 6 \end{bmatrix} = \begin{bmatrix} 7/12 & -1/2 & 1/12 \\ -11/12 & 3/2 & -5/12 \\ 1/2 & -1 & 1/2 \end{bmatrix}
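As a quick verification (an illustrative Python sketch, not part of the hand calculation), NumPy reproduces the determinant and inverse:

import numpy as np

E = np.array([[4, 2, 1],
              [3, 3, 2],
              [2, 4, 5]], dtype=float)

print(np.linalg.det(E))  # 12.0, up to floating-point error
print(np.linalg.inv(E))  # matches (1/12) x the adjugate computed in part h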
4. What does matrix D tell us about the relationship between the two variables
represented in the matrix? What kind of matrix is this?
Matrix D is a correlation matrix (R). This is a p x p square, symmetrical matrix,
with 1.0 on the diagonal and the correlations between the variables on the
off-diagonals (i.e., the magnitude and direction of each relationship). Matrix D
indicates a strong positive correlation of .9 between the two variables.
5. What kind of matrix is matrix B?
Matrix B is a matrix of means (an n x p matrix in which every row contains the
variable means).
6. What kind of matrix is matrix C?
Matrix C is a 2x2 Diagonal Matrix, specifically an Identity Matrix of the second
order.
7. Why is it important to have a non-zero determinant?
The determinant represents the generalized variance for a set of variables. If
the determinant is zero, at least one variable is a perfect linear combination of
the others (the variables are completely redundant), so the matrix carries no
unique variance. A matrix can be inverted only if its determinant is non-zero (as
with division, one cannot divide a number by zero). Additionally, without a
non-zero determinant it would not be possible to calculate between-group and
within-group variability. Lastly, if variables are redundant, eliminating the
redundant variables will correct the corresponding errors.
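To illustrate with a small hypothetical example (not from the assignment): a correlation matrix containing a perfectly redundant variable has a zero determinant and cannot be inverted:

import numpy as np

# Two perfectly correlated (redundant) variables: the matrix is singular.
R = np.array([[1.0, 1.0],
              [1.0, 1.0]])

print(np.linalg.det(R))  # 0.0
try:
    np.linalg.inv(R)
except np.linalg.LinAlgError as err:
    print("Cannot invert:", err)  # raises "Singular matrix"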
8. Prepare the attached data set for multivariate analyses (base this on the data
cleaning attachment that I included). Describe the key features of the data
(outliers, missing data, distributional forms, homoscedasticity, multicollinearity,
etc.), and decide what should be done to this data to prepare it for analysis.
Imagine that these data were collected from two groups, group 1 was the control
group and group 2 received a reading intervention. Scores range from 0 to 200.
That is all that you need to know about the data.
Descriptive statistics (SPSS output):

                  group     blending    Reading         Reading words   Ability to divide
                                        comprehension   per minute      words into syllables
N (Valid)         99        75          91              92              93
N (Missing)       0         24          8               7               6
Mean              1.4545    108.6185    64.0038         76.2793         61.6135
Std. Deviation    .50046    36.10021    27.52857        35.98013        19.78342
1. Univariate Outliers:
There were several outliers as indicated on the box plot below, but the extreme
outliers are explained by the out-of-range data, indicating errors in the data.
The exception is subject 94 (syllabication); that outlier falls within the range
of 0-200. As indicated by the stars (*), there are 3 serious outliers (subject 15
in reading comprehension, 55 in words per minute, and 94 in dividing words) that
will impact the analyses and lead to problems in generalizability. As well,
subjects 29 and 66 (in blending) need to be considered, as their scores are out
of range. As there are only a few outliers, they can be examined individually to
discover why they are extreme.
To address the outliers, there are three steps (a screening sketch follows the list):
1. Check the accuracy of the data collection and entry so that errors can be
corrected. Although the data in question are the result of errors, an examination
of the variables shows that no one variable is responsible for most of the
outliers (they exist on all 4 at some level); therefore all variables may be kept.
2. Determine whether the outliers are part of the intended population, and delete
or keep them depending on whether they are.
3. Reduce the impact of the outliers by either transforming or changing the scores.
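As an illustration of step 1 (the assignment used SPSS; the file and column names below are hypothetical), out-of-range values and extreme box-plot outliers can be flagged in Python:

import pandas as pd

df = pd.read_csv("reading_data.csv")  # hypothetical file name
measures = ["blending", "reading_comprehension", "words_per_minute", "syllabication"]

for col in measures:
    # Impossible values (scores must fall between 0 and 200) signal entry errors.
    out_of_range = df[(df[col] < 0) | (df[col] > 200)]
    print(col, "out of range:", out_of_range.index.tolist())

    # Extreme outliers by the 3 x IQR rule (the '*' cases on an SPSS box plot).
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    extreme = df[(df[col] < q1 - 3 * iqr) | (df[col] > q3 + 3 * iqr)]
    print(col, "extreme outliers:", extreme.index.tolist())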
2. Missing Data
Missing data is within an acceptable range for all areas, with the exception of
blending. The percentage of missing data is approximately 9% in reading
comprehension, 8% in reading words per minute, and 6% in dividing words. In these
cases, you could impute the data using the mean for that variable. However, the
missing data for blending (roughly 24%) is too extensive for imputation, which is
a further argument for eliminating blending as a variable (a mean-imputation
sketch follows the table below).
             blending    Reading         Reading words   Ability to divide
                         comprehension   per minute      words into syllables
Valid        75          91              92              93
Missing      24          8               7               6
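A minimal sketch of the mean imputation described above (hypothetical file and column names):

import pandas as pd

df = pd.read_csv("reading_data.csv")  # hypothetical file name

# Replace missing values with the variable mean (Series.mean() skips NaN).
for col in ["reading_comprehension", "words_per_minute", "syllabication"]:
    df[col] = df[col].fillna(df[col].mean())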
3. Linear Relationships
As illustrated by the graphs below, a linear relationship cannot be assumed
between all variables. In checking the assumptions of the regression model, the
points on a residual scatterplot should be randomly scattered. If the points
instead form a pattern (as with the scatterplots below), it is necessary to
reconsider the assumption of normally distributed error. Graph b indicates a
large amount of correlation between two dependent variables (blending and
syllabication). As previously alluded to, this is a trend that becomes apparent
throughout the analysis: the blending variable continues to be problematic in the
statistical measures completed, and would need to be dealt with.
It will also be necessary to identify the outliers on y using standardized
residuals, and the outliers on x using the hat elements (h_ii), which lie between
0 and 1, with unusually large values indicating high leverage (a diagnostic
sketch follows the scatterplots).
(Scatterplots a-d are not reproduced here.)
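A sketch of those two diagnostics in Python (simulated, hypothetical data; the assignment itself relied on SPSS plots):

import numpy as np
import statsmodels.api as sm

# Hypothetical predictor and outcome standing in for the assignment variables.
rng = np.random.default_rng(2)
x = rng.uniform(0, 200, size=99)
y = 0.4 * x + rng.normal(scale=20.0, size=99)

model = sm.OLS(y, sm.add_constant(x)).fit()
infl = model.get_influence()

# Outliers on y: |standardized (internally studentized) residual| > 3.
std_resid = infl.resid_studentized_internal
print("y-outliers:", np.where(np.abs(std_resid) > 3)[0])

# Outliers on x: hat values lie between 0 and 1; flag those above 2p/n.
hat = infl.hat_matrix_diag
p = model.df_model + 1  # number of parameters, including the intercept
print("high leverage:", np.where(hat > 2 * p / len(x))[0])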
4. Normality
Skewness and kurtosis values close to zero indicate a normal distribution. The
skewness and kurtosis values for all the variables are elevated, demonstrating
non-normal distributions. The histograms show that the distributions are skewed
and that there are obvious outliers. Therefore, an assumption of normality would
be erroneous here. Elimination of the problematic data or transformation of the
data using a square-root function are the best available ways to deal with this
problem (see the sketch after the table below).
Statistics

                         Reading         Reading words   Ability to divide
                         comprehension   per minute      words into syllables   blending   group
N (Valid)                91              92              93                     75         99
N (Missing)              8               7               6                      24         0
Skewness                 4.765           2.452           3.633                  -1.385     .185
Std. Error of Skewness   .253            .251            .250                   .277       .243
Kurtosis                 28.890          15.602          25.391                 2.411      -2.007
Std. Error of Kurtosis   .500            .498            .495                   .548       .481
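A sketch of the square-root transformation and a re-check of skewness and kurtosis (hypothetical file and column names; scores of 0-200 need no shift before taking roots):

import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew

df = pd.read_csv("reading_data.csv")  # hypothetical file name
col = "reading_comprehension"
transformed = np.sqrt(df[col])

# Values near zero indicate a roughly normal distribution.
print("before:", skew(df[col], nan_policy="omit"), kurtosis(df[col], nan_policy="omit"))
print("after: ", skew(transformed, nan_policy="omit"), kurtosis(transformed, nan_policy="omit"))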
5. Multivariate Outliers

Mahalanobis Distance

                    Frequency   Percent   Valid Percent   Cumulative Percent
        ...
        35.93451    1           1.0       1.7             96.7
        46.41977    1           1.0       1.7             98.3
        54.33985    1           1.0       1.7             100.0
        Total       60          60.6      100.0
Missing System      39          39.4
Total               99          100.0
Skewness and kurtosis values are at a considerable distance from zero, which
indicates that the distribution is not normal. An examination of the Mahalanobis
distance frequency chart in SPSS shows multivariate outliers beyond the critical
value of 18.73 indicated in the Stevens (2009) chart, including 35.93, 46.42, and
54.34. Again, elimination of blending as a variable would correct the data so
that these outliers do not affect the Mahalanobis scores. (A computation sketch
follows the Cook's distance table below.)
Cook's Distance

                    Frequency   Percent   Valid Percent   Cumulative Percent
        ...
        1.18532     1           1.0       1.7             96.7
        1.73923     1           1.0       1.7             98.3
        19.22560    1           1.0       1.7             100.0
        Total       60          60.6      100.0
Missing System      39          39.4
Total               99          100.0

By the common rule of thumb, Cook's distance values greater than 1 identify
influential cases, so the three largest values shown (1.19, 1.74, and 19.23) also
flag influential observations.
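For illustration, the Mahalanobis distances can be computed outside SPSS as follows (a minimal sketch; hypothetical file and column names, with listwise deletion assumed):

import numpy as np
import pandas as pd
from scipy.stats import chi2

df = pd.read_csv("reading_data.csv")  # hypothetical file name
cols = ["blending", "reading_comprehension", "words_per_minute", "syllabication"]
X = df[cols].dropna().to_numpy()

# Squared Mahalanobis distance of each case from the centroid.
center = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - center
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Flag cases beyond the chi-square critical value (alpha = .001, df = 4 variables),
# approximately 18.47, close to the 18.73 cited from Stevens (2009).
crit = chi2.ppf(0.999, df=X.shape[1])
print("multivariate outliers:", np.where(d2 > crit)[0])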
Correlations (SPSS output):

                                            blending   Ability to divide      Reading words   Reading
                                                       words into syllables   per minute      comprehension
blending               Pearson Correlation  1          .476**                 .270*           .108
                       Sig. (2-tailed)                 .000                   .024            .378
                       N                    75         70                     70              69
Ability to divide      Pearson Correlation  .476**     1                      .314**          .071
words into syllables   Sig. (2-tailed)      .000                              .003            .518
                       N                    70         93                     88              86
Reading words          Pearson Correlation  .270*      .314**                 1               .140
per minute             Sig. (2-tailed)      .024       .003                                   .203
                       N                    70         88                     92              85
Reading                Pearson Correlation  .108       .071                   .140            1
comprehension          Sig. (2-tailed)      .378       .518                   .203
                       N                    69         86                     85              91

**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).
6. Multicollinearity
Several variables were identified as having a significant correlation with one
another (dividing words/words per minute; blending/words per minute;
blending/dividing words). This means that these variables may not be
independently accounting for variability in this model. In other words, when
multicollinearity is high, it is difficult to ascertain which of the independent
variables are influencing the dependent variable, and to what extent. This
severely limits the size of R and R-squared, and it reduces the predictive power
of the regression equation, as it is uncertain which variable is contributing to
the dependent variable. Multicollinearity is caused by too much redundancy in the
variables, and can be checked by one of two methods. As indicated above, I chose
the bivariate correlation method, which involves looking for correlations above
approximately the .7 level, as in the sketch below.
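A minimal sketch of that bivariate screen in Python (hypothetical file and column names):

import pandas as pd

df = pd.read_csv("reading_data.csv")  # hypothetical file name
measures = ["blending", "reading_comprehension", "words_per_minute", "syllabication"]

# Pairwise Pearson correlations; flag any pair above the ~.7 rule of thumb.
corr = df[measures].corr()
for i, a in enumerate(measures):
    for b in measures[i + 1:]:
        if abs(corr.loc[a, b]) > 0.7:
            print(a, b, round(corr.loc[a, b], 3))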