
Leora Fisher

EDPS 612.02 - Psychological Measurement and Statistics

Assignment #1
1. Why would someone use multivariate statistics instead of several univariate
analyses? What are the benefits and drawbacks of a multivariate analysis?
The choice between multivariate and univariate statistics is based on the
researcher's question. Multivariate statistics allow the researcher to compare
multiple variables in a treatment study, whereas univariate statistics examine
categorical and/or quantitative variables separately. Multivariate statistics
let us examine the relationships among many variables, and can include a
mixture of quantitative and qualitative/categorical variables. Although we can
explore multiple relationships by running several univariate analyses, it is
much more efficient to use multivariate statistics. Multiple univariate
analyses address questions about individual outcome variables, but they do not
reveal the interrelationships among those outcomes. Multivariate analyses
address questions about overall effects and allow the study of multiple
systems/variables (effects, comparisons, order, differences, contributions,
and identifying underlying constructs).
The benefits of multivariate statistics are numerous and varied:
- Multivariate statistics give a more complete and detailed description of the
variables, as they allow the researcher to investigate (compare) multiple
criterion measures.
- Because subjects respond differently to treatments, and treatments often
carry a high cost (financial, time, effort, etc.), multivariate statistics are
an efficient and effective way of gaining information from one study rather
than conducting multiple univariate analyses.
- Multivariate methods allow researchers to examine a much more complex array
of variables simultaneously, with results providing less error and more
validity than examining each variable in isolation.
Additional benefits include greater flexibility and more options for analyses,
better understanding of published research, the ability to examine large data
sets while controlling for error and accounting for correlations, specific
assessment measures for determining whether analyses are behaving as expected,
and more positive attitudes toward statistics based on better understanding.
Furthermore, the benefits of using multivariate analysis over repeated
univariate tests are:
- reduced Type I error,
- the ability to examine correlations between dependent variables,
- better detection of variable differences (within-group and between-group),
- the amount each variable contributes to the overall significance,
- the ability to make predictions about independent variables and their effect
on the dependent variable.

Drawbacks of using multivariate statistics are typically related to the size
and complexity of multivariate designs. Three specific problems are
identified:
- the statistical assumptions of the linear model must be met in the
multivariate method,
- larger numbers of participants are needed,
- the interpretation of results may be difficult given the complexity of the
output.
Other drawbacks described in the readings include the fact that multivariate
methods may be challenging to learn given their complexity, and that their
broader focus requires more expansive thinking.
2. Why are assumptions so important in statistics? What do assumptions give us?
For example, when conducting an ANOVA what assumptions must we make and
how do they affect the conclusions that we make? What makes the assumptions of
multivariate statistics so challenging to ascertain?
Assumptions provide the basis for performing any statistical measure. For
example, when conducting an ANOVA (or any linear-model analysis), we assume
that the independent variables are at the nominal (categorical) level, that
the groups are mutually exclusive, and that the variances of the groups are
equal (homogeneity of variance). The dependent variables are continuous and
fall within a normal distribution.
Assumptions are also critical when considering errors. In linear regression,
it is assumed that errors:
- are independent (one error does not influence another),
- follow a normal distribution,
- have a constant variance (they do not vary more widely at any point in the
distribution).
It is necessary to assess the assumptions about the error distribution.
Researchers must examine the assumed relationship between the two variables
(residuals vs. the fitted values). Secondly, the assumption of normally
distributed errors is checked by examining how closely the data (on P-P plots)
follow the 45-degree line. As well, checking the assumption of equal variances
is necessary. Within-group, between-group, and total variance affect our
assumptions and must be accounted for. If this assumption is violated, use a
more conservative alpha level for determining significance for that variable,
such as .025 or .01 rather than the typical .05 level.
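The equal-variance check described above can be illustrated with a rough sketch. The group scores below are hypothetical (not the assignment's data), and the cutoff of 4 is a common rule of thumb, not a formal test:

```python
from statistics import pvariance

# Hypothetical scores for two groups; a large variance ratio is one signal
# to adopt the more conservative alpha level discussed above.
group1 = [98, 105, 110, 95, 102]
group2 = [60, 140, 90, 170, 40]

# Ratio of the larger group variance to the smaller one.
ratio = (max(pvariance(group1), pvariance(group2))
         / min(pvariance(group1), pvariance(group2)))
print(ratio > 4)  # flags a clear violation of homogeneity of variance
```

A formal test (e.g., Levene's) would normally be used in practice; the ratio here is only a screening heuristic.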


Assumptions also inform our hypotheses, for example, whether to use a
one-tailed or two-tailed test. A one-tailed test is used when we are testing a
variable for which the rejection region lies in one direction (one tail) of
the sampling distribution. If we should be using a two-tailed test and our
assumptions are faulty, our hypothesis will be faulty as well, invalidating
the statistical measures.
Assumptions in multivariate statistics are difficult to ascertain because
there are so many factors to consider. Without consideration of all of these
factors, however, our assumptions, inferences, and conclusions will be faulty.
Weak assumptions negatively influence inferences and the application of the
data.
3. Given these matrices:

        | 2  1 |          | 4  2 |          | 1  0 |
    A = | 3  0 |      B = | 4  2 |      C = | 0  1 |
        | 7  5 |          | 4  2 |

Calculate the following (show your work):


a) A - B

    | 2  1 |   | 4  2 |   | -2  -1 |
    | 3  0 | - | 4  2 | = | -1  -2 |
    | 7  5 |   | 4  2 |   |  3   3 |
b) B - A

    | 4  2 |   | 2  1 |   |  2   1 |
    | 4  2 | - | 3  0 | = |  1   2 |
    | 4  2 |   | 7  5 |   | -3  -3 |
c) A + B

    | 2+4  1+2 |   |  6  3 |
    | 3+4  0+2 | = |  7  2 |
    | 7+4  5+2 |   | 11  7 |
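The element-by-element arithmetic in parts (a) through (c) can be checked with a short sketch; the helper names are mine, not part of the assignment:

```python
def mat_add(X, Y):
    # add corresponding entries; defined only for same-shaped matrices
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_sub(X, Y):
    # subtract corresponding entries
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

A = [[2, 1], [3, 0], [7, 5]]
B = [[4, 2], [4, 2], [4, 2]]

print(mat_sub(A, B))  # [[-2, -1], [-1, -2], [3, 3]]
print(mat_sub(B, A))  # [[2, 1], [1, 2], [-3, -3]]
print(mat_add(A, B))  # [[6, 3], [7, 2], [11, 7]]
```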

        | 1.0  .9 |          | 4  2  1 |
    D = |  .9 1.0 |      E = | 3  3  2 |
                             | 2  4  5 |


d) A + C

    | 2  1 |   | 1  0 |
    | 3  0 | + | 0  1 | = not defined (the matrices have different dimensions)
    | 7  5 |
e) CD

        | 1  0 |          | 1.0  .9 |
    C = | 0  1 |      D = |  .9 1.0 |

    (CD)11 = (1, 0) . (1.0, .9) = 1(1.0) + 0(.9)  = 1.0
    (CD)12 = (1, 0) . (.9, 1.0) = 1(.9)  + 0(1.0) = .9
    (CD)21 = (0, 1) . (1.0, .9) = 0(1.0) + 1(.9)  = .9
    (CD)22 = (0, 1) . (.9, 1.0) = 0(.9)  + 1(1.0) = 1.0

         | 1.0  .9 |
    CD = |  .9 1.0 |
f) C'D

Because C is the identity matrix, its transpose C' is also the identity
matrix, and multiplying D by the identity leaves D unchanged:

          | 1  0 |
    C' =  | 0  1 |

                | 1.0  .9 |
    C'D = D =   |  .9 1.0 |

g) Determinant of E

        | 4  2  1 |
    E = | 3  3  2 |
        | 2  4  5 |

    Element     Minor                        Cofactor    Element x Cofactor
    a11 = 4     |3 2; 4 5| = 15 - 8 = 7        7          4 x 7 = 28
    a12 = 2     |3 2; 2 5| = 15 - 4 = 11      -11         2 x (-11) = -22
    a13 = 1     |3 3; 2 4| = 12 - 6 = 6        6          1 x 6 = 6

    Therefore |E| = 28 + (-22) + 6 = 12
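The cofactor (Laplace) expansion along the first row can be sketched in code (the function name is mine):

```python
def det3(M):
    # Laplace expansion along the first row: a11*M11 - a12*M12 + a13*M13
    a, b, c = M[0]
    m11 = M[1][1] * M[2][2] - M[1][2] * M[2][1]   # minor of a11
    m12 = M[1][0] * M[2][2] - M[1][2] * M[2][0]   # minor of a12
    m13 = M[1][0] * M[2][1] - M[1][1] * M[2][0]   # minor of a13
    return a * m11 - b * m12 + c * m13

E = [[4, 2, 1], [3, 3, 2], [2, 4, 5]]
print(det3(E))  # 4*7 - 2*11 + 1*6 = 12
```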

h) Inverse of E
As above, plus the remaining minors and cofactors:

    Element     Minor                        Cofactor
    a21 = 3     |2 1; 4 5| = 10 - 4 = 6       -6
    a22 = 3     |4 1; 2 5| = 20 - 2 = 18      18
    a23 = 2     |4 2; 2 4| = 16 - 4 = 12     -12
    a31 = 2     |2 1; 3 2| = 4 - 3 = 1         1
    a32 = 4     |4 1; 3 2| = 8 - 3 = 5        -5
    a33 = 5     |4 2; 3 3| = 12 - 6 = 6        6

The adjugate (the transpose of the cofactor matrix) divided by |E| = 12 gives
the inverse:

             |  7/12   -1/2    1/12 |
    E^-1  =  | -11/12   3/2   -5/12 |
             |  1/2    -1      1/2  |
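The adjugate construction can be verified with exact fractions; this is a sketch with my own helper names, not part of the assignment:

```python
from fractions import Fraction

def det3(M):
    # determinant of a 3x3 matrix by expansion along the first row
    a, b, c = M[0]
    return (a * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - b * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + c * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def inv3(M):
    d = det3(M)
    def cof(i, j):
        # signed determinant of the 2x2 minor left after deleting row i, col j
        rows = [r for k, r in enumerate(M) if k != i]
        minor = [[x for l, x in enumerate(r) if l != j] for r in rows]
        sign = -1 if (i + j) % 2 else 1
        return sign * (minor[0][0] * minor[1][1] - minor[0][1] * minor[1][0])
    # adjugate = transpose of the cofactor matrix; divide by the determinant
    return [[Fraction(cof(j, i), d) for j in range(3)] for i in range(3)]

E = [[4, 2, 1], [3, 3, 2], [2, 4, 5]]
print(inv3(E)[0])  # [Fraction(7, 12), Fraction(-1, 2), Fraction(1, 12)]
```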

4. What does matrix D tell us about the relationship between the two variables
represented in the matrix? What kind of matrix is this?
Matrix D is a Correlation Matrix (R). This is a [p x p] square and symmetrical
matrix, with 1.0 on the diagonal and the off-diagonals showing the correlations
between the variables (i.e. magnitude and direction of the relationship between
the variables). Matrix D indicates a strong positive correlation between the
variables at .9.
5. What kind of matrix is matrix B?
Matrix B is a matrix of means (an n x p matrix in which every row contains the
variable means).
6. What kind of matrix is matrix C?
Matrix C is a 2x2 Diagonal Matrix, specifically an Identity Matrix of the second
order.
7. Why is it important to have a non-zero determinant?
The determinant represents the generalized variance for a set of variables. If
it is zero, at least one variable is a linear combination of the others (the
variables are completely redundant), and the generalized variance is zero.
Only matrices with non-zero determinants are invertible; the inverse of a
matrix can exist only if the determinant is not zero (just as, with ordinary
division, one cannot divide a number by zero). Additionally, statistics that
compare between-group and within-group variability could not be calculated
without a non-zero determinant. Lastly, if variables are redundant,
eliminating the redundant variables will correct the corresponding errors.
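A tiny hypothetical example (mine, not from the assignment) shows why redundancy forces the determinant to zero:

```python
# The second row is exactly half the first, so the rows are linearly
# dependent (completely redundant).
S = [[2, 4],
     [1, 2]]
det_S = S[0][0] * S[1][1] - S[0][1] * S[1][0]
print(det_S)  # 0 -- inverting S would require dividing by this, so no inverse exists
```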
8. Prepare the attached data set for multivariate analyses (base this on the data
cleaning attachment that I included). Describe the key features of the data
(outliers, missing data, distributional forms, homoscedasticity, multicollinearity
etc), and decide what should be done to this data to prepare it for analysis.
Imagine that these data were collected from two groups, group 1 was the control
group and group 2 received a reading intervention. Scores range from 0 to 200.
That is all that you need to know about the data.


1. Descriptive Statistics
A) Out-of-Range Values
A visual scan of the frequency tables in SPSS shows a few out-of-range values.
Because the data set is not large and there are relatively few errors, these
values could be checked against the original data and corrected prior to
further analyses. Two values fell below the floor of 0 (Blending), one fell
above the ceiling of 200 with a score of 250 (Reading Comprehension), and
there was one score below 0 and one above 200 for WPM. Errors in the data can
be addressed in one of three ways:
Leave it unchanged. The most conservative course of action is to accept the
value as a valid response and make no change to it. The larger the sample
size, the less one response will affect the analysis; the smaller the sample
size, the more impact it is likely to have.
Correct the data. Imputation, replacing the values with imputed ones such as
the mean for that variable, is possible, and could be recommended for
Blending because the percentage of missing data there will affect
interpretation.
Delete the data. If the erroneous value seems illogical and is so far from the
mean that it will affect the statistics, it may be appropriate to delete
either the response in question or the entire record. Again, this is a strong
possibility for Blending because, as we will see later, Blending distorts the
distribution of errors as well as the Cook's and Mahalanobis measures.
    Variable                 Out-of-range value
    Reading comprehension    250
    Blending                 -13.35
    Blending                 -15.20
    WPM                      -3.82
    WPM                      300
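The range check described above can be sketched in a few lines. The score lists are illustrative stand-ins that contain the reported bad values, not the full data set:

```python
def out_of_range(scores, low=0, high=200):
    # flag any score outside the instrument's valid 0-200 range
    return [s for s in scores if s < low or s > high]

reading_comp = [112, 250, 98]             # includes the reported 250
blending = [64, -13.35, -15.20, 77]       # includes the reported negatives
wpm = [88, -3.82, 300]                    # includes the reported -3.82 and 300

print(out_of_range(reading_comp))  # [250]
print(out_of_range(blending))      # [-13.35, -15.2]
print(out_of_range(wpm))           # [-3.82, 300]
```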

B) Plausible Means and Standard Deviations
There may be unequal variance in reading words per minute and in blending, as
these show more variance than the other variables; this is something to watch
closely through the remaining analyses.
    Statistics
                      group     blending    Reading         Reading words   Ability to divide
                                            comprehension   per minute      words into syllables
    N  Valid          99        75          91              92              93
       Missing        0         24          8               7               6
    Mean              1.4545    108.6185    64.0038         76.2793         61.6135
    Std. Deviation    .50046    36.10021    27.52857        35.98013        19.78342

C) Univariate Outliers
There were several outliers as indicated on the box plot, but the extreme
outliers are explained by the out-of-range data, indicating errors in the
data. The exception is subject 94 (Syllabication); that outlier falls within
the 0-200 range. As indicated by the stars (*), there are three serious
outliers (15 in reading comprehension, 55 in WPM, and 94 in dividing words)
that will impact the analyses and limit generalizability. As well, cases 29
and 66 (in blending) need to be considered because they are out of range. As
there are only a few outliers, they can be examined individually to discover
why they are extreme. To address the outliers, there are three steps:
1. Check the accuracy of the data collection and entry so that errors can be
corrected. Although the data in question are the result of errors, an
examination of the variables shows that no one variable is responsible for
most of the outliers (they exist on all four at some level); therefore all
variables may be kept.
2. Determine whether the outliers are part of the intended population, and
delete or keep them accordingly.
3. Reduce the impact of the outliers by either transforming or changing the
scores.
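The boxplot's extreme-outlier rule can be approximated in code. This is a rough sketch: SPSS flags extreme values (*) at three interquartile ranges beyond the quartiles, my quartile calculation is simplified, and the scores are hypothetical:

```python
def extreme_outliers(xs, k=3.0):
    # crude quartiles from sorted positions; flag values beyond k IQRs
    s = sorted(xs)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]
    iqr = q3 - q1
    return [x for x in s if x < q1 - k * iqr or x > q3 + k * iqr]

scores = [95, 102, 98, 110, 104, 99, 101, 97, 250]  # one gross error
print(extreme_outliers(scores))  # [250]
```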

2. Missing Data
Missing data are within an acceptable range for all areas except blending. The
percentage of missing data is approximately 9% in reading comprehension, 8%
in reading words per minute, and 6% in dividing words. In these cases, the
data could be imputed using the mean for that variable (mean substitution).
However, the missing data for blending is almost 33%, which is considered too
high. The first step in addressing missing data is to check the pattern of
the missing data. If the pattern were random and the percentages small, one
could use expectation maximization (EM) methods, or impute the data using a
regression equation to predict the missing values once the out-of-range
values were corrected; however, given that the missing data for blending is
high, these options are not appropriate for this situation. Also, using a
multiple-data correlation would not be optimal because one variable has
considerable missing data (the missing data are not scattered evenly across
variables and the missing rate is high). In this case, multiple imputation is
most likely necessary to minimize the impact of the missing data. This method
makes fewer assumptions about whether the data are missing at random, and it
retains sampling variability. The final method for addressing the missing
data is to repeat the analyses both with and without the missing data. This
is a viable option because an imputation method is going to be used, the
number of missing values is high, and the data set is not large.
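The mean-substitution fallback mentioned above can be sketched as follows; the mini-sample is hypothetical, with `None` standing in for a missing score:

```python
from statistics import mean

def percent_missing(xs):
    # share of entries recorded as missing (None)
    return 100 * sum(x is None for x in xs) / len(xs)

def impute_mean(xs):
    # simple mean substitution: replace each missing entry with the mean of
    # the observed values (this understates variability, unlike multiple
    # imputation, which is why the text prefers the latter here)
    m = mean(x for x in xs if x is not None)
    return [m if x is None else x for x in xs]

blending = [60, None, 70, None, 80]
print(percent_missing(blending))  # 40.0
print(impute_mean(blending))      # [60, 70, 70, 70, 80]
```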
    Statistics
                   Reading         Reading words   Ability to divide      blending
                   comprehension   per minute      words into syllables
    N  Valid       91              92              93                     75
       Missing     8               7               6                      24

3. Linear Relationships
As illustrated by the scatterplots below, a linear relationship cannot be
assumed between all variables. In checking the assumptions of the regression
model, the points on a scatterplot should be scattered evenly. If the data
points are not scattered across the graph (as in the scatterplots below), it
is necessary to reconsider the assumptions of normality of error. Graph b
indicates a large correlation between two dependent variables (blending and
syllabication). As previously alluded to, this is a trend that becomes
apparent throughout the analysis: the blending variable continues to be
problematic in the statistical measures completed, and would need to be dealt
with.
It will also be necessary to identify the outliers on y using standardized
residuals, and the outliers on x by looking for hat elements (leverage
values, which lie between 0 and 1) that are unusually large.
a. [scatterplot omitted]
b. [scatterplot omitted: blending vs. syllabication]
c. [scatterplot omitted]
d. [scatterplot omitted]

4. Normality
Skewness and kurtosis values close to zero indicate a normal distribution.
The skewness and kurtosis values for all of the variables are elevated,
demonstrating non-normal distributions. The histograms show that the
distributions are offset and that there are obvious outliers; therefore, an
assumption of normality would be erroneous here. Eliminating the problematic
data or transforming the data (for example, with a square-root
transformation) are the best ways to deal with this problem.
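Skewness and excess kurtosis are standardized moments and can be computed directly. This sketch uses the simple population formulas rather than SPSS's adjusted estimators, so its values differ slightly from SPSS output, and the data are hypothetical:

```python
from statistics import mean, pstdev

def skewness(xs):
    # third standardized moment (population form, not SPSS's adjusted G1)
    m, s, n = mean(xs), pstdev(xs), len(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / n

def excess_kurtosis(xs):
    # fourth standardized moment minus 3, so a normal distribution gives 0
    m, s, n = mean(xs), pstdev(xs), len(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / n - 3

symmetric = [1, 2, 3, 4, 5]
skewed = [1, 1, 1, 1, 10]   # hypothetical right-skewed scores
print(skewness(symmetric))  # 0.0 -- symmetric data
print(skewness(skewed) > 1) # True -- strong positive skew
```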

    Statistics
                              Reading         Reading words   Ability to divide      blending   group
                              comprehension   per minute      words into syllables
    N  Valid                  91              92              93                     75         99
       Missing                8               7               6                      24         0
    Skewness                  4.765           2.452           3.633                  -1.385     .185
    Std. Error of Skewness    .253            .251            .250                   .277       .243
    Kurtosis                  28.890          15.602          25.391                 2.411      -2.007
    Std. Error of Kurtosis    .500            .498            .495                   .548       .481

5. Mahalanobis Distance

                      Frequency   Percent   Valid Percent   Cumulative Percent
    35.93451          1           1.0       1.7             96.7
    46.41977          1           1.0       1.7             98.3
    54.33985          1           1.0       1.7             100.0
    Valid Total       60          60.6      100.0
    Missing System    39          39.4
    Total             99          100.0

Skewness and kurtosis values are at a considerable distance from zero, which
indicates that the distribution is not normal. An examination of the
Mahalanobis distance frequency chart in SPSS shows multivariate outliers
beyond the critical value of 18.73 given in the Stevens (2009) chart (35.93,
46.42, and 54.34). Again, eliminating Blending as a variable would correct
the data so that these outliers do not affect the Mahalanobis scores.
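For two variables, the squared Mahalanobis distance can be computed with the explicit 2x2 inverse of the covariance matrix. The numbers below are hypothetical toy values, not the assignment's data:

```python
def mahalanobis_sq(x, mu, S):
    # squared Mahalanobis distance (x - mu)' S^-1 (x - mu) for two variables
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]   # must be non-zero (see Q7)
    a, b = S[1][1] / det, -S[0][1] / det          # entries of S^-1
    c, e = -S[1][0] / det, S[0][0] / det
    return d0 * (a * d0 + b * d1) + d1 * (c * d0 + e * d1)

# with an identity covariance this reduces to squared Euclidean distance
print(mahalanobis_sq([3, 4], [0, 0], [[1, 0], [0, 1]]))  # 25.0
```

In practice the distance is compared against a chi-square critical value (with degrees of freedom equal to the number of variables), which is where the 18.73 cutoff above comes from.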
Cook's Distance

                      Frequency   Percent   Valid Percent   Cumulative Percent
    1.18532           1           1.0       1.7             96.7
    1.73923           1           1.0       1.7             98.3
    19.22560          1           1.0       1.7             100.0
    Valid Total       60          60.6      100.0
    Missing System    39          39.4
    Total             99          100.0


The Cook's distance chart in SPSS indicates that there are three data points
above 1, which marks them as influential data points of concern. Two values
fall slightly above the recommended limit of 1 (1.19 and 1.74), while one
falls far outside the limit at 19.23. Eliminating Blending will correct the
Cook's distance measure.

Ability to divide
words into

Reading words per Reading

syllables

minute

comprehension

Pearson Correlation 1

.476**

.270*

.108

Sig. (2-tailed)

.000

.024

.378

70

70

69

.314**

.071

.003

.518

93

88

86

.140

blending
blending

75

Ability to divide Pearson Correlation .476**


words into

Sig. (2-tailed)

.000

syllables

70

Reading words

Pearson Correlation .270*

.314**

per minute

Sig. (2-tailed)

.024

.003

70

88

92

85
1

.203

Reading

Pearson Correlation .108

.071

.140

comprehension

Sig. (2-tailed)

.378

.518

.203

69

86

85

6.

91

Multicollinearity
Several variables were identified as having fairly significant correlations
with one another (dividing words / words per minute; blending / words per
minute; blending / dividing words). This means that these variables may not
be independently accounting for variability in the model. In other words,
when multicollinearity is high, it is difficult to ascertain which of the
independent variables are influencing the dependent variable, and to what
extent. This limits the size of R and R-squared and reduces the predictive
power of the regression equation, because it is uncertain which variable is
contributing to the dependent variable. Multicollinearity is caused by too
much redundancy among the variables and can be checked by one of two methods;
as indicated above, I chose a bivariate correlation method, which involves
looking for correlations above approximately the .7 level.
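The bivariate screen described above can be sketched directly from the product-moment formula; the score pairs are hypothetical, and the .7 cutoff mirrors the one used in the text:

```python
def pearson_r(xs, ys):
    # product-moment correlation from deviation cross-products
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

a = [10, 20, 30, 40, 50]   # hypothetical scores on one predictor
b = [12, 24, 33, 41, 55]   # hypothetical scores on a second predictor
r = pearson_r(a, b)
print(abs(r) > 0.7)  # True -- this pair would be flagged as collinear
```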


To reduce the impact of multicollinearity, it is necessary to use semipartial
correlations to determine how much impact each independent variable has on
the dependent variable. This allows us to use a calculation that analyzes
each independent (x) variable separately by controlling for the influence of
the other x variables.
