
L1 Introduction To Multivariate Analysis PDF

This document provides an introduction to multivariate analysis through a lecture on the topic. It discusses key concepts like variates, measurement error, validity and reliability. It also outlines a structured 6-stage approach to multivariate model building involving defining the problem, developing an analysis plan, evaluating assumptions, estimating the model, interpreting variates, and validating the model. Finally, it provides a classification of multivariate techniques based on whether variables are independent or dependent, the number of dependent variables, how variables are measured, and the research question. An example dataset on lifestyle is also presented to demonstrate these concepts.

Uploaded by

Si Hui Zhuo
Copyright
© All Rights Reserved

Multivariate analysis

WELCOME!

Lecture 1: Introduction
Introduction

- Bradley Efron (1938 – )
REVIEW

1. What is a Test of Hypothesis (TOH)?
2. What is 𝐻0? 𝐻𝐴?
3. What is a type I error?
4. What is a type II error?
5. What is "α" in the TOH?
6. What is "β" in the TOH?
7. What is the statistical power of a TOH?
8. What is a simple linear regression?
9. What are the assumptions of a simple linear regression?
10. What is the goal of an Analysis of Variance (ANOVA)?
11. What is a Confidence Interval (CI)?
12. What is a p-value?
13. How is the p-value used to make a conclusion in the TOH?
14. What is a simple random sample?
Review Table 1-1, page 10 of the text

Effect sizes: 0.2σ 0.5σ


Review
Consider the two-sample t-test with α = 0.05 and power = 0.80. For an effect size of $d$, the sample size needed in each group is* (σ = SD of each group):

$$n \geq 16 \times \left(\frac{\sigma}{d}\right)^2$$

*Lehr, R.: Sixteen s-squared over d-squared: a relation for crude sample size estimates. Statistics in Medicine 11: 1099-1102, 1992.

Commonly used effect sizes are*:

d (effect size)    Label
0.2σ               Small
0.5σ               Medium
0.8σ               Large

*Cohen, J.: Statistical Power Analysis for the Behavioral Sciences, second edition, Academic Press, New York,
1988.
Review

Example: sample size and statistical power

Is the mean IQ (μ) the same for prisoners as for nonprisoners? Test at α = 0.05.

$H_0: \mu_{prisoners} = \mu_{nonprisoners}$ versus $H_A: \mu_{prisoners} \neq \mu_{nonprisoners}$

We wish to reject $H_0$ with probability 0.80 if the true mean difference in IQ between the two groups is 10 or more. It is known that the SD of IQ is $\sigma_{IQ} = 20$. (Note: power = 0.80, $d = 0.5\sigma$.)

What sample size for each group should be taken to satisfy the above conditions?
Review Table 1-1, page 10 of the text

Table 1-1: about 65 per group at power 0.8.

Sample size formula: $n \geq 16 \times \left(\frac{\sigma}{d}\right)^2 = 16 \times \left(\frac{20}{10}\right)^2 = 64$
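
The arithmetic above can be checked with a short Python sketch. The function names are illustrative and not part of the lecture. The second function uses the standard two-sided normal approximation, n = 2(z₁₋α/₂ + z_power)²(σ/d)², which is where the constant 16 comes from: 2 × (1.96 + 0.84)² ≈ 15.7, rounded up.

```python
import math
from statistics import NormalDist


def lehr_sample_size(sigma, d):
    """Per-group sample size from Lehr's rule: n >= 16 * (sigma / d)^2."""
    return math.ceil(16 * (sigma / d) ** 2)


def normal_approx_sample_size(sigma, d, alpha=0.05, power=0.80):
    """Per-group sample size from the two-sided normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for power = 0.80
    return math.ceil(2 * (z_alpha + z_power) ** 2 * (sigma / d) ** 2)


# IQ example from the slide: sigma = 20, difference to detect d = 10
n_lehr = lehr_sample_size(sigma=20, d=10)
n_normal = normal_approx_sample_size(sigma=20, d=10)
print(n_lehr, n_normal)
```

Lehr's rule is deliberately crude. The normal approximation gives a slightly smaller value, and a tabulated value based on the t distribution is slightly larger.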
Multivariate Data

• Multivariate data analysis - introduction


• Basic concepts of multivariate analysis
• A structured approach to multivariate model building
• A classification of multivariate techniques
Multivariate Data
Simple Linear Regression: fit a line to a set of points in two
dimensions, 𝑌 = 𝑋.
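
As a minimal sketch of what "fit a line" means, the following Python computes the ordinary least squares intercept and slope by hand. The slide's 𝑌 = 𝑋 is read here as shorthand for Y = b0 + b1·X + error; the data values and the function name are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of cross-products and squares around the means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, roughly y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_line(xs, ys)
print(round(intercept, 3), round(slope, 3))
```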

Multivariate Data
Multiple Linear Regression: fit a plane to a set of points in
three dimensions, 𝑌 = 𝑋1 + 𝑋2 .

Multivariate Data
Multivariate Linear Regression:

𝑌1 𝑌2 𝑌3 … 𝑌𝑚 = 𝑋1 + 𝑋2 + 𝑋3 + … + 𝑋𝑝

The variate

The variate is the building block of multivariate analysis.

It is a linear combination of variables with empirically determined weights. The (conceptually related) variables are specified by the researcher, and the weights are determined by the multivariate technique.

$$\text{Variate value} = w_1 X_1 + w_2 X_2 + w_3 X_3 + \dots + w_n X_n$$

$w_1, \dots, w_n$: weights, determined by the multivariate technique
$X_1, \dots, X_n$: observed variables (n different variables), specified by the researcher
The variate

Example. Variate:

$$\text{Overall performance} = w_1 \times \text{sprint run score} + w_2 \times \text{long distance run score} + w_3 \times \text{stair climb score}$$
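
A variate value is just a weighted sum, which the following Python sketch makes concrete. The weights and scores are hypothetical: in practice the weights would be estimated by the chosen multivariate technique, not set by hand.

```python
def variate_value(weights, values):
    """Linear combination w1*X1 + w2*X2 + ... + wn*Xn."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * x for w, x in zip(weights, values))


# Hypothetical weights for sprint, long distance run, and stair climb scores
weights = [0.5, 0.3, 0.2]
scores = [80, 70, 90]
overall_performance = variate_value(weights, scores)
print(overall_performance)
```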


Multivariate measurements
Multivariate measurements, or summated scales, can be used when a
single measure is desired to represent a concept.
Several variables are joined in a composite measure.
Example. Three measures of satisfaction,

• overall satisfaction
• the likelihood to recommend
• the probability of purchasing again

can be combined into a summated scale:

$$\text{Overall satisfaction} = w_1 \times \text{satisfaction} + w_2 \times \text{recommend} + w_3 \times \text{purchase again}$$


Measurement error

Measurement error: observed value ≠ true value

Sources of measurement error: data entry errors, inaccurate information (e.g. household income is rarely exactly provided), random fluctuation, dishonest manipulation, etc.

Measurement error adds “noise” to the measured variables, and the “true” effect is partly masked by the measurement error.
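
The masking effect can be illustrated with a small simulation. This is a sketch under stated assumptions rather than anything from the lecture: the true variable is standard normal, the outcome depends on it exactly, and the measurement noise has the same variance as the true variable. The correlation computed from the noisy measurement is weaker than the correlation computed from the true values.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


random.seed(0)  # fixed seed so the run is repeatable
true_x = [random.gauss(0, 1) for _ in range(1000)]
outcome = [2 * x for x in true_x]                        # exact "true" effect
observed_x = [x + random.gauss(0, 1) for x in true_x]    # true value plus noise

r_true = pearson_r(true_x, outcome)
r_observed = pearson_r(observed_x, outcome)
print(round(r_true, 3), round(r_observed, 3))
```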
Validity and reliability

Two measures used to assess measurement error:

• Validity: the degree to which a variable accurately represents what it is intended to measure.

• Reliability: the precision with which a variable measures a given quantity.

Example. Measure overall intelligence with a math test.

Is the math test valid? Is it reliable?


Validity and reliability: dart board analogy

[Figures: four dart boards. Two show widely scattered hits and two show tightly clustered hits.]
A structured approach to multivariate model building
Stage 1: Define the research problem, objectives, and multivariate technique to be used.

• Present the concept, idea, or topic.
• Identify fundamental relationships to be investigated.
• If a dependence relationship is proposed, specify dependent and independent concepts.
• Specify variables for each concept prior to the study.


A structured approach to multivariate model building
Stage 2: Develop the Analysis Plan.
• Desired sample size, allowable types of variables, estimation methods.

Stage 3: Evaluate the assumptions underlying the multivariate technique.
• Evaluate assumptions, both statistical (normality, linearity, etc.) and conceptual (model formulation and types of relationships).
A structured approach to multivariate model building
Stage 4: Estimate the multivariate model and assess overall model fit.
• Estimate model, evaluate overall model fit.
• The model might be respecified to obtain an acceptable model.
A structured approach to multivariate model building
Stage 5: Interpret the variate(s).
• Reveal the nature of the relationship.
• May lead to additional respecifications of variables and/or model formulation.

Stage 6: Validate the multivariate model.
• Use validation methods to perform a final set of diagnostic analyses that assess the degree of generalizability of the results.
A classification of multivariate techniques

Selection of the appropriate multivariate technique depends on the answers to the following four questions:

1) Can the variables be divided into independent/explanatory (X) classifications and dependent/response (Y) classifications based on some theory?
2) If they can, how many variables are treated as dependent variables in a single analysis?
3) How are the variables measured (both dependent and independent variables): metric or nonmetric?
4) What is the research question?


Example: Lifestyle data
Example. In a 2011 study of eating habits of 861 randomly chosen Swedes, the following variables were investigated:

Variable     Description                                               Variable type
Gender                                                                 Categorical
Age          years                                                     Numerical
Education    1=Compulsory school < 9 years                             Ordered categorical (ordinal)
             2=Compulsory school 9 years
             3=High school ≤ 2 years
             4=High school 3 years
             5=College/university < 3 years (excl. graduate school)
             6=College/university ≥ 3 years (excl. graduate school)
             7=Graduate school
Example: Lifestyle data

Variable           Description                                                   Variable type
Household income   kr/year                                                       Numerical
Smoking habits     1=Never smoked                                                Ordinal
                   2=Stopped smoking
                   3=Occasionally
                   4=Daily
BMI                Body Mass Index, kg/m²                                        Numerical
PAL                Physical Activity Level, expresses a person’s daily           Numerical
                   physical activity as a number, and is used to estimate a
                   person’s total energy expenditure
Example: Lifestyle data
Variable           Description                       Variable type
Bread intake       Total daily bread intake (g)      Numerical
Vegetable intake   1=Never                           Ordinal
                   2=< 1 time/month
                   3=1 time/month
                   ...
                   15=4 times/day or more
Candy intake       1=Never                           Ordinal
                   2=< 1 time/month
                   3=1 time/month
                   ...
                   15=4 times/day or more
Fiber              Total daily fiber intake (g)      Numerical
Alcohol            Total daily alcohol intake (g)    Numerical
Example: Lifestyle data

Example. Is there a relationship among the following variables?

Independent variables:
• Age
• Education
• Physical activity level
• Candy intake

Dependent variable:
• Body Mass Index

1) Can the variables be divided into independent classifications and dependent classifications based on some theory?
Example: Lifestyle data

• Age
• Education
• Physical activity level
• Candy intake
• Body Mass Index (one response variable)

2) How many variables are treated as dependent variables in a single analysis?
Example: Lifestyle data

• Age (numerical)
• Education (ordinal)
• Physical activity level (numerical)
• Candy intake (ordinal)
• Body Mass Index (numerical)

3) How are the variables measured (both dependent and independent variables)?
A classification of multivariate techniques

Dependence Techniques

# DVs, DV type    IV type    Method
1, M              M&N        Multiple regression
Several, M        M&N        Multivariate regression
1, N              M&N        Logistic regression
1, N              M&N        Discriminant analysis
Several, M        N          MANOVA
Several, M        M&N        MANCOVA

DV = dependent variable, IV = independent variable, M = metric, N = nonmetric
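
The table can be read as a lookup from the answers to questions 2 and 3 to candidate methods. The Python sketch below is a literal transcription of the six rows. The function name and the string codes are illustrative, and it assumes the IV type is passed exactly as written in the table ("M&N" or "N").

```python
# Rows transcribed from the dependence-techniques table:
# (number of DVs, DV type, IV type, method)
DEPENDENCE_TECHNIQUES = [
    ("1", "M", "M&N", "Multiple regression"),
    ("several", "M", "M&N", "Multivariate regression"),
    ("1", "N", "M&N", "Logistic regression"),
    ("1", "N", "M&N", "Discriminant analysis"),
    ("several", "M", "N", "MANOVA"),
    ("several", "M", "M&N", "MANCOVA"),
]


def candidate_techniques(n_dvs, dv_type, iv_type):
    """Return the methods whose table row matches the given answers."""
    count = "1" if n_dvs == 1 else "several"
    return [
        method
        for row_count, row_dv, row_iv, method in DEPENDENCE_TECHNIQUES
        if (row_count, row_dv, row_iv) == (count, dv_type, iv_type)
    ]


# Lifestyle example: one metric DV (BMI), a mix of metric and ordinal IVs
print(candidate_techniques(1, "M", "M&N"))
```

Combinations that do not appear in the table return an empty list. The fourth question, the research question, still has to decide between rows that match equally well, such as logistic regression and discriminant analysis.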
A classification of multivariate techniques

Interdependence Techniques

Relationships among    Type of variables    Technique
Variables              M&N                  PCA
                       M&N                  EFA, CFA, SEM
Cases                  M&N                  Cluster Analysis
Exercise

Exercise.

Think of a concept or construct that cannot be adequately measured with a single variable. Formulate ≥ 4 variables (indicators) that, taken together, accurately represent the construct of interest.
Example: Restaurant Survey

Example. Following are Likert scale (degree of agreement) survey items for restaurant customers with scale 1 (disagree) – 10 (agree).

1. Our server was respectful
2. The interior has a pleasant decor
3. The noise level was unacceptably high
4. Our server was very friendly
5. Our meals tasted delicious
6. Everything was very clean
7. The seating area was too dark
8. The main dishes were not hot enough
9. Our server was very attentive
10. The room temperature was very comfortable

Discuss dimension reduction for this survey.


Example: Restaurant Survey
Three variates may be able to replace the 10 survey items without losing much information:

Variate     Measured by averaging items
SERVER      1, 4, 9
PHYSICAL    2, 3*, 6, 7*, 10
FOOD        5, 8*

*Response must be reverse ordered. On the 1 to 10 scale this is $y_{new} = 11 - y$, so that 1 becomes 10 and 10 becomes 1.

You have just carried out an informal, “mental” Exploratory Factor Analysis (EFA).
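
The scoring of the three variates can be sketched in Python as follows. The responses are hypothetical. Reverse ordering is taken here as (scale minimum + scale maximum) − y, that is 11 − y on a 1 to 10 scale, so that reversed items stay within 1 to 10; that choice is an assumption of this sketch.

```python
# Item numbers follow the survey list; starred items are negatively worded
VARIATES = {
    "SERVER": [1, 4, 9],
    "PHYSICAL": [2, 3, 6, 7, 10],
    "FOOD": [5, 8],
}
REVERSED_ITEMS = {3, 7, 8}


def score_variates(responses, scale_min=1, scale_max=10):
    """Average the items of each variate after reverse-ordering starred items."""
    scores = {}
    for name, items in VARIATES.items():
        values = []
        for item in items:
            y = responses[item - 1]  # items are numbered from 1
            if item in REVERSED_ITEMS:
                y = scale_min + scale_max - y
            values.append(y)
        scores[name] = sum(values) / len(values)
    return scores


# Hypothetical answers of one customer to items 1..10
responses = [9, 7, 3, 8, 10, 8, 2, 1, 10, 6]
scores = score_variates(responses)
print(scores)
```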
Example: Restaurant Survey

A possible research question that can be addressed by these data is: which aspects of customer satisfaction influence whether the customer returns to the restaurant?

Logistic regression analysis:

Repeat Customer? (Yes, No) = SERVER + PHYSICAL + FOOD
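
Written out in full, the schematic model above corresponds to the usual logistic regression equation. The coefficient symbols $\beta_0, \dots, \beta_3$ are introduced here for illustration and do not appear on the slide:

```latex
\log\frac{P(\text{Repeat} = \text{Yes})}{1 - P(\text{Repeat} = \text{Yes})}
  = \beta_0 + \beta_1\,\text{SERVER} + \beta_2\,\text{PHYSICAL} + \beta_3\,\text{FOOD}
```

A positive coefficient means that a higher score on that variate is associated with higher odds of the customer returning.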


Typical Data Structure

Columns (variables): Y1 Y2 Y3 … Ym X1 X2 X3 … Xp
Rows: cases / observations / subjects

1    67.3  4.2  158.6  …  M  0.114  …
2    71.3  …  M  …
3    65.8  …  F  …
…    …  …
n    68.7  …  M  …
Multivariate Statistical Methods

In almost every multivariate statistical method there are three issues that must be considered:

• Missing data
• Outliers
• Multicollinearity
Missing data

Data layout (“.” = missing value):

         Variables
Cases    X1      X2    X3      X4
1        72.5    M     3.44    1456
2        .       F     2.72    .
3        54.8    M     3.94    2127
4        65.8    M     .       1548
Important considerations:

• percentage of missing values

• pattern of missing values (missing at random?)


Missing data

Major breakthroughs in the handling of missing data came in the 1970s with the advent of

• Maximum likelihood estimation

and

• Multiple imputation

Availability of these methods in statistical packages did not occur until the 1990s.
Missing data

General guidelines:

(i) Any variable or case with 50% or more missing values should be deleted.

(ii) Under 10% missing data for an individual case can generally be ignored if the
missingness is random.

(iii) If less than 25% of the data are missing, then parameter estimates will generally be
accurate*.

*Demirtas, H., Freels, S.A., & Yucel, R.M. (2008). Plausibility of multivariate normality assumption when multiply imputing non-Gaussian continuous outcomes: a simulation assessment. Journal of Statistical Computation and Simulation, 78, 69-84.
Missing data
SIMPLE METHODS FOR HANDLING MISSING DATA

Pairwise deletion: calculations based on all available observations.

Listwise deletion: calculations based only on complete cases (cases with nonmissing values for all variables).

Variables
Case 𝑿𝟏 𝑿𝟐 𝑿𝟑 𝑿𝟒
1 72.5 M . 227
2 68.3 F 2.72 185
3 . M 3.94 178
4 59.7 M 3.54 278

Example:
pairwise deletion: $\bar{X}_1 = 66.833$ is based on 3 observations, $\bar{X}_4 = 217.0$ is based on 4 observations.
listwise deletion: all variable means are based on 2 observations.
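
The two approaches can be reproduced on the four cases above with a short Python sketch. Missing values are coded as None, and the helper names are illustrative.

```python
# The four cases from the table; None marks a missing value
cases = [
    {"X1": 72.5, "X2": "M", "X3": None, "X4": 227},
    {"X1": 68.3, "X2": "F", "X3": 2.72, "X4": 185},
    {"X1": None, "X2": "M", "X3": 3.94, "X4": 178},
    {"X1": 59.7, "X2": "M", "X3": 3.54, "X4": 278},
]


def mean_of_available(rows, variable):
    """Mean of one variable over the rows where it is observed, with the count."""
    values = [row[variable] for row in rows if row[variable] is not None]
    return sum(values) / len(values), len(values)


def complete_cases(rows):
    """Keep only rows with no missing value in any variable."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Pairwise deletion: each mean uses every available observation
pairwise_x1 = mean_of_available(cases, "X1")
pairwise_x4 = mean_of_available(cases, "X4")

# Listwise deletion: only cases 2 and 4 are complete
listwise_rows = complete_cases(cases)
listwise_x1 = mean_of_available(listwise_rows, "X1")

print(pairwise_x1, pairwise_x4, listwise_x1)
```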
Outliers
(i) Never omit an outlier without a strong justification for doing so.

(ii) Univariate outlier detection for metric variables:

let $z$ = standardized score = $\frac{x - \bar{x}}{s}$, then

≤ 80 observations: |z| ≥ 2.5 => outlier
> 80 observations: |z| ≥ 4.0 => outlier
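
A minimal Python sketch of this rule follows. It assumes that s is the sample standard deviation (n − 1 in the denominator) and that the rule applies to the absolute value of z; the data are made up for illustration.

```python
from statistics import mean, stdev


def univariate_outliers(values):
    """Return the positions of observations flagged by the z-score rule."""
    # Cutoff from the rule of thumb: 2.5 for up to 80 observations, else 4.0
    threshold = 2.5 if len(values) <= 80 else 4.0
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation
    return [
        i for i, x in enumerate(values)
        if abs(x - x_bar) / s >= threshold
    ]


# Hypothetical data with one extreme value at position 9
data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 50]
flagged = univariate_outliers(data)
print(flagged)
```

A flagged observation is a candidate for inspection, not for automatic deletion.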


Multicollinearity

(i) What is it? Very high correlation among variables.

(ii) Why is it bad? It depends on the analysis. In regression, for example, multicollinearity leads to extremely high (inflated) variances for the parameter estimates.
Multicollinearity, continued

(iv) How is it detected?

Pairwise correlations: if any two variables have a correlation coefficient higher than 0.9 in absolute value, then multicollinearity exists*.

Variance Inflation Factor (VIF)**:

VIF ≥ 10        => RED LIGHT (almost certain multicollinearity)
8 ≤ VIF < 10    => YELLOW LIGHT (proceed with caution)
VIF < 8         => GREEN LIGHT (may proceed without fear of multicollinearity)

*Tabachnick, B.G. and Fidell, L.S., Using Multivariate Statistics, 5th edition, 2007, page 88.
**Myers, R.H.: Classical and Modern Regression With Applications, Second Edition, PWS-Kent Publishing Co., Boston, MA., 1990, page 369.
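
The traffic-light rule is easy to encode. The Python sketch below also uses the standard definition VIF = 1 / (1 − R²), where R² comes from regressing one predictor on all the others. The function names are illustrative, and the R² values in the example are made up.

```python
def vif_from_r_squared(r_squared):
    """VIF of a predictor, given R^2 from regressing it on the other predictors."""
    return 1.0 / (1.0 - r_squared)


def vif_light(vif):
    """Classify a VIF value with the traffic-light rule of thumb."""
    if vif >= 10:
        return "RED"      # almost certain multicollinearity
    if vif >= 8:
        return "YELLOW"   # proceed with caution
    return "GREEN"        # may proceed


for r2 in (0.50, 0.88, 0.95):
    vif = vif_from_r_squared(r2)
    print(r2, round(vif, 2), vif_light(vif))
```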
Multicollinearity, continued

(v) What to do if it exists?

Simplest solutions:

· drop one of the two variables that are highly correlated

· combine the two variables in a meaningful way to create a single variable. For example, if HEIGHT and WEIGHT are highly correlated then use

$$\text{BODY MASS INDEX} = k \times \frac{\text{WEIGHT}}{\text{HEIGHT}^2}$$

where $k$ = multiplicative constant that depends on the units of measurement.
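
As a small illustration of the combined variable, the Python sketch below computes the index for two unit systems. The values k = 1 for kilograms and metres and k = 703 for pounds and inches are the commonly used constants; the example person is hypothetical.

```python
def body_mass_index(weight, height, k=1.0):
    """BMI = k * weight / height^2; k depends on the units of measurement."""
    return k * weight / height ** 2


# k = 1 when weight is in kilograms and height in metres
bmi_metric = body_mass_index(70, 1.75)

# k = 703 (approximate) when weight is in pounds and height in inches
bmi_imperial = body_mass_index(154, 69, k=703)

print(round(bmi_metric, 2), round(bmi_imperial, 2))
```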
Rules of Thumb

Generally, throughout this course, “Rules of Thumb” for such things as sample size guidelines, goodness of fit measures, etc. will be presented in RED.
