PATH ANALYSIS
When a model involves intermediate dependent variables, it becomes too complex for ordinary regression, and path analysis comes in handy. Path analysis is an extension of multiple regression: it allows the analysis of more complicated models. It is helpful in situations where there are multiple intermediate dependent variables, and in situations where Z depends on variable Y, which in turn depends on variable X. It can also compare different models to determine which one best fits the data.
Path analysis was earlier also known as 'causal modeling'; however, after strong criticism, people refrain from using that term because it is not possible to establish causal relationships using statistical techniques alone. Causal relationships can only be established through experimental designs. Path analysis can be used to disprove a model that suggests a causal relationship among variables; however, it cannot be used to prove that a causal relation exists among them.
Let's go through the terminology used in path analysis. We don't label variables as independent or dependent here; rather, we call them exogenous or endogenous variables. Exogenous variables (the independent variables of the regression world) have arrows starting from them but none pointing towards them. Endogenous variables have at least one arrow pointing towards them. The reason for this nomenclature is that the factors that cause or influence exogenous variables exist outside the system, while the factors that cause endogenous variables exist within the system. In the X-Y-Z example above, X is an exogenous variable, while Y and Z are endogenous variables. A typical path diagram is shown below.
In that diagram, A, B, C, D and E are exogenous variables, while I and O are endogenous variables. 'd' is a disturbance term, which is analogous to the residual in regression.
Now, let's go through the assumptions that we need to check before we use path analysis. Since path analysis is an extension of multiple regression, most of the assumptions of multiple regression hold true for path analysis as well.
1. All the variables should have linear relations with each other.
2. Endogenous variables should be continuous. In the case of ordinal data, the minimum number of categories should be five.
3. There should be no interaction among variables. If an interaction exists, a separate term or variable can be added that reflects the interaction between the two variables.
4. Disturbance terms should be uncorrelated, i.e., the covariance among the disturbance terms is zero.
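For assumption 3, the usual remedy is to represent an interaction as an explicit product variable, so the model itself stays additive. A minimal base-R sketch (all names here are illustrative, not from any real dataset):

```r
set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)
# represent the x1-x2 interaction as its own variable; it can then
# enter a path model like any other exogenous predictor
x1x2 <- x1 * x2
dat <- data.frame(x1, x2, x1x2)
head(dat, n = 3)
```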
Now, let’s move a step ahead and understand the implementation of path analysis in
R. We will first try out with a toy example and then take a standard dataset available
in R.
install.packages("lavaan")
install.packages("OpenMx")
install.packages("semPlot")
install.packages("GGally")
install.packages("corrplot")
library(lavaan)
library(semPlot)
library(OpenMx)
library(GGally)
library(corrplot)
Now, let's create our own dataset and try out path analysis. Please note that the rationale for this exercise is to develop intuition for how path analysis works.
For example:
# Let's create our own dataset and play around that first
set.seed(11)
a = 0.5
b = 5
c = 7
d = 2.5
x1 = rnorm(20, mean = 0, sd = 1)
x2 = rnorm(20, mean = 0, sd = 1)
x3 = runif(20, min = 2, max = 5)
Y = a*x1 + b*x2
Z = c*x3 + d*Y
data1 = data.frame(x1, x2, x3, Y, Z)
head(data1, n = 10)
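As a quick sanity check (this step is my addition, not part of the original walkthrough): because Y was generated without any noise, an ordinary regression should recover the generating coefficients a = 0.5 and b = 5 almost exactly.

```r
# recreate the toy variables with the same seed as above
set.seed(11)
x1 <- rnorm(20, mean = 0, sd = 1)
x2 <- rnorm(20, mean = 0, sd = 1)
Y <- 0.5*x1 + 5*x2
# Y is a deterministic function of x1 and x2, so the fitted
# coefficients match the generating values up to rounding error
fit_y <- lm(Y ~ x1 + x2)
round(coef(fit_y), 3)
```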
Now that we have created this dataset, let's look at the correlation matrix for these variables. It will tell us which variables are correlated with each other, and how strongly.
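The correlation matrix itself can be computed with base R's cor(); the chart discussed below was presumably drawn with the corrplot package loaded earlier (the exact plotting call is not shown in the original, so the one commented out here is an assumption):

```r
# rebuild the toy data from above (same seed, same formulas)
set.seed(11)
x1 <- rnorm(20, mean = 0, sd = 1)
x2 <- rnorm(20, mean = 0, sd = 1)
x3 <- runif(20, min = 2, max = 5)
Y <- 0.5*x1 + 5*x2
Z <- 7*x3 + 2.5*Y
data1 <- data.frame(x1, x2, x3, Y, Z)

# numeric correlation matrix, rounded for readability
cmat <- round(cor(data1), 2)
print(cmat)

# chart version, as referred to in the text:
# corrplot(cor(data1))
```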
The correlation matrix shows that Y is very strongly correlated with x2, while Z is strongly correlated with both x2 and Y. The impact of x1 on Y is not as strong as that of x2.
model1 = 'Z ~ x1 + x2 + x3 + Y
Y ~ x1 + x2'
fit1 = cfa(model1, data = data1)
summary(fit1, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE)
Number of observations                 20

Estimator                              ML
Model Fit Test Statistic               NA
Degrees of freedom                     NA
P-value                                NA

Parameter Estimates:

Information                            Expected
Information saturated (h1) model       Structured
Standard Errors                        Standard

Regressions:
                Estimate  Std.Err  z-value  P(>|z|)  Std.lv  Std.all
  Z ~
    x1             0.721       NA                     0.721    0.072
    x2             0.328       NA                     0.328    0.028
    x3             1.915       NA                     1.915    0.179
    Y              1.998       NA                     1.998    0.867
  Y ~
    x1             0.500       NA                     0.500    0.115
    x2             5.000       NA                     5.000    0.968

Variances:
                Estimate  Std.Err  z-value  P(>|z|)  Std.lv  Std.all
   .Z            14.773       NA                    14.773    0.215
   .Y             0.000       NA                     0.000    0.000

R-Square:
                Estimate
    Z              0.785
    Y              1.000

(The NA test statistic and standard errors are expected here: Y was generated as an exact, noise-free function of x1 and x2, so its residual variance is zero and Y is perfectly collinear with x1 and x2 among Z's predictors. The model is therefore degenerate, and the usual ML standard errors cannot be computed.)
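The summary(fit2) output below refers to a second model, fitted to R's built-in mtcars dataset. The code that creates fit2 is missing from the original text; reconstructed from the variable names in the output, it was presumably along these lines (the name model_mt is my placeholder):

```r
library(lavaan)

# mpg is regressed on seven predictors, and hp (an intermediate
# endogenous variable) is itself predicted by cyl, disp and carb
model_mt <- 'mpg ~ hp + gear + cyl + disp + carb + am + wt
             hp ~ cyl + disp + carb'
fit2 <- sem(model_mt, data = mtcars)
```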
> summary(fit2)
lavaan (0.6-1) converged normally after 62 iterations
Number of observations 32
Estimator ML
Model Fit Test Statistic 7.901
Degrees of freedom 3
P-value (Chi-square) 0.048
Parameter Estimates:
Information Expected
Information saturated (h1) model Structured
Standard Errors Standard
Regressions:
Estimate Std.Err z-value P(>|z|)
mpg ~
hp -0.022 0.016 -1.388 0.165
gear 0.586 1.247 0.470 0.638
cyl -0.848 0.710 -1.194 0.232
disp 0.006 0.012 0.512 0.609
carb -0.472 0.620 -0.761 0.446
am 1.624 1.542 1.053 0.292
wt -2.671 1.267 -2.109 0.035
hp ~
cyl 7.717 6.554 1.177 0.239
disp 0.233 0.087 2.666 0.008
carb 20.273 3.405 5.954 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.mpg 5.011 1.253 4.000 0.000
.hp 644.737 161.184 4.000 0.000
In the summary output above, we can see that wt is a significant variable for mpg at the 5 percent level, while disp and carb are significant variables for hp. hp itself is not a significant variable for mpg. We will examine this model with a path diagram drawn using the semPlot package.
> semPaths(fit2, 'std', 'est', curveAdjacent = TRUE, style = "lisrel")
The above plot shows that mpg depends strongly on wt, while hp depends strongly on disp and carb. There is only a weak relation between hp and mpg; the same inference was derived from the summary output above.
The semPaths function can draw this chart in multiple ways. You can go through the documentation for semPaths and explore the different options.
There are a few considerations that you should keep in mind while doing path analysis. Path analysis is very sensitive to the omission or addition of variables in the model: leaving out a relevant variable or adding an extra one may significantly change the results. Also, path analysis is a technique for testing models, not for building them. If you were to use path analysis to build models, you could end up with an endless combination of candidate models, and choosing the right one would not be feasible. So path analysis should be used to test a specific model, or to compare multiple models and choose the best one.
There are numerous other ways you can use path analysis. We would love to hear
your experiences of using path analysis in different contexts. Please share your
examples and experiences in the comments section below.
Path analysis is a special case of structural equation modeling (SEM). There are a few packages for SEM in R, such as lavaan and sem.
A simple example: x1 and x2 affect x3, and x1 affects x2.
##############R-code##############
library(lavaan)

model1 <- 'x3 ~ x1 + x2
           x2 ~ x1'

# fit the model; dat is assumed to be a data frame
# containing the variables x1, x2 and x3
fit1 <- sem(model1, data = dat)

# summary of the fitted model
summary(fit1)

# check the coefficients
coef(fit1)

# and as a data frame
parameterEstimates(fit1)
############end R-code############
For more details and examples on the lavaan package,
see http://users.ugent.be/~yrosseel/lavaan/lavaanIntroduction.pdf
PART 2
How to Run Path Analysis with R
For this path analysis practice exercise, I continue to use the election data I used in the previous post. Instead of using datasets that I am not quite familiar with, using my own data really helps make my learning experience more relatable and personal.
In other words, age, sex, race, education, and income are specified to
predict party affiliation and political interest, respectively. Then, party
affiliation and political interest, respectively, are specified to predict
support for Trump. It might be that people who are older, males,
Caucasians, less educated, and less rich may lean toward Republicans
and have more interest in this election, which in turn predict support for
Donald Trump.
Disclaimer: the model I outline here is not based on any theory. It’s more
of a post-hoc model. When I first ran path analysis, I included political
ideology in the model as another mediating variable. But, this model did
not fit the data well. For some reason, when I removed political ideology,
the model fit the data well. So, I just decided to use the above model for
pretty much this reason. It’s always good to see good fit indices!
Now, with lavaan, it looks like I have to first store a model in a new
variable, which I label model1. Each mediator and the final outcome
variable are placed on the left-hand side, followed by tilde (~). Then, I
place predictors of each mediator and the outcome variable on the right-
hand side. A model is enveloped with single quotes (‘ ‘). So, I type the
following
model1 <- 'party ~ age + sex + race + educ + inco
           inter ~ age + sex + race + educ + inco
           suppt ~ party + inter'
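The model syntax alone estimates nothing; results1 comes from a sem() call on the election data, which is not shown in the original and not publicly available. The sketch below therefore simulates stand-in data purely so the code runs end to end; every simulated value and coefficient here is fabricated, not the real data:

```r
library(lavaan)

model1 <- 'party ~ age + sex + race + educ + inco
           inter ~ age + sex + race + educ + inco
           suppt ~ party + inter'

# fabricated stand-in for the (unavailable) election data
set.seed(123)
n <- 630
dat <- data.frame(age  = rnorm(n),
                  sex  = rbinom(n, 1, 0.5),
                  race = rbinom(n, 1, 0.5),
                  educ = rnorm(n),
                  inco = rnorm(n))
dat$party <- -1.2 * dat$race + 0.15 * dat$educ + rnorm(n, sd = 1.9)
dat$inter <-  0.18 * dat$age + 0.10 * dat$inco + rnorm(n, sd = 1.7)
dat$suppt <- -0.57 * dat$party + 0.15 * dat$inter + rnorm(n)

# fit the path model and store the results
results1 <- sem(model1, data = dat)
```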
When I run this code, R will store the results in another new variable that I create, results1. To see the results stored in results1, I use the summary() function and enter results1 inside the parentheses.

summary(results1)
> summary(results1)

                                        Used  Total
Number of observations                   630    677

Estimator                                 ML
Minimum Function Test Statistic       13.542
Degrees of freedom                         6
P-value (Chi-square)                   0.035

Parameter Estimates:

Information                         Expected
Standard Errors                     Standard

Regressions:
                Estimate  Std.Err  z-value  P(>|z|)
  party ~
    age            0.056    0.047    1.185    0.236
    sex            0.239    0.160    1.491    0.136
    race          -1.188    0.185   -6.408    0.000
    educ           0.145    0.051    2.819    0.005
    inco          -0.125    0.039   -3.228    0.001
  inter ~
    age            0.181    0.044    4.144    0.000
    sex           -0.185    0.148   -1.248    0.212
    race          -0.034    0.171   -0.198    0.843
    educ           0.018    0.047    0.387    0.699
    inco           0.104    0.036    2.900    0.004
  suppt ~
    party         -0.567    0.022  -25.368    0.000
    inter          0.145    0.025    5.878    0.000

Variances:
                Estimate  Std.Err  z-value  P(>|z|)
   .party          3.479    0.196   17.748    0.000
   .inter          2.973    0.168   17.748    0.000
   .suppt          1.199    0.068   17.748    0.000
These results indicate that respondents who were Caucasians and who
had higher income were stronger Republicans. In contrast, those who
had higher education were stronger Democrats.
Age and income had a positive relationship with political interest. Older
and richer respondents showed higher levels of interest in politics. Then,
stronger Republicans and more politically interested individuals more
strongly supported Trump.
Beyond the coefficients, I also need to know how much variance in support for Trump this model accounts for.
Finally, the above result only shows one model fit index: the Minimum Function Test Statistic (chi-square). But chi-square tends to be sensitive to sample size. When the sample is large, chi-square tends to be significant (indicating that the model is significantly different from the data, instead of approximating the data), so I may end up making an erroneous conclusion. I need to see other model fit indices.
> summary(results1, standardized=TRUE, fit.measures=TRUE,
          rsq=TRUE, modindices=TRUE)
                                        Used  Total
Number of observations                   630    677

Estimator                                 ML
Minimum Function Test Statistic       13.542
Degrees of freedom                         6
P-value (Chi-square)                   0.035

Model test baseline model:

Minimum Function Test Statistic      566.979
Degrees of freedom                        18
P-value                                0.000

User model versus baseline model:

Comparative Fit Index (CFI)            0.986
Tucker-Lewis Index (TLI)               0.959

Loglikelihood and Information Criteria:

Loglikelihood user model (H0)      -7954.627
Number of free parameters                 15
Akaike (AIC)                       15939.254
Bayesian (BIC)                     16005.940

Root Mean Square Error of Approximation:

RMSEA                                  0.045
P-value RMSEA <= 0.05                  0.558

Standardized Root Mean Square Residual:

SRMR                                   0.019

Parameter Estimates:

Information                         Expected
Standard Errors                     Standard

Regressions:
  party ~ ...
  inter ~ ...
  suppt ~ ...
  [same estimates as in the first summary, now with Std.lv and Std.all columns]

R-Square:
                Estimate
  party            0.086
  inter            0.052
  suppt            0.519

Modification Indices:
  [truncated in the original output]
To test the indirect effects with lavaan, apparently I need to give labels
to each parameter and use those labels in a model syntax. Then, I use
the “:=” operator to define new parameters. So, I type the following.
model2 <- 'party ~ a1*age + a2*sex + a3*race + a4*educ + a5*inco
           inter ~ a6*age + a7*sex + a8*race + a9*educ + a10*inco
           suppc ~ b1*party + b2*inter + c1*age + c2*sex + c3*race + c4*educ + c5*inco
           a1b1 := a1*b1
           a2b1 := a2*b1
           a3b1 := a3*b1
           a4b1 := a4*b1
           a5b1 := a5*b1
           a6b2 := a6*b2
           a7b2 := a7*b2
           a8b2 := a8*b2
           a9b2 := a9*b2
           a10b2 := a10*b2
           total := c1 + c2 + c3 + c4 + c5 + (a1*b1) + (a2*b1) + (a3*b1) + (a4*b1) + (a5*b1) +
                    (a6*b2) + (a7*b2) + (a8*b2) + (a9*b2) + (a10*b2)'
> summary(results2, standardized=TRUE, fit.measures=TRUE, rsq=TRUE)

                                        Used  Total
Number of observations                   630    677

Estimator                                 ML
Minimum Function Test Statistic        0.100
Degrees of freedom                         1
P-value (Chi-square)                   0.752

Model test baseline model:

Minimum Function Test Statistic      576.823
Degrees of freedom                        18
P-value                                0.000

User model versus baseline model:

Comparative Fit Index (CFI)            1.000
Tucker-Lewis Index (TLI)               1.029

Loglikelihood and Information Criteria:

Number of free parameters                 20
Akaike (AIC)                       15922.578
Bayesian (BIC)                     16011.492

Root Mean Square Error of Approximation:

RMSEA                                  0.000
P-value RMSEA <= 0.05                  0.884

Standardized Root Mean Square Residual:

SRMR                                   0.002

Parameter Estimates:

Information                         Expected
Standard Errors                     Standard

Regressions:
  party ~ ...
  inter ~ ...
  suppc ~ ...
  [estimates truncated in the original output]

Variances:
  [truncated in the original output]

R-Square:
                Estimate
  party            0.086
  inter            0.052
  suppc            0.539

Defined Parameters:
When I look at the very bottom of the output, I see the statistical significance of each indirect effect specified. For example, the indirect effect of age on support for Trump through party affiliation (a1b1) is not significant (p = .236). However, education has an indirect effect on support for Trump through party affiliation (b = .083, p = .005).
    total := c1 + c2 + c3 + c4 + c5 + (a1*b1) + (a2*b1) + (a3*b1) + (a4*b1) + (a5*b1) +
             (a6*b2) + (a7*b2) + (a8*b2) + (a9*b2) + (a10*b2)
    total   -0.656   0.165   -3.982   0.000   -0.979   -0.333   -0.656   -0.151   -0.416
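As a side check (my addition, not in the original post), each defined indirect effect is simply the product of its component path coefficients, so it can be verified by hand from the rounded estimates in the first summary (educ -> party: a4 = 0.145; party -> suppt: b1 = -0.567):

```r
a4 <- 0.145    # educ  -> party, rounded estimate from the output above
b1 <- -0.567   # party -> suppt, rounded estimate from the output above
a4b1 <- a4 * b1
round(a4b1, 3)
```

The magnitude agrees with the reported .083 up to rounding of the displayed coefficients.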
Wrapping Up
I think there are some other things I should still do, such as analyzing localized residuals and replicating the results with other existing SEM packages. But with lavaan, I learned I can do many things.