
Week 9


Methods of Multivariate Analysis

There are three categories of analysis:

• Univariate analysis, which examines one variable at a time
• Bivariate analysis, which analyzes two variables simultaneously
• Multivariate analysis, which analyzes more than two variables simultaneously

Univariate and bivariate analyses can be treated as particular cases of multivariate analysis.

Multivariate analysis techniques can be divided into two categories:

• Dependence techniques: A dependence technique may be defined as one in which a variable or set of variables is identified as the dependent variable to be predicted or explained by other variables, known as independent variables.
Examples: the Generalized Linear Model (which includes Simple Linear Regression, Multiple Linear Regression, Logistic Regression, Poisson Regression, and Negative Binomial Regression), Canonical Correlation Analysis, Analysis of Variance (ANOVA), Multivariate Analysis of Variance (MANOVA), the log-linear model for contingency tables, etc.

• Interdependence techniques: An interdependence technique is one in which no single variable or group of variables is defined as independent or dependent. Instead, the procedure involves the simultaneous analysis of all variables in the set.
Examples: Principal Component Analysis, Factor Analysis, Cluster Analysis, Multidimensional Scaling (MDS), Correspondence Analysis, etc.


Regression analysis measures the expected change in the dependent (response or endogenous) variable for a unit increase in an independent variable.
Regression analysis is used for one of two purposes:
• predicting the value of the dependent variable when information about the independent variables is known
• estimating the effect of an independent variable on the dependent variable.

Definition
Dependent variable: the variable we wish to explain or predict.
Independent/exogenous/explanatory variable: the variable
we use to explain or predict the dependent variable.

Types of Regression Models


There are numerous regression analysis approaches available
for making predictions. Various parameters, including the
number of independent variables, the form of the regression line,
and the type of dependent variable, determine the choice of
technique for regression analysis.

1. Simple Linear Regression (a linear relationship exists between the dependent variable and one independent variable)
2. Multiple Linear Regression (a linear relationship exists between the dependent variable and more than one independent variable)
3. Binary Logistic Regression (the dependent variable is binary)
4. Multicategory Logistic Regression (the dependent variable is multicategory)
5. Polynomial Regression (a nonlinear relationship exists between the dependent and independent variables)
6. Ridge Regression (the independent variables are highly correlated)
7. Quantile Regression (when outliers, high skewness, and heteroscedasticity exist in the data)
8. Bayesian Linear Regression
9. Principal Components Regression (for many independent variables, or when multicollinearity exists in the data)
10. Partial Least Squares Regression (many independent variables with a high probability of multicollinearity between the variables)
11. Elastic Net Regression (suitable for strongly correlated data)
12. Support Vector Regression (suitable for linear and nonlinear models)
13. Ordinal Regression (for an ordinal dependent variable)
14. Poisson Regression (when the dependent variable is count data)
15. Negative Binomial Regression (used for overdispersed count data)
16. Quasi-Poisson Regression (used for overdispersed count data)
17. Tobit Regression (when censoring exists in the dependent variable)
18. Jackknife Regression (a resampling procedure)
19. Ecological Regression (used to study predicted human behaviour within a population data set), etc.


1. Simple linear regression model:

(Simple – the model contains only one independent variable; linear – the power of the regression coefficient is 1.)
The population simple linear regression model can be stated as

$y = \alpha + \beta X + \varepsilon$, where $\varepsilon$ (the random error) $\sim N(0, \sigma^2)$.

Thus, $y \sim N(\alpha + \beta X,\, \sigma^2)$, and X is fixed for a particular analysis.
Before estimating the model, we must visualize the data to check
• the linear relationship between X and Y,
• the normality of Y, and
• the presence of outliers.
If Y is not normally distributed, we can use the Box-Cox transformation to transform Y to normality.
Using the method of least squares, we can estimate the model as

$\hat{y} = a + bX$, where $a = \bar{y} - b\bar{x}$ and $b = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$

[Least squares principle: minimize the sum of squared errors (SSE), $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - (a + bx_i)\right)^2$, with respect to a and b, where $\hat{y}$ is an estimate of $E(y \mid X = x) = \alpha + \beta x$, the conditional mean of y given X = x (fixed).]
The accuracy of the estimated model can be evaluated by the coefficient of determination ($R^2$) or its adjusted version ($\bar{R}^2$), which varies from 0 to 1.
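As an illustration, the formulas above can be applied directly. The following is a minimal Python sketch with made-up data (any statistical package gives the same estimates):

import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares slope and intercept from the formulas above.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

# R^2 = 1 - SSE/TSS, the proportion of variation in y explained by x.
sse = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2 = 1 - sse / tss
print(f"a = {a:.4f}, b = {b:.4f}, R^2 = {r2:.4f}")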
(When should we use a multiple linear regression model?)
If the coefficient of determination is unsatisfactory (low), we
incorporate/add meaningful and relevant independent variables
into the model to create the Multiple Linear Regression Model
(MLRM).


2. Multiple Linear Regression Model (MLRM)

The population multiple regression model with k independent variables can be written as

$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_k X_k + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$.

In matrix form, $y = X\beta + \varepsilon$, $\left[\, y_{n \times 1} = X_{n \times (k+1)}\, \beta_{(k+1) \times 1} \,\right]$, where n is the sample size or number of observations.

Using the method of least squares, the estimated regression model can be written as

$\hat{y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + \cdots + b_k X_k$

In matrix form, $\hat{y} = X\hat{\beta}$ (or $\hat{y} = Xb$), where $b = \hat{\beta} = (X'X)^{-1} X'y$.
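A minimal numpy sketch of this matrix solution, using simulated data with assumed (hypothetical) coefficients:

import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 2
# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, 2.0, -0.5])            # assumed for the simulation
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations (X'X) b = X'y; numerically safer than
# forming the inverse (X'X)^{-1} explicitly.
b = np.linalg.solve(X.T @ X, X.T @ y)
print("b =", b.round(3))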

The total variation of y can be expressed in terms of the variation due to regression and the variation due to error. Mathematically,

$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Total SS (TSS) = Regression SS (RSS) + Error SS (ESS)
The above expression is useful for constructing the ANOVA table and testing hypotheses.

Steps for testing the validity or accuracy of the estimated model:

Test whether all independent variables simultaneously influence the target variable (Global Test).
We require the following ANOVA table to test

$H_0: \beta_1 = \beta_2 = \beta_3 = \cdots = \beta_k = 0$ against
$H_1$: at least one regression coefficient $(\beta_i)$ is non-zero.


Under $H_0$, the test statistic is

$F = \dfrac{MSSR}{MSSE} = \dfrac{RSS/k}{ESS/(n-k-1)} \sim F(k,\, n-k-1)$

ANOVA table

SV          df         SS                    MSS                   F
Regression  k          RSS = Σ(ŷᵢ − ȳ)²      MSSR = RSS/k          F = MSSR/MSSE
Error       n−(k+1)    ESS = Σ(yᵢ − ŷᵢ)²     MSSE = ESS/(n−k−1)
Total       n−1        TSS = Σ(yᵢ − ȳ)²

Abbreviations: SV – Source of variation, df – degrees of freedom, SS – Sum of Squares, MSS – Mean Sum of Squares, RSS – Regression Sum of Squares, ESS – Error Sum of Squares, TSS – Total Sum of Squares.

We do not proceed further if $H_0$ is accepted. We continue the investigation/analysis if at least one regression coefficient is non-zero, i.e., at least one independent variable has a linear influence on the dependent variable. (Use the p-value of the ANOVA table to make the decision.)
The dependent variable is continuous in multiple linear regression because of the normality of the error term.


Note that a scientific calculator can estimate a simple linear regression model and a correlation coefficient, whereas multiple regression and logistic regression require a computer.
If the global test above is rejected, we test each regression coefficient, or a subset of the coefficients.
To test
$H_0: \beta_i = 0$ ($X_i$ has no linear influence on y) against
$H_a: \beta_i \neq 0$ ($X_i$ has a linear influence on y),
under $H_0$ the test statistic is

$t = \dfrac{b_i - 0}{se(b_i)} \sim t(n-k-1)$

We do not reject $H_0$ if the p-value > significance level. (The p-value can be obtained directly from computer output but not from a statistical table.)
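In Python's statsmodels, for example, the global F test and the per-coefficient t tests are reported together. A minimal sketch with simulated data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = 3 + 1.5 * X[:, 0] + rng.normal(size=40)   # second predictor has no real effect

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.fvalue, model.f_pvalue)    # global test of H0: all slopes are 0
print(model.tvalues, model.pvalues)    # individual t statistics and p-values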


Please study the file SimpleLinearRegEx1.pptx for a better understanding.

Example 1 (Simple Linear Regression Analysis):

A real estate agent wishes to examine the relationship between a home's selling price (measured in $1000s) and its size (measured in square feet).

House Price in $1000s (Y)    Square Feet (X)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700

Questions:

a) Identify the dependent and independent variables.
b) Construct a scatter diagram and comment on it.
c) Determine and interpret the correlation coefficient and the coefficient of determination.
d) Estimate the regression equation of the selling price of a home on its size.
e) Predict the price for a house with 2000 square feet.
f) Test the significance of the linear relationship between price and size at a 5% significance level.
g) Construct a 95% confidence interval for the population correlation coefficient.
h) Test the linear influence of size on price at a 5% significance level.

MLR_Logistic.Docx
10

Solution: (A scientific calculator can be used for this simple linear regression problem.)

The majority of the above questions can be answered from standard computer output (not reproduced here).
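As a substitute for that output, a short Python sketch (one of many possible tools) that reproduces the key quantities from the data table above:

import numpy as np
from scipy import stats

# Example 1 data: selling price (Y, $1000s) and size (X, square feet).
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])
sqft = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])

res = stats.linregress(sqft, price)                  # price regressed on size
print(f"a = {res.intercept:.3f}, b = {res.slope:.4f}")
print(f"r = {res.rvalue:.4f}, R^2 = {res.rvalue ** 2:.4f}, p = {res.pvalue:.4f}")

# e) Predicted price (in $1000s) for a 2000-square-foot house.
print(f"prediction: {res.intercept + res.slope * 2000:.2f}")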


Please study the file MultipleLinearRegEx2.pptx for a better understanding.

Example 2 (Multiple Linear Regression Analysis):

A distributor of frozen dessert pies wants to evaluate factors thought to influence demand. Data are collected for 15 weeks on pie sales (units per week), price (in $), and advertising cost (in $100s).

Week   Pie Sales (Y)   Price $ (X1)   Advertising $100s (X2)
1      350             5.5            3.3
2      460             7.5            3.3
3      350             8              3
4      430             8              4.5
5      350             6.8            3
6      380             7.5            4
7      430             4.5            3
8      470             6.4            3.7
9      450             7              3.5
10     490             5              4
11     340             7.2            3.5
12     300             7.9            3.2
13     440             5.9            4
14     450             5              3.5
15     300             7              2.7

a) Identify the dependent and independent variables.
b) Construct scatter diagrams to check linearity and the presence of outliers.
c) Estimate the regression equation of pie sales on price and
advertisement cost. Interpret the estimated coefficients.
d) Predict the pie sales when the price per unit is $6.35, and the
advertisement cost is $4100.


e) Test the significance of the linear relationship between pie sales and price at a 5% significance level.
f) Test the joint influence of price and advertising on pie sales.
g) Construct a 95% confidence interval for the population regression coefficient of pie sales on price per unit.

The majority of the above questions can be answered from standard computer output (not reproduced here).
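A minimal statsmodels sketch fitted to the table above gives the same quantities:

import numpy as np
import statsmodels.api as sm

# Example 2 data from the table above.
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300])        # Y
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4,
                  7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])        # X1
advert = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                   3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])       # X2

X = sm.add_constant(np.column_stack([price, advert]))
model = sm.OLS(sales, X).fit()
print(model.params)                   # c) b0, b1 (price), b2 (advertising)
print(model.fvalue, model.f_pvalue)   # f) joint (global) test
print(model.conf_int(alpha=0.05))     # g) 95% confidence intervals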

The basic framework of the above regression analyses (SLRM and MLRM) is the Classical Linear Regression Model (CLRM). The CLRM is based on a set of assumptions; Gujarati (2008) outlined about ten of them. One assumption is that each error term $\varepsilon_i$ is independently and normally distributed with mean 0 and constant variance $\sigma^2$ (identical). That is,

$\varepsilon_i \sim NIID(0, \sigma^2)$

(NIID means normally, independently and identically distributed.)
The above assumption implies that the dependent variable is also normally distributed:

$y \sim NIID(X\beta, \sigma^2)$,


where $X\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$.
If at least one of the assumptions of normality, independence, or constant variance is violated, we use a generalized linear model (GLM) (Agresti, 2019, Chapter 3). Simple linear regression and multiple linear regression are particular cases of the GLM.

We use a generalized linear model for different types of dependent variable. A short list is given below:

Types of Generalized Linear Models

Type/distribution of dependent     Independent variable/          Model
variable (Random Component)        Systematic Component
Continuous/Normal                  Continuous (one                Simple Linear Regression
                                   independent var)
Continuous/Normal                  Continuous (more than          Multiple Linear Regression
                                   one independent var)
Continuous/Normal                  Categorical                    Analysis of Variance
Continuous/Normal                  Mixed                          Analysis of Covariance
Binary/Binomial                    Mixed                          Binary Logistic Regression
Multicategory/Multinomial          Mixed                          Multinomial Logistic Regression
Count/Poisson                      Mixed                          Loglinear Model
Count with overdispersion/         Mixed                          Negative Binomial Regression
Negative Binomial
Ordinal/cumulative                 Mixed                          Cumulative Logistic Model

The simple linear, multiple linear, and binary logistic regression models are particular cases of the GLM.
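In software, the rows of this table correspond to different GLM "families". A minimal statsmodels sketch with simulated data (ordinal models are fitted separately, e.g. with statsmodels' OrderedModel):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(100, 2)))

# Simulated responses for three rows of the table (coefficients assumed).
y_norm = X @ [1.0, 0.5, -0.3] + rng.normal(size=100)                # Normal
y_bin = rng.binomial(1, 1 / (1 + np.exp(-X @ [0.2, 1.0, -1.0])))    # Binomial
y_count = rng.poisson(np.exp(X @ [0.1, 0.4, 0.2]))                  # Poisson

sm.GLM(y_norm, X, family=sm.families.Gaussian()).fit()    # linear regression
sm.GLM(y_bin, X, family=sm.families.Binomial()).fit()     # binary logistic
sm.GLM(y_count, X, family=sm.families.Poisson()).fit()    # Poisson/loglinear
# Overdispersed counts: family=sm.families.NegativeBinomial()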


BINARY LOGISTIC REGRESSION

A binary dependent outcome violates the assumption of normality. In this situation, we use binary logistic regression, as per the above discussion.
Despite its name, logistic regression is a method of classification. It is conceptually similar to linear regression.
Examples of binary dependent outcomes:
• The patient survives the operation or does not.
• The accused is convicted or is not.
• The customer makes a purchase or does not.
• The marriage lasts at least five years or does not.
Using the technique of GLM, the logistic regression model with k explanatory variables can be expressed in the following form:

$\ln\left(\dfrac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k = X\beta$

Thus, $p = P(Y = 1 \mid x) = E(Y \mid x) = \dfrac{e^{X\beta}}{1 + e^{X\beta}} = \dfrac{1}{1 + e^{-X\beta}}$.

Note that logistic regression is nonlinear in the regression coefficients.

$\text{Odds} = \dfrac{p}{1-p} = \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k) = \exp(\beta_0) \times \exp(\beta_1 x_1) \times \exp(\beta_2 x_2) \times \cdots \times \exp(\beta_k x_k)$,

where $p = \Pr(Y = 1 \mid x)$ and $1 - p = \Pr(Y = 0 \mid x)$.


Defn: The odds is the ratio of the probability of an event happening to the probability of it not happening.

The higher the probability, the greater the odds.

The sigmoid function $y = \dfrac{1}{1 + e^{-X\beta}}$ (so named because it looks like an S) is also called the logistic function. It takes a real value and maps it to the range [0, 1]. It is nearly linear around 0, but extreme values get squashed toward 0 or 1.

For given values of the x's, we can predict p. If p > 0.5, the observation belongs to group 1; otherwise, it belongs to group 0.
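A minimal Python sketch of the sigmoid and this 0.5-cutoff classification rule (function names are illustrative):

import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1); nearly linear around 0.
    return 1 / (1 + np.exp(-z))

def predict_group(X, beta, cutoff=0.5):
    # p > 0.5 is equivalent to X @ beta > 0 (positive log odds).
    p = sigmoid(X @ beta)
    return (p > cutoff).astype(int)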

Note that the method of least squares is used in MLR, whereas maximum likelihood is used in logistic regression. We require a computer to estimate each model listed in the GLM table.
In terms of log odds, logistic regression behaves like ordinary linear regression.


• The exponentiated logistic regression coefficients are odds ratios.
• When $x_k$ is increased by one unit and all other independent variables are held constant, the odds of Y = 1 are multiplied by $e^{\beta_k}$.
• Another way of writing $e^{a+bX}$ is $e^a (e^b)^X$. That means that a one-unit increase in X multiplies the odds by $e^b$.

The goodness of fit and accuracy of the binary logistic regression model

In linear regression, we check adjusted $R^2$, the F statistic, MAE, and RMSE to evaluate model fit and accuracy.
Logistic regression employs a different set of metrics, because here we deal with probabilities and categorical values. The following evaluation metrics are used for logistic regression:
1. Akaike Information Criterion (AIC): the model with the lowest AIC is relatively better.
2. Null deviance and residual deviance: the deviance of an observation is computed as −2 times its log-likelihood. The larger the difference between the null and residual deviance, the better the model. (Also, a model with lower null deviance explains the data well and is better; the lower the residual deviance, the better the model.)
In practice, AIC is usually preferred over deviance for evaluating model fit.

3. Confusion Matrix

The confusion matrix is the most commonly used metric for evaluating classification models. The skeleton of a confusion matrix looks like this:
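(The original skeleton figure is not reproduced here; the standard 2×2 layout, with positive class = 1, is as follows.)

             Predicted: 1            Predicted: 0
Actual: 1    True Positive (TP)      False Negative (FN)
Actual: 0    False Positive (FP)     True Negative (TN)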


The confusion matrix avoids "confusion" by tabulating the actual and predicted values. The table above shows the positive class = 1 and the negative class = 0. The following metrics can be derived from a confusion matrix:

$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$ (the overall predictive accuracy of the model)

True Positive Rate, $TPR = \dfrac{TP}{TP + FN}$. It indicates how many positive values, out of all the positive values, have been correctly predicted. TPR = 1 − False Negative Rate. It is also known as Sensitivity or Recall.

False Positive Rate, $FPR = \dfrac{FP}{FP + TN}$. It indicates how many negative values, out of all the negative values, have been incorrectly predicted. FPR = 1 − True Negative Rate.


True Negative Rate, $TNR = \dfrac{TN}{TN + FP}$. It indicates how many negative values, out of all the negative values, have been correctly predicted. It is also known as Specificity.

False Negative Rate, $FNR = \dfrac{FN}{FN + TP}$. It indicates how many positive values, out of all the positive values, have been incorrectly predicted.

$\text{Precision} = \dfrac{TP}{TP + FP}$. It indicates how many values, out of all the predicted positive values, are actually positive.

The F-score is the harmonic mean of precision and recall. It lies between 0 and 1; the higher the value, the better the model. It is formulated as

$F\text{-score} = \dfrac{2}{\dfrac{1}{\text{Precision}} + \dfrac{1}{\text{Recall}}} = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
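A minimal Python sketch computing all of the above metrics from the four cell counts (the function and argument names are illustrative):

def classification_metrics(tp, tn, fp, fn):
    # Derive the standard metrics from confusion-matrix cell counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)          # sensitivity / recall
    fpr = fp / (fp + tn)          # 1 - specificity
    tnr = tn / (tn + fp)          # specificity
    fnr = fn / (fn + tp)
    precision = tp / (tp + fp)
    f_score = 2 * precision * tpr / (precision + tpr)
    return {"accuracy": accuracy, "TPR": tpr, "FPR": fpr, "TNR": tnr,
            "FNR": fnr, "precision": precision, "F-score": f_score}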


Example (Daniel, p. 573, Example 11.4.2):

Cardiac rehabilitation programs offer "information, support, and monitoring for return to activities, symptom management, and risk factor modification."

The researchers conducted a study to identify factors associated with participation in such programs among women.

The following data (Table 11.4.3 in the text) are the ages of 185 women
discharged from a hospital in Australia who met eligibility criteria
involving discharge for myocardial infarction, artery bypass surgery,
angioplasty, or stent.

We wish to use these data to develop a binary logistic regression model for the relationship between age in years (independent variable, X) and participation in a cardiac rehabilitation program (dependent variable, Y; ATT = 1 if participated, and ATT = 0 if not).
We also wish to know whether we may use the results of our analysis to predict the likelihood of participation by a woman if we know her age.

TABLE 11.4.3 Ages of Women Participating and Not Participating in a Cardiac Rehabilitation Program
age att age att age att age att age att age att
50 0 71 0 73 0 75 0 41 1 69 1
59 0 69 0 68 0 68 0 64 1 66 1
42 0 78 0 72 0 81 0 46 1 57 1
50 0 69 0 59 0 74 0 65 1 60 1
34 0 74 0 64 0 65 0 50 1 63 1
49 0 86 0 78 0 81 0 61 1 63 1
67 0 49 0 68 0 62 0 64 1 56 1


44 0 63 0 67 0 85 0 59 1 70 1
53 0 63 0 55 0 84 0 73 1 70 1
45 0 72 0 71 0 39 0 73 1 63 1
79 0 64 0 80 0 52 0 65 1 63 1
46 0 72 0 75 0 67 0 67 1 65 1
62 0 79 0 69 0 82 0 60 1 67 1
58 0 75 0 80 0 84 0 69 1 68 1
70 0 70 0 79 0 79 0 61 1 84 1
60 0 73 0 71 0 81 0 79 1 69 1
67 0 66 0 69 0 74 0 66 1 78 1
64 0 75 0 78 0 85 0 68 1 69 1
62 0 73 0 75 0 92 0 61 1 79 1
50 0 71 0 71 0 69 0 63 1 83 1
61 0 72 0 69 0 83 0 70 1 67 1
69 0 69 0 77 0 82 0 68 1 47 1
74 0 76 0 81 0 85 0 59 1 57 1
65 0 60 0 78 0 82 0 64 1 66 1
80 0 79 0 76 0 80 0 62 1
69 0 78 0 84 0 74 1 74 1
77 0 62 0 74 0 50 1 61 1
61 0 73 0 59 0 55 1 69 1
72 0 46 0 81 0 66 1 76 1
67 0 57 0 74 0 49 1 71 1
73 0 53 0 77 0 55 1 61 1
75 0 40 0 59 0 73 1 46 1
(Data file: Logisticdata1.xlsx)
Partial SPSS output
Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      229.520a            .037                   .051

a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Confusion/Classification Table(a)

                                 Predicted ATT       Percentage
Observed                         0         1         Correct
Step 1   ATT          0          111       10        91.7
                      1          58        5         7.9
         Overall Percentage                          63.0

a. The cut value is .500
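As a check, the table's percentages can be reproduced directly from the four cell counts (reading ATT = 1 as the positive class, so TN = 111, FP = 10, FN = 58, TP = 5):

tp, tn, fp, fn = 5, 111, 10, 58
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 116/184 = 0.630 -> 63.0%
sensitivity = tp / (tp + fn)                 # 5/63   = 0.079 -> 7.9%
specificity = tn / (tn + fp)                 # 111/121 = 0.917 -> 91.7%
print(accuracy, sensitivity, specificity)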

Variables in the Equation

                    B        S.E.    Wald     df    Sig.    Exp(B)
Step 1a   AGE      -.038     .015    6.710    1     .010    .963
          Constant  1.875    .981    3.653    1     .056    6.519

From the above SPSS output, we can write the estimated binary logistic regression model as

$\hat{y}_i = \ln\left(\dfrac{\hat{p}_i}{1 - \hat{p}_i}\right) = \hat{\alpha} + \hat{\beta} x_i = 1.875 - 0.038 x_i$

The predicted probability of attending cardiac rehabilitation for a woman aged $x_i$ is

$\hat{p}_i = \dfrac{1}{1 + e^{-(1.875 - 0.038 x_i)}}$

For x = 57, $\hat{p} = \dfrac{1}{1 + e^{-(1.875 - 0.038 \times 57)}} = 0.427759$.

Since $\hat{p}_{57} = 0.427759 < 0.50$, a 57-year-old woman is predicted not to participate in the program.

For x = 37, $\hat{p} = \dfrac{1}{1 + e^{-(1.875 - 0.038 \times 37)}} = 0.615147$.

Since $\hat{p}_{37} = 0.615147 > 0.50$, a 37-year-old woman is predicted to participate in the program.
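A quick numerical check of these two probabilities, using the fitted coefficients (a minimal sketch):

import numpy as np

def p_hat(age):
    # Fitted model from the SPSS output: logit = 1.875 - 0.038 * age.
    return 1 / (1 + np.exp(-(1.875 - 0.038 * age)))

print(p_hat(57))   # 0.4278 < 0.5 -> predicted non-participant
print(p_hat(37))   # 0.6151 > 0.5 -> predicted participant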


Test: We can check the adequacy of the logistic model by testing the null hypothesis that the slope of the regression line (the coefficient of age, x) is zero. That is, we test
$H_0: \beta = 0$ versus the two-sided alternative $H_a: \beta \neq 0$.
Under the null hypothesis, the test statistic is

$W = \left(\dfrac{\hat{\beta}}{se(\hat{\beta})}\right)^2 \sim \chi^2_1$ (distributed as chi-square with 1 degree of freedom).

From the computer output, W = 6.710 with p-value 0.010 < 0.05.
Thus, we reject the null hypothesis at a 5% significance level.
The logistic regression coefficient is significant, and hence the logistic regression model is adequate. That is, the age of a woman influences her participation in the program.

Accuracy can be assessed from the confusion matrix, and various accuracy measures can be obtained from it.
It is observed from the Confusion/Classification table that only 63% of the data were correctly classified, with those participating in the rehabilitation program much more poorly classified than those who did not attend. The table shows the large number of ATT = 1 subjects who were misclassified as ATT = 0 by the model.

References:

Agresti, A. (2019). An Introduction to Categorical Data Analysis (Chapters 3 and 4). Wiley.


Daniel, W. W. (2013). Biostatistics: A Foundation for Analysis in the Health Sciences (Chapters 9-11). Wiley.

Gujarati, D. (2014). Econometrics by Example (Chapter 1). Palgrave.

Hair, J. F. (2019). Multivariate Data Analysis (Chapter 1). Cengage Learning.

Newbold, P. (2023). Statistics for Business and Economics. Pearson.
