Multiple Regression in SPSS
I.
STAT 314
The accompanying data are on y = profit margin of savings and loan companies in a given year, x1 = net revenues in that year, and x2 = number of savings and loan branch offices.

x1     x2     y        x1     x2     y        x1     x2     y
3.92   7298   0.75     3.42   6352   0.82     4.25   7546   0.72
3.61   6855   0.71     3.45   6361   0.75     4.41   7931   0.55
3.32   6636   0.66     3.58   6369   0.77     4.49   8097   0.63
3.07   6506   0.61     3.66   6546   0.78     4.70   8468   0.56
3.06   6450   0.70     3.78   6672   0.84     4.58   8717   0.41
3.11   6402   0.72     3.82   6890   0.79     4.69   8991   0.51
3.21   6368   0.77     3.97   7115   0.70     4.71   9179   0.47
3.26   6340   0.74     4.07   7327   0.68     4.78   9318   0.32
3.42   6349   0.90

a. Determine the multiple regression equation for the data.
b. Compute and interpret the coefficient of multiple determination, $R^2$.
c. At the 5% significance level, determine if the model is useful for predicting the response.
d. Create all partial plots to check Assumption 1 as well as to identify outliers and potential influential observations.
e. Obtain the residuals and create residual plots. Decide whether or not it is reasonable to consider that the assumptions for multiple regression analysis are met by the variables in question.
f. At the 5% significance level, does it appear that any of the predictor variables can be removed from the full model as unnecessary?
g. Obtain and interpret 95% confidence intervals for the slopes, $\beta_i$, of the population regression line that relates net revenues and number of branches to profit margin.
h. Are there any multicollinearity problems (i.e., are net revenues and number of branches collinear [estimating similar relationships/quantities])?
i. Obtain a point estimate for the mean profit margin with 3.5 net revenues and 6500 branches.
j. Test the alternative hypothesis that the mean profit margin with 3.5 net revenues and 6500 branches is greater than 0.70. Test at the 5% significance level.
k. Determine a 95% confidence interval for the mean profit margin with 3.5 net revenues and 6500 branches.
l. Find the predicted profit margin for my bank with 3.5 net revenues and 6500 branches.
m. Determine a 95% prediction interval for the profit margin for Dr. Street's bank with 3.5 net revenues and 6500 branches.
1. Enter the values of the three variables into SPSS.
2. Since we want to predict the profit margin for a bank with 3.5 net revenues and 6500 branches, enter 3.5 in the x1 column and 6500 in the x2 column of the data window after the last row. Enter a period (.) for the corresponding y value; this tells SPSS that we want a prediction for this case and that the case should be left out of all other computations. (See figure, below.)
3. Select Analyze > Regression > Linear.
4. Select Profit Margin as the dependent variable and Net Revenues and Number of Branches as the independent variables, and select the Backward method.
   Click Statistics, select Estimates and Confidence intervals for the regression coefficients, select Model fit and Collinearity diagnostics, and click Continue.
   Click Plots, select Normal Probability Plot and Produce all partial plots, and click Continue.
   Click Save, select Unstandardized predicted values and S.E. of mean predictions ($s_{\hat{y}}$), select Unstandardized and Studentized residuals, select Mean and Individual prediction intervals at the 95% level (or whatever level the problem requires), and click Continue.
   Click OK. (See the four figures, following.)
The output from this procedure is extensive and will be shown in parts in the following answers.
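As a rough cross-check outside SPSS, the full-model coefficients can be reproduced by ordinary least squares. The sketch below is an illustration only and is not part of the SPSS procedure: it uses plain Python, the helper names `solve` and `fit_ols` are made up for this example, and it solves the normal equations (X'X)b = X'y directly, which is adequate for a data set this small.

```python
# Data from the table above: x1 = net revenues, x2 = number of branches,
# y = profit margin (25 yearly observations).
x1 = [3.92, 3.61, 3.32, 3.07, 3.06, 3.11, 3.21, 3.26, 3.42,
      3.42, 3.45, 3.58, 3.66, 3.78, 3.82, 3.97, 4.07,
      4.25, 4.41, 4.49, 4.70, 4.58, 4.69, 4.71, 4.78]
x2 = [7298, 6855, 6636, 6506, 6450, 6402, 6368, 6340, 6349,
      6352, 6361, 6369, 6546, 6672, 6890, 7115, 7327,
      7546, 7931, 8097, 8468, 8717, 8991, 9179, 9318]
y = [0.75, 0.71, 0.66, 0.61, 0.70, 0.72, 0.77, 0.74, 0.90,
     0.82, 0.75, 0.77, 0.78, 0.84, 0.79, 0.70, 0.68,
     0.72, 0.55, 0.63, 0.56, 0.41, 0.51, 0.47, 0.32]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [mr - f * mc for mr, mc in zip(m[r], m[col])]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, resp):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, resp)) for i in range(k)]
    return solve(xtx, xty)


# Each design row is [1, x1, x2]; the leading 1.0 carries the intercept.
rows = [[1.0, a, c] for a, c in zip(x1, x2)]
b0, b1, b2 = fit_ols(rows, y)
print(f"y-hat = {b0:.3f} + ({b1:.3f}) x1 + ({b2:.4e}) x2")
```

The printed coefficients should agree with the B column of the SPSS Coefficients table up to rounding.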
a.
Determine the multiple regression equation for the data.

From the Coefficients table of the output (reproduced in part f), the regression equation is: $\hat{y} = 1.564 + 0.237x_1 - 0.0002491x_2$.

b.
Compute and interpret the coefficient of multiple determination, $R^2$.

Model Summary (b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .930(a)   .865       .853                5.330E-02

a. Predictors: (Constant), Number of Branches, Net Revenues
b. Dependent Variable: Profit Margin

The coefficient of multiple determination is 0.865; therefore, about 86.5% of the variation in the profit margin is explained by net revenues and number of branches for the savings and loan banks. The regression equation appears to be very useful for making predictions since the value of $R^2$ is close to 1.

c.
At the 5% significance level, determine if the model is useful for predicting the response.
ANOVA (b)

Model 1       Sum of Squares   df   F        Sig.
Regression    .402             2    70.661   .000(a)
Residual      6.250E-02        22
Total         .464             24

a. Predictors: (Constant), Number of Branches, Net Revenues
b. Dependent Variable: Profit Margin

Step 1: Hypotheses
$H_0: \beta_1 = \beta_2 = 0$
$H_a$: at least one $\beta_i \neq 0$
Step 2: Significance Level
$\alpha = 0.05$
Step 3: Rejection Region
Reject the null hypothesis if p-value $\leq 0.05$.
Step 4: Test Statistic and p-value (see above)
F = 70.661, p-value < 0.001
Step 5: Conclusion
Since p-value $< 0.001 \leq 0.05$, we shall reject the null hypothesis.
Step 6: State conclusion in words
At the $\alpha = 0.05$ level of significance, there exists enough evidence to conclude that at least one of the predictors is useful for predicting profit margin; therefore the model is useful.
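The R Square and F values used in parts b and c both come from splitting the total variation in y into an explained and an unexplained part. The following sketch recomputes them from the raw data under the usual definitions $R^2 = SSR/SST$ and $F = (SSR/k)/(SSE/(n-k-1))$. It is an illustration in plain Python with a hypothetical helper `solve`, not the SPSS computation itself, and it does not compute the p-value.

```python
x1 = [3.92, 3.61, 3.32, 3.07, 3.06, 3.11, 3.21, 3.26, 3.42,
      3.42, 3.45, 3.58, 3.66, 3.78, 3.82, 3.97, 4.07,
      4.25, 4.41, 4.49, 4.70, 4.58, 4.69, 4.71, 4.78]
x2 = [7298, 6855, 6636, 6506, 6450, 6402, 6368, 6340, 6349,
      6352, 6361, 6369, 6546, 6672, 6890, 7115, 7327,
      7546, 7931, 8097, 8468, 8717, 8991, 9179, 9318]
y = [0.75, 0.71, 0.66, 0.61, 0.70, 0.72, 0.77, 0.74, 0.90,
     0.82, 0.75, 0.77, 0.78, 0.84, 0.79, 0.70, 0.68,
     0.72, 0.55, 0.63, 0.56, 0.41, 0.51, 0.47, 0.32]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [mr - f * mc for mr, mc in zip(m[r], m[col])]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


n, k = len(y), 2                                   # observations, predictors
rows = [[1.0, a, c] for a, c in zip(x1, x2)]
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
b = solve(xtx, xty)

fitted = [sum(bj * rj for bj, rj in zip(b, r)) for r in rows]
y_bar = sum(y) / n
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual SS
sst = sum((yi - y_bar) ** 2 for yi in y)                 # total SS
ssr = sst - sse                                          # regression SS

r_squared = ssr / sst
f_stat = (ssr / k) / (sse / (n - k - 1))
print(f"SSR={ssr:.3f}  SSE={sse:.4f}  SST={sst:.3f}")
print(f"R^2={r_squared:.3f}  F={f_stat:.2f} on ({k}, {n - k - 1}) df")
```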
d.
Create all partial plots to check Assumption 1 as well as to identify outliers and potential influential observations.
[Figures: partial regression plots, dependent variable Profit Margin; one panel for Net Revenues and one for Number of Branches.]
Profit Margin appears to be linearly related to each of the predictor variables, with no visible potential outliers or influential observations (no points away from the main cluster of points); thus, Assumption 1 appears to be satisfied.

e.
Obtain the residuals and create residual plots (normal probability plot of residuals and scatterplots for each predictor variable versus the studentized residuals). Decide whether or not it is reasonable to consider that the assumptions for multiple regression analysis are met by the variables in question.

The residuals [res] and studentized residuals [sre] (as well as the predicted values [pre], the standard errors of each prediction [sep], the prediction interval endpoints [lici & uici], and the confidence interval endpoints [lmci & umci]) can be found in the data window.
[Figures: normal P-P plot of residuals, dependent variable Profit Margin; studentized residuals plotted against Net Revenues and against Number of Branches.]
The normal plot of the residuals shows the points close to a diagonal line; thus, Assumption 2 is satisfied. Each of the studentized residual plots shows a random scatter of points with constant variability; thus, Assumption 3 is met.
f.
At the 5% significance level, does it appear that any of the predictor variables can be removed from the full model as unnecessary?
Coefficients

(B and Std. Error are the unstandardized coefficients, Beta is the standardized coefficient, and Lower/Upper Bound give the 95% confidence interval for B.)

Model 1              B            Std. Error   Beta     t        Sig.   Lower Bound   Upper Bound
(Constant)           1.564        .079                  19.705   .000   1.400         1.729
Net Revenues         .237         .056         .987     4.269    .000   .122          .352
Number of Branches   -2.491E-04   .000         -1.797   -7.772   .000   -.0003155     -.0001826
Test for net revenues, assuming that number of branches is included in the model:

Step 1: Hypotheses
$H_0: \beta_1 = 0$ (net revenue is not useful for predicting profit margin)
$H_a: \beta_1 \neq 0$ (net revenue is useful for predicting profit margin)
Step 2: Significance Level
$\alpha = 0.05$
Step 3: Rejection Region
Reject the null hypothesis if p-value $\leq 0.05$.
Step 4: Test Statistic and p-value (see above)
T = 4.269, p-value < 0.001
Step 5: Conclusion
Since p-value $< 0.001 \leq 0.05$, we shall reject the null hypothesis.
Step 6: State conclusion in words
At the $\alpha = 0.05$ level of significance, there exists enough evidence to conclude that the slope of the net revenue variable is not zero and, hence, that net revenues is useful (with number of branches) as a predictor of profit margin for savings and loan banks.

Test for number of branches, assuming that net revenue is included in the model:

Step 1: Hypotheses
$H_0: \beta_2 = 0$ (number of branches is not useful for predicting profit margin)
$H_a: \beta_2 \neq 0$ (number of branches is useful for predicting profit margin)
Step 2: Significance Level
$\alpha = 0.05$
Step 3: Rejection Region
Reject the null hypothesis if p-value $\leq 0.05$.
Step 4: Test Statistic and p-value (see above)
T = -7.772, p-value < 0.001
Step 5: Conclusion
Since p-value $< 0.001 \leq 0.05$, we shall reject the null hypothesis.
Step 6: State conclusion in words
At the $\alpha = 0.05$ level of significance, there exists enough evidence to conclude that the slope of the number of branches variable is not zero and, hence, that number of branches is useful (with net revenues) as a predictor of profit margin for savings and loan banks.
g.
Obtain and interpret 95% confidence intervals for the slopes, $\beta_i$, of the population regression line that relates net revenues and number of branches to profit margin.

We are 95% confident that the slope for net revenues is somewhere between 0.122 and 0.352. In other words, we are 95% confident that for every single-unit increase in net revenue, with the number of branches held fixed, the average profit margin increases by between 0.122 and 0.352.

We are 95% confident that the slope for number of branches is somewhere between -0.0003155 and -0.0001826. In other words, we are 95% confident that for every additional branch, with net revenues held fixed, the average profit margin decreases by between 0.0001826 and 0.0003155.
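The t statistics in part f and the intervals in part g both follow from the coefficient standard errors: $T = b_j / SE(b_j)$ and the interval is $b_j \pm t_{0.025,22} \cdot SE(b_j)$. The sketch below recomputes them from the raw data as an illustration. It assumes the standard formula $SE(b_j) = s\sqrt{[(X'X)^{-1}]_{jj}}$, takes the critical value 2.074 from a t table rather than computing it, and uses a hypothetical helper `solve`.

```python
x1 = [3.92, 3.61, 3.32, 3.07, 3.06, 3.11, 3.21, 3.26, 3.42,
      3.42, 3.45, 3.58, 3.66, 3.78, 3.82, 3.97, 4.07,
      4.25, 4.41, 4.49, 4.70, 4.58, 4.69, 4.71, 4.78]
x2 = [7298, 6855, 6636, 6506, 6450, 6402, 6368, 6340, 6349,
      6352, 6361, 6369, 6546, 6672, 6890, 7115, 7327,
      7546, 7931, 8097, 8468, 8717, 8991, 9179, 9318]
y = [0.75, 0.71, 0.66, 0.61, 0.70, 0.72, 0.77, 0.74, 0.90,
     0.82, 0.75, 0.77, 0.78, 0.84, 0.79, 0.70, 0.68,
     0.72, 0.55, 0.63, 0.56, 0.41, 0.51, 0.47, 0.32]
T_CRIT = 2.074  # t(0.025, df = 22), read from a t table (assumption)


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [mr - f * mc for mr, mc in zip(m[r], m[col])]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


n, k = len(y), 2
rows = [[1.0, a, c] for a, c in zip(x1, x2)]
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
b = solve(xtx, xty)

sse = sum((yi - sum(bj * rj for bj, rj in zip(b, r))) ** 2
          for r, yi in zip(rows, y))
s2 = sse / (n - k - 1)                      # estimate of the error variance

out = {}
for j, name in enumerate(["(Constant)", "Net Revenues", "Number of Branches"]):
    unit = [1.0 if i == j else 0.0 for i in range(3)]
    c_jj = solve(xtx, unit)[j]              # j-th diagonal entry of (X'X)^-1
    se = (s2 * c_jj) ** 0.5
    out[name] = {"b": b[j], "se": se, "t": b[j] / se,
                 "lo": b[j] - T_CRIT * se, "hi": b[j] + T_CRIT * se}
    print(name, {key: f"{val:.6g}" for key, val in out[name].items()})
```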
h.
Are there any multicollinearity problems (i.e., are net revenues and number of branches collinear [estimating similar relationships/quantities])?
Coefficients

(B and Std. Error are the unstandardized coefficients, Beta is the standardized coefficient, and Lower/Upper Bound give the 95% confidence interval for B.)

Model 1              B            Std. Error   Beta     t        Sig.   Lower Bound   Upper Bound
(Constant)           1.564        .079                  19.705   .000   1.400         1.729
Net Revenues         .237         .056         .987     4.269    .000   .122          .352
Number of Branches   -2.491E-04   .000         -1.797   -7.772   .000   -.0003155     -.0001826
Since neither of the predictor variables has a variance inflation factor (VIF) greater than ten (both VIFs are 8.732), there are no apparent multicollinearity problems; in other words, there is no variable in the model that is measuring the same relationship/quantity as is measured by another variable or group of variables.
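With only two predictors, each variance inflation factor reduces to $VIF = 1/(1 - r^2)$, where r is the correlation between the two predictors, which is why both VIFs are equal. The sketch below illustrates that special case in plain Python; the function name `correlation` is made up for this example, and models with more predictors would instead regress each predictor on all the others.

```python
x1 = [3.92, 3.61, 3.32, 3.07, 3.06, 3.11, 3.21, 3.26, 3.42,
      3.42, 3.45, 3.58, 3.66, 3.78, 3.82, 3.97, 4.07,
      4.25, 4.41, 4.49, 4.70, 4.58, 4.69, 4.71, 4.78]
x2 = [7298, 6855, 6636, 6506, 6450, 6402, 6368, 6340, 6349,
      6352, 6361, 6369, 6546, 6672, 6890, 7115, 7327,
      7546, 7931, 8097, 8468, 8717, 8991, 9179, 9318]


def correlation(u, v):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    s_uv = sum((a - u_bar) * (c - v_bar) for a, c in zip(u, v))
    s_uu = sum((a - u_bar) ** 2 for a in u)
    s_vv = sum((c - v_bar) ** 2 for c in v)
    return s_uv / (s_uu * s_vv) ** 0.5


r = correlation(x1, x2)
tolerance = 1 - r ** 2        # share of one predictor NOT explained by the other
vif = 1 / tolerance
print(f"r = {r:.3f}, tolerance = {tolerance:.3f}, VIF = {vif:.3f}")
```

A VIF this close to the cutoff of ten reflects a strong correlation between net revenues and number of branches, even though it does not cross the threshold used above.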
The remaining parts will be completed using output from the Data window.
i.
Obtain a point estimate for the mean profit margin with 3.5 net revenues and 6500 branches.
The point estimate (pre_1) is 0.77567.
j.
Test the alternative hypothesis that the mean profit margin with 3.5 net revenues and 6500 branches is greater than 0.70. Test at the 5% significance level.
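Step 1: Hypotheses
$H_0: \mu_y = 0.70$ (when $x_1 = 3.5$ and $x_2 = 6500$)
$H_a: \mu_y > 0.70$ (when $x_1 = 3.5$ and $x_2 = 6500$)
Step 2: Significance Level
$\alpha = 0.05$
Step 3: Critical Value(s) and Rejection Region(s)
Critical value: $t_{\alpha,\,df = n-(k+1)} = t_{0.05,\,df = 22} = 1.72$ (the same value that is used for a 90% confidence interval with df = 22).
Reject the null hypothesis if $T \geq 1.72$ (or if p-value $\leq 0.05$).
Step 4: Test Statistic
$T = \dfrac{\hat{y} - 0.70}{s_{\hat{y}}} = \dfrac{0.77567 - 0.70}{s_{\hat{y}}} = 5.5436$, where $s_{\hat{y}}$ is the standard error of the mean prediction saved in the data window (sep_1).
Step 5: Conclusion
Since $5.5436 \geq 1.72$ (p-value $< 0.001 \leq 0.05$), we shall reject the null hypothesis.
Step 6: State conclusion in words
At the $\alpha = 0.05$ level of significance, there exists enough evidence to conclude that the mean profit margin with 3.5 net revenues and 6500 branches is greater than 0.70.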
k.
Determine a 95% confidence interval for the mean profit margin with 3.5 net revenues and 6500 branches.
We are 95% confident that the mean profit margin with 3.5 net revenues and 6500 branches is somewhere between 0.74736 (lmci_1) and 0.80398 (umci_1).
l.
Find the predicted profit margin for my bank with 3.5 net revenues and 6500 branches.
The predicted profit margin is 0.77567.
m.
Determine a 95% prediction interval for the profit margin for Dr. Street's bank with 3.5 net revenues and 6500 branches.
We are 95% certain that the profit margin for Dr. Street's bank with 3.5 net revenues and 6500 branches will be somewhere between 0.66156 (lici_1) and 0.88978 (uici_1).
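Parts i through m all rest on two standard errors at the point $x_0 = (1, 3.5, 6500)$: $s_{\hat{y}} = s\sqrt{x_0'(X'X)^{-1}x_0}$ for the mean response and $s\sqrt{1 + x_0'(X'X)^{-1}x_0}$ for an individual bank. The sketch below recomputes the saved SPSS columns (pre_1, sep_1, lmci_1, umci_1, lici_1, uici_1) for the added row under those formulas. It is an illustration, with the critical value 2.074 taken from a t table and a hypothetical helper `solve`.

```python
x1 = [3.92, 3.61, 3.32, 3.07, 3.06, 3.11, 3.21, 3.26, 3.42,
      3.42, 3.45, 3.58, 3.66, 3.78, 3.82, 3.97, 4.07,
      4.25, 4.41, 4.49, 4.70, 4.58, 4.69, 4.71, 4.78]
x2 = [7298, 6855, 6636, 6506, 6450, 6402, 6368, 6340, 6349,
      6352, 6361, 6369, 6546, 6672, 6890, 7115, 7327,
      7546, 7931, 8097, 8468, 8717, 8991, 9179, 9318]
y = [0.75, 0.71, 0.66, 0.61, 0.70, 0.72, 0.77, 0.74, 0.90,
     0.82, 0.75, 0.77, 0.78, 0.84, 0.79, 0.70, 0.68,
     0.72, 0.55, 0.63, 0.56, 0.41, 0.51, 0.47, 0.32]
T_CRIT = 2.074   # t(0.025, df = 22), read from a t table (assumption)
MU_0 = 0.70      # hypothesized mean profit margin from part j


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [mr - f * mc for mr, mc in zip(m[r], m[col])]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


n, k = len(y), 2
rows = [[1.0, a, c] for a, c in zip(x1, x2)]
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
b = solve(xtx, xty)
sse = sum((yi - sum(bj * rj for bj, rj in zip(b, r))) ** 2
          for r, yi in zip(rows, y))
s2 = sse / (n - k - 1)

x0 = [1.0, 3.5, 6500.0]                                   # the added row
y0 = sum(bj * xj for bj, xj in zip(b, x0))                # pre_1
h0 = sum(xj * vj for xj, vj in zip(x0, solve(xtx, x0)))   # x0'(X'X)^-1 x0
se_mean = (s2 * h0) ** 0.5                                # sep_1
se_pred = (s2 * (1 + h0)) ** 0.5

t_stat = (y0 - MU_0) / se_mean
ci = (y0 - T_CRIT * se_mean, y0 + T_CRIT * se_mean)       # lmci_1, umci_1
pi = (y0 - T_CRIT * se_pred, y0 + T_CRIT * se_pred)       # lici_1, uici_1
print(f"estimate={y0:.5f}  T={t_stat:.3f}")
print(f"95% CI for the mean: ({ci[0]:.5f}, {ci[1]:.5f})")
print(f"95% prediction interval: ({pi[0]:.5f}, {pi[1]:.5f})")
```

The prediction interval is wider than the confidence interval because it adds the variability of a single bank around the mean to the uncertainty in the estimated mean.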
II.
Although at first glance the relationship between age and price of Nissan Zs appears to be linear in the age range from 2 to 7 years, it is definitely not so in the age range from 2 to 11 years (see graph, below). From Auto Trader we obtained the data on age and price for a sample of 31 Nissan Zs shown in the table below. (Ages are in years, prices are in hundreds of dollars.) Below the table is a scatterplot of the data.

Age x   Price y     Age x   Price y     Age x   Price y     Age x   Price y
5       85          4       103         10      25          3       135
6       70          4       100         5       82          9       44
4       90          6       75          10      35          9       36
2       150         3       140         5       89          11      33
5       98          6       66          6       95          1       180
6       95          2       169         4       65          5       80
3       129         6       60          6       82          5       105
4       115         8       50          9       42
As you can see from the scatterplot, the data points are not clustered about a straight line but instead follow a curve. This means we should not determine a regression line but instead should try to fit a curve to the data. From the curvature of the scatter diagram, it appears that a parabola might be an appropriate curve to fit to the data. To fit a parabola to the data, we need a regression equation of the form $\hat{y} = a + b_1 x + b_2 x^2$. If we let $x_1 = x$ and $x_2 = x^2$, then the above equation becomes $\hat{y} = a + b_1 x_1 + b_2 x_2$, which is a multiple regression equation with two predictor variables, age and age squared.
a. Create a scatterplot to check Assumption 1 as well as to identify outliers and potential influential observations.
b. Determine the quadratic regression equation for the data.
c. Compute and interpret the coefficient of multiple determination, $R^2$.
d. At the 10% significance level, determine if the model is useful for predicting the response.
e. Obtain the residuals and create residual plots. Decide whether or not it is reasonable to consider that the assumptions for quadratic regression analysis are met by the variables in question.
f. At the 10% significance level, does it appear that the quadratic term can be removed from the full model as unnecessary (thus reducing the model to a linear model)?
g. Obtain a point estimate for the mean price of 8-year-old Nissan Zs.
h. Test the alternative hypothesis that the mean price of 8-year-old Nissan Zs is greater than $4000 (y = 40 hundred dollars).
i. Determine a 90% confidence interval for the mean price of 8-year-old Nissan Zs.
j. Find the predicted price of an 8-year-old Nissan Z for sale by Bob Smith.
k. Determine a 90% prediction interval for the price of an 8-year-old Nissan Z for sale by Bob Smith.

1. Enter the values of the two variables into SPSS and create a scatterplot.
2. Since we want to predict the price of 8-year-old Nissan Zs, enter 8 in the age column of the data window after the last row. Enter a period (.) for the corresponding price value; this tells SPSS that we want a prediction for this case and that the case should be left out of all other computations. (See figure, below.)
3. Create a variable for the squared age values. Select Transform > Compute (see figure, left). For Target Variable enter Age_sq. For Numeric Expression enter age ** 2. Press OK to create the new variable.
4. Select Analyze > Regression > Linear.
5. Select Price as the dependent variable and Age and Age_sq as the independent variables, and select the Enter method.
   Click Statistics, select Estimates and Confidence intervals for the regression coefficients, select Model fit, and click Continue.
   Click Plots, select Normal Probability Plot, and click Continue.
   Click Save, select Unstandardized predicted values and S.E. of mean predictions ($s_{\hat{y}}$), select Unstandardized and Studentized residuals, select Mean and Individual prediction intervals at the 90% level (or whatever level the problem requires), and click Continue.
   Click OK. (See the four figures, following.)
The output from this procedure is extensive and will be shown in parts in the following answers.
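The quadratic fit is an ordinary multiple regression once the squared age is added as a second column, which is all that the Transform > Compute step does. The sketch below mirrors that idea in plain Python as a cross-check. It is an illustration only, and the helper names `solve` and `fit_ols` are hypothetical.

```python
# Nissan Z data from the table above (31 cars): age in years,
# price in hundreds of dollars.
age = [5, 6, 4, 2, 5, 6, 3, 4,
       4, 4, 6, 3, 6, 2, 6, 8,
       10, 5, 10, 5, 6, 4, 6, 9,
       3, 9, 9, 11, 1, 5, 5]
price = [85, 70, 90, 150, 98, 95, 129, 115,
         103, 100, 75, 140, 66, 169, 60, 50,
         25, 82, 35, 89, 95, 65, 82, 42,
         135, 44, 36, 33, 180, 80, 105]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [mr - f * mc for mr, mc in zip(m[r], m[col])]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, resp):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, resp)) for i in range(k)]
    return solve(xtx, xty)


age_sq = [a ** 2 for a in age]                 # same as "age ** 2" in SPSS
rows = [[1.0, float(a), float(a2)] for a, a2 in zip(age, age_sq)]
a0, b_age, b_age_sq = fit_ols(rows, price)
print(f"price-hat = {a0:.3f} + ({b_age:.3f}) age + ({b_age_sq:.3f}) age^2")
```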
a.
Create a scatterplot to check Assumption 1 as well as to identify outliers and potential influential observations.
[Figure: scatterplot "Nissan Z" of price against Age (Years).]
The scatterplot shows a curvilinear pattern that resembles a single simple curve; thus, a quadratic regression model will be fit to the data (Assumption 1 is met). There are no visible potential outliers or influential observations (no points away from the main cluster of points).

b.
Determine the quadratic regression equation for the data.

Coefficients

(B and Std. Error are the unstandardized coefficients, Beta is the standardized coefficient, and Lower/Upper Bound give the 95% confidence interval for B.)

Model 1       B         Std. Error   Beta     t        Sig.   Lower Bound   Upper Bound
(Constant)    209.440   11.484                18.237   .000   185.916       232.965
Age (Years)   -30.776   4.056        -1.955   -7.587   .000   -39.085       -22.467
AGE_SQ        1.330     .322         1.064    4.131    .000   .670          1.989

The quadratic regression equation is: $\hat{y} = 209.440 - 30.776x + 1.330x^2$.

c.
Compute and interpret the coefficient of multiple determination, $R^2$.

Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .950   .903       .896                12.81

The coefficient of multiple determination is 0.903; therefore, about 90.3% of the variation in the price of Nissan Zs is explained by its quadratic relationship with the age of the car. The regression equation appears to be very useful for making predictions since the value of $R^2$ is close to 1.
d.
At the 10% significance level, determine if the model is useful for predicting the response.
ANOVA

Model 1       Sum of Squares   df   F         Sig.
Regression    42895.358        2    130.698   .000
Residual      4594.836         28
Total         47490.194        30

Step 1: Hypotheses
$H_0: \beta_1 = \beta_2 = 0$
$H_a$: at least one $\beta_i \neq 0$
Step 2: Significance Level
$\alpha = 0.10$
Step 3: Rejection Region
Reject the null hypothesis if p-value $\leq 0.10$.
Step 4: Test Statistic and p-value (see above)
F = 130.698, p-value < 0.001
Step 5: Conclusion
Since p-value $< 0.001 \leq 0.10$, we shall reject the null hypothesis.
Step 6: State conclusion in words
At the $\alpha = 0.10$ level of significance, there exists enough evidence to conclude that at least one of the terms of the quadratic model is useful for predicting the price of Nissan Zs; therefore the model is useful.
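The Model Summary and ANOVA entries are tied together by simple arithmetic, so one table can be checked against the other. The sketch below is an illustration using only the sums of squares and degrees of freedom printed in the ANOVA table above. It reproduces R Square, Adjusted R Square, the standard error of the estimate, and F, with variable names chosen for this example.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table above.
ss_regression, df_regression = 42895.358, 2
ss_residual, df_residual = 4594.836, 28
ss_total = ss_regression + ss_residual            # Total row of the table
df_total = df_regression + df_residual            # n - 1 = 30

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * df_total / df_residual

mean_sq_regression = ss_regression / df_regression
mean_sq_residual = ss_residual / df_residual
f_stat = mean_sq_regression / mean_sq_residual
std_error = mean_sq_residual ** 0.5               # Std. Error of the Estimate

print(f"R^2 = {r_squared:.3f}, adjusted R^2 = {adj_r_squared:.3f}")
print(f"s = {std_error:.2f}, F = {f_stat:.3f}")
```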
e.
Obtain the residuals and create residual plots. Decide whether or not it is reasonable to consider that the assumptions for quadratic regression analysis are met by the variables in question.

The residuals [res] and studentized residuals [sre] (as well as the predicted values [pre], the standard errors of each prediction [sep], the prediction interval endpoints [lici & uici], and the confidence interval endpoints [lmci & umci]) can be found in the data window.
[Figures: normal P-P plot of residuals; studentized residuals plotted against Age (Years) and against AGE_SQ, with observation 22 labeled in the residual plots.]
The normal plot of the residuals shows the points close to a diagonal line; thus, Assumption 2 is satisfied. Each of the studentized residual plots shows a random scatter of points with constant variability; thus, Assumption 3 is met.

Note that observation #22 has an unusual residual in each of the studentized residual plots (the residual is more than three standard deviations from the mean residual of 0). This indicates that observation #22 might be an outlier. Also, at first glance one might think that the variability is less for the right half of the plots when compared to the left half. This is likely not the case, and any apparent decrease in variability is probably due to the fact that there are far fewer observations in the right half (fewer values leave less room for variability).
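The flag on observation #22 can be checked numerically. The sketch below assumes the internally studentized residual $r_i = e_i / (s\sqrt{1 - h_{ii}})$, where $h_{ii} = x_i'(X'X)^{-1}x_i$ is the leverage of observation i, and lists every observation with $|r_i| > 3$. It is an illustration in plain Python with a hypothetical helper `solve`.

```python
age = [5, 6, 4, 2, 5, 6, 3, 4,
       4, 4, 6, 3, 6, 2, 6, 8,
       10, 5, 10, 5, 6, 4, 6, 9,
       3, 9, 9, 11, 1, 5, 5]
price = [85, 70, 90, 150, 98, 95, 129, 115,
         103, 100, 75, 140, 66, 169, 60, 50,
         25, 82, 35, 89, 95, 65, 82, 42,
         135, 44, 36, 33, 180, 80, 105]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [mr - f * mc for mr, mc in zip(m[r], m[col])]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


n, k = len(price), 2
rows = [[1.0, float(a), float(a * a)] for a in age]
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yi for r, yi in zip(rows, price)) for i in range(3)]
b = solve(xtx, xty)

resid = [yi - sum(bj * rj for bj, rj in zip(b, r)) for r, yi in zip(rows, price)]
s = (sum(e * e for e in resid) / (n - k - 1)) ** 0.5

# Leverage h_ii = x_i' (X'X)^-1 x_i, one linear solve per observation.
leverage = [sum(rj * vj for rj, vj in zip(r, solve(xtx, r))) for r in rows]
sre = [e / (s * (1 - h) ** 0.5) for e, h in zip(resid, leverage)]

flagged = [i + 1 for i, r in enumerate(sre) if abs(r) > 3]   # 1-based numbering
for i in flagged:
    print(f"observation {i}: age={age[i - 1]}, price={price[i - 1]}, "
          f"studentized residual={sre[i - 1]:.2f}")
```

Observation 22 is the 4-year-old car priced at 65 hundred dollars, well below what the fitted curve gives for that age.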
f.
At the 10% significance level, does it appear that the quadratic term can be removed from the full model as unnecessary (thus reducing the model to a linear model)?
Coefficients

(B and Std. Error are the unstandardized coefficients, Beta is the standardized coefficient, and Lower/Upper Bound give the 95% confidence interval for B.)

Model 1       B         Std. Error   Beta     t        Sig.   Lower Bound   Upper Bound
(Constant)    209.440   11.484                18.237   .000   185.916       232.965
Age (Years)   -30.776   4.056        -1.955   -7.587   .000   -39.085       -22.467
AGE_SQ        1.330     .322         1.064    4.131    .000   .670          1.989

Test for the quadratic term, assuming that the linear term is included in the model:

Step 1: Hypotheses
$H_0: \beta_2 = 0$ (quadratic term is not useful for predicting price)
$H_a: \beta_2 \neq 0$ (quadratic term is useful for predicting price)
Step 2: Significance Level
$\alpha = 0.10$
Step 3: Rejection Region
Reject the null hypothesis if p-value $\leq 0.10$.
Step 4: Test Statistic and p-value (see above)
T = 4.131, p-value < 0.001
Step 5: Conclusion
Since p-value $< 0.001 \leq 0.10$, we shall reject the null hypothesis.
Step 6: State conclusion in words
At the $\alpha = 0.10$ level of significance, there exists enough evidence to conclude that the slope of the quadratic term is not zero and, hence, that the quadratic term is useful (with the linear term) as a predictor of price for Nissan Zs.
The remaining parts will be completed using output from the Data window.
g.
Obtain a point estimate for the mean price of 8-year-old Nissan Zs. The point estimate (pre_1) is 48.33213 hundred dollars ($4833.21).
h.
Test the alternative hypothesis that the mean price of 8-year-old Nissan Zs is greater than $4000 (y = 40 hundred dollars). Test at the 10% significance level.
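Step 1: Hypotheses
$H_0: \mu_y = 40$ (when $x = 8$ and $x^2 = 64$)
$H_a: \mu_y > 40$ (when $x = 8$ and $x^2 = 64$)
Step 2: Significance Level
$\alpha = 0.10$
Step 3: Critical Value(s) and Rejection Region(s)
Critical value: $t_{\alpha,\,df = n-(k+1)} = t_{0.10,\,df = 28} = 1.31$ (the same value that is used for an 80% confidence interval with df = 28).
Reject the null hypothesis if $T \geq 1.31$ (or if p-value $\leq 0.10$).
Step 4: Test Statistic
$T = \dfrac{\hat{y} - 40}{s_{\hat{y}}} = \dfrac{48.33213 - 40}{s_{\hat{y}}} = 2.4678$, where $s_{\hat{y}}$ is the standard error of the mean prediction saved in the data window (sep_1).
Step 5: Conclusion
Since $2.4678 \geq 1.31$ ($0.005 <$ p-value $< 0.01 \leq 0.10$), we shall reject the null hypothesis.
Step 6: State conclusion in words
At the $\alpha = 0.10$ level of significance, there exists enough evidence to conclude that the mean price of 8-year-old Nissan Zs is greater than $4000.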
i.
Determine a 90% confidence interval for the mean price of 8-year-old Nissan Zs. We are 90% confident that the mean price of 8-year-old Nissan Zs is somewhere between 42.58846 (lmci_1) and 54.07581 (umci_1) hundred dollars (or somewhere between $4258.85 and $5407.58).
j.
Find the predicted price of an 8-year-old Nissan Z for sale by Bob Smith. The predicted price (pre_1) of Bob Smith's Nissan Z is 48.33213 hundred dollars ($4833.21).
k.
Determine a 90% prediction interval for the price of an 8-year-old Nissan Z for sale by Bob Smith. We are 90% certain that the price of Bob Smith's 8-year-old Nissan Z will be somewhere between 25.79608 (lici_1) and 70.86819 (uici_1) hundred dollars (or somewhere between $2579.61 and $7086.82).
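Parts g through k can be cross-checked the same way as in Problem I, now at the point $x_0 = (1, 8, 64)$. The sketch below is an illustration under the standard formulas $s_{\hat{y}} = s\sqrt{x_0'(X'X)^{-1}x_0}$ for the mean and $s\sqrt{1 + x_0'(X'X)^{-1}x_0}$ for an individual car. The critical value 1.701 for a 90% interval with 28 degrees of freedom is taken from a t table, and the helper `solve` is hypothetical.

```python
age = [5, 6, 4, 2, 5, 6, 3, 4,
       4, 4, 6, 3, 6, 2, 6, 8,
       10, 5, 10, 5, 6, 4, 6, 9,
       3, 9, 9, 11, 1, 5, 5]
price = [85, 70, 90, 150, 98, 95, 129, 115,
         103, 100, 75, 140, 66, 169, 60, 50,
         25, 82, 35, 89, 95, 65, 82, 42,
         135, 44, 36, 33, 180, 80, 105]
T_CRIT = 1.701   # t(0.05, df = 28), read from a t table (assumption)
MU_0 = 40.0      # hypothesized mean price, in hundreds of dollars


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [mr - f * mc for mr, mc in zip(m[r], m[col])]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


n, k = len(price), 2
rows = [[1.0, float(a), float(a * a)] for a in age]
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yi for r, yi in zip(rows, price)) for i in range(3)]
b = solve(xtx, xty)
sse = sum((yi - sum(bj * rj for bj, rj in zip(b, r))) ** 2
          for r, yi in zip(rows, price))
s2 = sse / (n - k - 1)

x0 = [1.0, 8.0, 64.0]                                     # age 8, age squared 64
y0 = sum(bj * xj for bj, xj in zip(b, x0))                # pre_1
h0 = sum(xj * vj for xj, vj in zip(x0, solve(xtx, x0)))   # x0'(X'X)^-1 x0
se_mean = (s2 * h0) ** 0.5                                # sep_1
se_pred = (s2 * (1 + h0)) ** 0.5

t_stat = (y0 - MU_0) / se_mean
ci = (y0 - T_CRIT * se_mean, y0 + T_CRIT * se_mean)       # lmci_1, umci_1
pi = (y0 - T_CRIT * se_pred, y0 + T_CRIT * se_pred)       # lici_1, uici_1
print(f"estimate={y0:.5f}  T={t_stat:.4f}")
print(f"90% CI for the mean: ({ci[0]:.5f}, {ci[1]:.5f})")
print(f"90% prediction interval: ({pi[0]:.5f}, {pi[1]:.5f})")
```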