Lesson 3-Multiple Linear Regression

Uploaded by

Innoj Maco

ENGINEERING DATA ANALYSIS

University of Southeastern Philippines


COLLEGE OF ENGINEERING
Obrero, Davao City

MATH 212
ENGINEERING DATA ANALYSIS

DALIA M. RECONALLA, Ph.D


August 2020


Faculty Information:

Name: Dalia M. Reconalla


Email: dalia.reconalla@usep.edu.ph
Contact Number: 0906-209-6611
Office: College of Engineering
Contact Number: (082) 224-3334
Consultation Hours: By appointment, which may be arranged through:
o Official email
o Facebook messenger/Facebook group chat
o Text or call

Getting help

For academic concerns (College/Adviser - Contact details)


For administrative concerns (College Dean - Contact details)
For UVE concerns (KMD - Contact details)
For health and wellness concerns (UAGC, HSD and OSAS - Contact
details)


TABLE OF CONTENTS

Cover page
Faculty Information
Table of Contents
Lesson 3
Application 3
Module Summary
Module Assessment
References
F-Distribution Table


Learning Outcome:

o Estimate the value of the response variable from given values of the independent
variables.
o Conduct a hypothesis test of the significance of the regression model.

Time Frame: Week 13

Introduction

In most research problems where regression analysis is applied, more than one
independent variable is needed in the regression model. Most scientific
mechanisms are complex enough that predicting an important response requires
a multiple regression model. When this model is linear in the coefficients, it
is called a multiple linear regression model.

Activity

Given the simple linear regression model y = 30.04 + 0.897x relating the
intelligence test score x to the freshman Math 121 grade y of a group of engineering
students, what grade would be predicted for a randomly selected student with an
intelligence test score of 75?

Analysis

Sketch the graph and the regression line and interpret the model.

Abstraction

From Lesson 2 (Simple Linear Regression) we learned that when there is one
independent variable or predictor, the regression equation for predicting y from x is

$$\hat{y} = a + bx$$

The simultaneous use of two or more independent variables in predicting a dependent
variable is called multiple regression.

When there are two independent variables,

$$\hat{y}_i = a + b_1 x_{1i} + b_2 x_{2i}$$

where:
$\hat{y}_i$ = the predicted value,
$a$ = the y-intercept,
$b_1$ = the expected change in y when $x_1$ changes one unit and $x_2$ remains
constant,
$x_{1i}$ = the value of the first independent variable,
$b_2$ = the expected change in y when $x_2$ changes one unit and $x_1$ remains
constant,
$x_{2i}$ = the value of the second independent variable, and
$i$ = the observation number ($i = 1, 2, \ldots, n$).

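The equation for two independent variables can be extended to any number of
independent variables, say k, such as $x_1, x_2, \ldots, x_k$. The mean of
$Y \mid x_1, x_2, \ldots, x_k$ (read as "Y given $x_1, x_2, \ldots, x_k$") is given by the
multiple regression model:

$$\mu_{Y \mid x_1, x_2, \ldots, x_k} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

(k = number of independent variables)

and the estimated response is obtained from the sample regression equation

$$\hat{y} = b_0 + b_1 x_1 + \cdots + b_k x_k$$

where each regression coefficient $\beta_j$ is estimated by $b_j$ from the sample data
using the method of least squares.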

Estimating Coefficients

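We shall obtain the least squares estimators of the parameters $\beta_0, \beta_1, \ldots, \beta_k$ by
fitting the multiple linear regression model

$$\mu_{Y \mid x_1, x_2, \ldots, x_k} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

to the data points

$$\{(x_{1i}, x_{2i}, \ldots, x_{ki}, y_i);\ i = 1, 2, \ldots, n \text{ and } n > k\},$$

where $y_i$ is the observed response to the values $x_{1i}, x_{2i}, \ldots, x_{ki}$ of the k
independent variables $x_1, x_2, \ldots, x_k$.

Each observation $(x_{1i}, x_{2i}, \ldots, x_{ki}, y_i)$ satisfies the equation

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \epsilon_i$$

or

$$y_i = \hat{y}_i + e_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_k x_{ki} + e_i,$$

where $\epsilon_i$ and $e_i$ are the random error and residual, respectively, associated with
the response $y_i$.

In using the concept of least squares to arrive at the estimates $b_0, b_1, \ldots, b_k$,
we minimize the expression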


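$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1i} - b_2 x_{2i} - \cdots - b_k x_{ki})^2.$$

Differentiating SSE in turn with respect to $b_0, b_1, \ldots, b_k$ and equating to
zero, we generate the set of k + 1 normal equations:

$$\begin{aligned}
n b_0 + b_1 \sum x_{1i} + b_2 \sum x_{2i} + \cdots + b_k \sum x_{ki} &= \sum y_i \\
b_0 \sum x_{1i} + b_1 \sum x_{1i}^2 + b_2 \sum x_{1i} x_{2i} + \cdots + b_k \sum x_{1i} x_{ki} &= \sum x_{1i} y_i \\
&\ \ \vdots \\
b_0 \sum x_{ki} + b_1 \sum x_{ki} x_{1i} + b_2 \sum x_{ki} x_{2i} + \cdots + b_k \sum x_{ki}^2 &= \sum x_{ki} y_i
\end{aligned}$$

These equations can be solved for $b_0, b_1, \ldots, b_k$ by any appropriate method for
solving systems of linear equations, including the vector-matrix approach. Because
the computation is tedious by hand, estimating the coefficients in practice requires a
computer program.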
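
The normal equations above can be assembled and solved mechanically. The following is a minimal sketch in plain Python, not a production routine: the function names `normal_equations` and `solve` and the small data set are hypothetical, and the data are generated exactly from y = 1 + 2x1 + 3x2 so that the recovered coefficients are easy to check.

```python
# Sketch: build and solve the k + 1 normal equations for a least-squares fit.
# The data are hypothetical and generated exactly from y = 1 + 2*x1 + 3*x2.

def normal_equations(xs, y):
    """Return (A, c) such that A b = c are the normal equations.

    xs is a list of k predictor columns and y is the response column.
    A leading column of ones is added for the intercept b0.
    """
    n = len(y)
    cols = [[1.0] * n] + [list(map(float, col)) for col in xs]
    A = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    c = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    return A, c


def solve(A, c):
    """Solve A b = c by Gaussian elimination with partial pivoting."""
    m = len(c)
    M = [row[:] + [rhs] for row, rhs in zip(A, c)]  # augmented matrix
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for j in range(col, m + 1):
                M[r][j] -= factor * M[col][j]
    # Back substitution on the upper-triangular system.
    b = [0.0] * m
    for i in range(m - 1, -1, -1):
        tail = sum(M[i][j] * b[j] for j in range(i + 1, m))
        b[i] = (M[i][m] - tail) / M[i][i]
    return b


x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

A, c = normal_equations([x1, x2], y)
b0, b1, b2 = solve(A, c)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

The first row of `A` holds n and the sums of x1 and x2, which matches the first normal equation. In practice a numerical library would normally be used in place of the hand-written solver.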

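This time, we will consider only two independent variables as an example, for ease of
manual algebraic computation, where

$$\hat{y} = a + b_1 x_1 + b_2 x_2$$

and the regression coefficients $a$ (the intercept, written $b_0$ in the general case),
$b_1$, and $b_2$ are determined from the system of equations, with $n$ the number of
observations:

$$\begin{aligned}
\sum y &= n a + b_1 \sum x_1 + b_2 \sum x_2 \\
\sum x_1 y &= a \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2 \\
\sum x_2 y &= a \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2
\end{aligned}$$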

Example 1: The average monthly electric power consumption (y) at a certain
manufacturing plant is considered to be linearly dependent on the ambient
temperature ($x_1$) and the number of working days in a month ($x_2$). Consider the
one-year data given in the table.
a. Determine the least-squares estimates of the associated linear regression
coefficients. Find the regression equation representing the average electric power
consumption in terms of ambient temperature and number of working days in a
month.
b. Estimate the average monthly electric power consumption if the plant's average
ambient temperature is 48 and the number of working days in a month is 22.


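Solution: Formulate the equations for the regression coefficients with k = 2
independent variables and n = 12 observations. From the system of equations:

$$\begin{aligned}
\sum y &= n a + b_1 \sum x_1 + b_2 \sum x_2 \\
\sum x_1 y &= a \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2 \\
\sum x_2 y &= a \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2
\end{aligned}$$

Solve for the values of the sums (presented in the table), then substitute those
values into the system.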


Thus, the system of equations for the regression coefficients is:

Find the values of $a$, $b_1$, and $b_2$ using an algebraic method, determinants, or matrices.

The estimated regression equation based on the data is

$$\hat{y} = a + 0.39 x_1 + 10.80 x_2$$

Interpretation:

For every one-unit increase in the ambient temperature, the average monthly electric
power consumption increases by 0.39 units, holding the number of working
days in a month constant. Likewise, for every additional working day in a month,
the average monthly power consumption increases by 10.80 units, holding the
ambient temperature constant.

b. Estimate the average monthly electric power consumption if the plant's average
ambient temperature is 48 and the number of working days in a month is 22.

From the equation $\hat{y} = a + 0.39 x_1 + 10.80 x_2$, where $x_1 = 48$ and $x_2 = 22$,

$$\hat{y} = a + 0.39(48) + 10.80(22) = 222.48$$
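
The prediction in part (b) can be checked numerically. The sketch below uses only the figures stated in the example: the slopes 0.39 and 10.80, the inputs 48 and 22, and the reported estimate 222.48. The intercept is not taken from the data; it is back-calculated under the assumption that the reported estimate was computed with these rounded slopes, so treat it as an illustration and not as the fitted value. The function name `predict` is hypothetical.

```python
# Slopes, inputs and prediction as stated in Example 1 of the text.
b1, b2 = 0.39, 10.80
x1, x2 = 48, 22
reported_prediction = 222.48

# Assumption: the reported prediction was computed with the rounded slopes,
# so the intercept can be back-calculated from it.
a = reported_prediction - b1 * x1 - b2 * x2


def predict(x1, x2):
    """Evaluate y_hat = a + b1*x1 + b2*x2 for the two-predictor model."""
    return a + b1 * x1 + b2 * x2


print(f"implied intercept a = {a:.2f}")
print(f"predicted consumption = {predict(48, 22):.2f}")
```

The same `predict` function can be called with other temperature and working-day values, provided they stay within the range of the data used to fit the model.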

Properties of the Least Squares Estimator

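For the linear regression equation, written in matrix form as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon},$$

an unbiased estimate of the error variance $\sigma^2$ is given by the error or residual
mean square

$$s^2 = \frac{SSE}{n - k - 1},$$

where

$$SSE = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2.$$

The sum of squares identity

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$

continues to hold; that is,

SST = SSR + SSE

with SST = $\sum (y_i - \bar{y})^2$ = total sum of squares

and SSR = $\sum (\hat{y}_i - \bar{y})^2$ = regression sum of squares.

There are k degrees of freedom associated with SSR and, as always,
SST has n - 1 degrees of freedom. By subtraction, SSE has n - k - 1 degrees of freedom.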

Inference in Multiple Regression

In multiple regression analysis, the response variable is described as a function of
more than one predictor variable. Therefore, several types of inference
can be made using this model. In the simple linear model studied earlier, the test for
the slope (t-test) is equivalent to the test for the utility of the model (F-test). In
multiple regression, however, the two tests differ because there is more than one slope
parameter.

A Test of Model Adequacy

To find a statistic that measures how well a multiple regression model fits a set of
data, we use the multiple regression equivalent of $r^2$, the coefficient of determination
for the straight-line model. Thus, we define the multiple coefficient of determination,
$R^2$, as

$$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} = 1 - \frac{SSE}{SS_{yy}}$$

where: $\hat{y}_i$ = the predicted value of $y_i$ using the underlying model.

$R^2$ = the fraction of the sample variation of the y values
(measured by $SS_{yy}$, the total sum of squares) that is explained by the least-squares
prediction equation.

Thus, $R^2$ = 0 implies a complete lack of fit of the model to the data, and
$R^2$ = 1 implies a perfect fit, with the model passing through every data point.

In general, $0 \le R^2 \le 1$, and the larger the value of $R^2$, the better the model fits the data.


The fact that $R^2$ is a sample statistic implies that it can be used to make inferences
about the utility of the entire model for predicting the population of y values at each
setting of the independent variables.
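
As a small illustration of the definition R^2 = 1 - SSE/SS_yy, the sketch below evaluates the two boundary cases described above. The function name `r_squared` and the response values are hypothetical and chosen only to make the extremes visible.

```python
def r_squared(y, y_hat):
    """Multiple coefficient of determination: 1 - SSE / SS_yy."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / ss_yy


# Hypothetical observed responses.
y = [10.0, 12.0, 15.0, 19.0, 24.0]

perfect = list(y)                       # fitted values equal to the data
mean_only = [sum(y) / len(y)] * len(y)  # fitted values all equal to y-bar

print(r_squared(y, perfect))    # perfect fit
print(r_squared(y, mean_only))  # complete lack of fit
```

A model whose fitted values reproduce the data has SSE = 0, while a model that predicts the sample mean for every observation has SSE equal to SS_yy.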

Testing the Utility of Multiple Regression Model: The Global F-Test

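$H_0$: $\beta_1 = \beta_2 = \cdots = \beta_k = 0$

$H_a$: At least one of the parameters $\beta_1, \beta_2, \ldots, \beta_k$ is nonzero.

Test statistic:

$$F = \frac{R^2 / k}{(1 - R^2) / (n - [k + 1])}$$

Rejection Region: $F > F_{\alpha}(k,\ n - [k + 1])$

Using the p-value approach: Reject $H_0$ if p-value < $\alpha$, where the

p-value = $P(F(k,\ n - [k + 1]) > F)$.

Conditions:

1. The error component $\epsilon$ is normally distributed.
2. The mean of $\epsilon$ is zero.
3. The errors associated with different observations are independent.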

Analysis of Variance (ANOVA)

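The analysis of variance table for a multiple regression problem provides a test of the
null hypothesis

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0,$$

which implies that the response variable is not related to any of the k
input variables.

Analysis of Variance (ANOVA) Table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square               F
Regression            SSR              k                    MSR = SSR/k               F = MSR/MSE
Error                 SSE              n - (k + 1)          MSE = SSE/(n - [k + 1])
Total                 SST              n - 1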


The tail values of the F-distribution are given in Appendix Table 1. The F-test statistic
becomes large as the coefficient of determination $R^2$ becomes large.

To determine how large F must be before we can conclude at a
given significance level $\alpha$ that the model is useful for predicting y, we set up the
rejection region (RR) as $F > F_{\alpha}(k,\ n - [k + 1])$.

Example 2. From the given data in Example 1, decide, at the 5% significance level,
whether the data provide sufficient evidence to conclude that the ambient temperature
and the number of working days in a month (predictor variables) are useful for
predicting the average monthly power consumption (response variable).

Solution:

Step 1. State the null and alternative hypotheses.

$H_0$: $\beta_1 = \beta_2 = 0$
$H_a$: At least one of the parameters $\beta_1, \beta_2$ is nonzero.

Step 2. Decide on the significance level, α.

Perform the hypothesis test at the 5% significance level, or α = 0.05.

Step 3. Compute the value of the test statistic (F).

k = 2, n = 12
SSE = $\sum (y_i - \hat{y}_i)^2$ = 2004.7456


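SST = $\sum (y_i - \bar{y})^2$ = 6707.667

Finding $R^2$:

$$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} = 1 - \frac{2004.7456}{6707.667} = 0.7011$$

So

$$F = \frac{R^2 / k}{(1 - R^2) / (n - [k + 1])} = \frac{0.7011 / 2}{0.2989 / 9} = 10.56.$$

We can also find F using the formula:

$$F = \frac{MSR}{MSE}$$

Solving for the mean squares:

$$MSR = \frac{SSR}{k} = \frac{4702.9214}{2} = 2351.4607$$

$$MSE = \frac{SSE}{n - [k + 1]} = \frac{2004.7456}{9} = 222.7495$$

$$F = \frac{2351.4607}{222.7495} = 10.56$$

Step 4. Decide whether or not to reject $H_0$.

Compare $F$ with $F_{0.05}(2, 9)$.

Since $F_{0.05}(2, 9)$ = 4.2565 (refer to the F Distribution Table, Appendix
Table 1 in this module), we have $F$ (10.56) > $F_{0.05}(2, 9)$ (4.2565).

We therefore reject the null hypothesis and
conclude that, at the 5% level of significance, there is sufficient evidence that
the ambient temperature and the number of working days in a month can be used to
predict the average monthly power consumption. Likewise, it can be concluded that
average monthly power consumption is linearly related to either ambient temperature
or number of working days in a month or both.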
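
The arithmetic in Example 2 can be reproduced with a short script. The sketch below uses only the values given in the example (k = 2, n = 12, SSE = 2004.7456, SST = 6707.667, and the tabulated critical value 4.2565). The p-value line relies on a closed-form tail probability that holds only when the numerator degrees of freedom equal 2; for other values of k a statistical library or the F table would be needed. Variable names are illustrative.

```python
# Values taken from Example 2 in the text.
k, n = 2, 12
sse, sst = 2004.7456, 6707.667

df_error = n - (k + 1)
ssr = sst - sse                    # sum of squares identity: SST = SSR + SSE
r2 = 1 - sse / sst
msr = ssr / k
mse = sse / df_error

f_from_ms = msr / mse
f_from_r2 = (r2 / k) / ((1 - r2) / df_error)

# Closed-form upper-tail probability, valid ONLY when the numerator
# degrees of freedom equal 2: P(F(2, d) > f) = (1 + 2f/d) ** (-d/2).
p_value = (1 + 2 * f_from_ms / df_error) ** (-df_error / 2)

f_critical = 4.2565                # F_0.05(2, 9) from Appendix Table 1
reject_h0 = f_from_ms > f_critical

print(f"R^2 = {r2:.4f}, F = {f_from_ms:.2f}, p-value = {p_value:.4f}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

Both routes to F (through the mean squares and through R^2) give the same value, and the p-value falls below 0.05, which agrees with the rejection-region decision.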

Multiple Correlation

The correlation between y and the combined predictors x1, x2, . . . , xk is
called the coefficient of multiple correlation and is denoted by $R_{y.12 \ldots k}$, or
simply R.


The dot after y in the notation separates the dependent variable, y, from the
independent variables, x1, x2, . . . , xk.

For the two-predictor case, $R_{y.12}$ is given by

$$R_{y.12} = \sqrt{\frac{r_{y1}^2 + r_{y2}^2 - 2\, r_{y1}\, r_{y2}\, r_{12}}{1 - r_{12}^2}}$$

where
$r_{y1}$, $r_{y2}$, and $r_{12}$ are correlation coefficients for the respective variables.
The multiple correlation coefficient can assume values from 0 to 1, where 0 indicates
the absence of a linear multiple correlation between y and the independent variables
and 1 indicates a perfect linear multiple correlation in which all of the observed y's
fall on the regression plane.
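
For two predictors, R can be computed directly from the three pairwise correlations. The sketch below assumes the standard two-predictor formula R = sqrt((r_y1^2 + r_y2^2 - 2 r_y1 r_y2 r_12) / (1 - r_12^2)); the function name `multiple_r` and the correlation values are hypothetical and serve only as an illustration.

```python
import math


def multiple_r(r_y1, r_y2, r_12):
    """R_{y.12} for two predictors, from the three pairwise correlations."""
    if abs(r_12) >= 1.0:
        raise ValueError("predictors are perfectly correlated; R is undefined")
    numerator = r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12
    return math.sqrt(numerator / (1 - r_12 ** 2))


# Hypothetical correlations, chosen only for illustration.
R = multiple_r(r_y1=0.6, r_y2=0.5, r_12=0.3)
print(f"R = {R:.4f}, R^2 = {R ** 2:.4f}")

# With uncorrelated predictors, R^2 reduces to r_y1^2 + r_y2^2.
R_uncorrelated = multiple_r(0.6, 0.5, 0.0)
print(f"R^2 with r_12 = 0: {R_uncorrelated ** 2:.4f}")
```

The guard on `r_12` reflects the fact that the denominator vanishes when the two predictors are perfectly correlated.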

Coefficient of Multiple Determination. The proportion of variance in y
accounted for by the combined predictors x1, x2, . . . , xk is obtained by squaring
the multiple correlation coefficient and is called the coefficient of multiple
determination, $R^2$.

This coefficient is an extension of the coefficient of determination for one predictor,
$r^2$, discussed in Lesson 1.

For , =

For , =

For , =√

A comparison of the value of $R^2$ with that of $r^2$ indicates the improvement in
predicting y that can be achieved by using a multiple regression equation instead of a
one-predictor regression equation.

Table 1. Intercorrelation among the Variables

Variable    y         x1          x2
y           1.000     $r_{y1}$    $r_{y2}$
x1                    1.000       $r_{12}$
x2                                1.000


The coefficient of multiple determination will be relatively large when the correlation
of each of the predictors with y is large and the correlations among the predictors are
0 or very small.

In fact, if the independent variables are uncorrelated,

$$R_{y.12 \ldots k}^2 = r_{y1}^2 + r_{y2}^2 + \cdots + r_{yk}^2$$

If correlations exist among some or all of the independent variables, it is usually the
case that

$$R_{y.12 \ldots k}^2 < r_{y1}^2 + r_{y2}^2 + \cdots + r_{yk}^2$$

The presence of nonzero correlations among the independent variables is referred to
as multicollinearity.

Extreme multicollinearity occurs when one independent variable is a linear function
of other independent variables; for example, $x_2$ might equal $3x_1$, or $x_3$ might equal
$x_1 + x_2$.

In the latter case, the inclusion of $x_3$ in the regression equation would not account for
any variance in y not already accounted for by $x_1$ and $x_2$. Ideally, you would like to
have predictors that have high correlations with the dependent variable and zero
correlations with each other. Unfortunately, in the behavioral sciences, health sciences,
and education, it is difficult to find predictors that meet these criteria. Once you have
found three or four good predictors, it is often difficult to find additional predictors
that are not highly correlated with at least one of the original predictors.
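
A quick way to screen for multicollinearity is to compute the correlation between each pair of predictors. The sketch below is a minimal illustration with hypothetical predictor values; the helper name `pearson_r` is an assumption, and the first pair reproduces the extreme case from the text in which one predictor is three times the other.

```python
import math


def pearson_r(u, v):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    s_uv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    s_uu = sum((a - mu) ** 2 for a in u)
    s_vv = sum((b - mv) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


# Hypothetical predictor values.
x1 = [2.0, 4.0, 5.0, 7.0, 9.0]
x2 = [3 * v for v in x1]            # extreme case from the text: x2 = 3*x1
x3 = [1.0, 6.0, 2.0, 8.0, 3.0]      # a predictor only weakly related to x1

r_12 = pearson_r(x1, x2)
r_13 = pearson_r(x1, x3)
print(f"r(x1, x2) = {r_12:.4f}")
print(f"r(x1, x3) = {r_13:.4f}")
```

A correlation of 1 between two predictors means the second adds no information beyond the first, so the normal equations no longer have a unique solution.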

Application

1. Construct a table showing the intercorrelation among the variables average
monthly electric power consumption (y), ambient temperature ($x_1$), and
number of working days in a month ($x_2$) in Example 1, and determine the
multiple coefficient of determination $R^2$. Verify whether multicollinearity exists
among the independent variables.

Closure
Congratulations! You have successfully completed the tasks and activities
for Lesson 3. It is expected that your knowledge about correlation and regression will
surely help you in solving other real life problems or practical applications involving
predictions or estimation.
You are almost done with this module. The module summary and assessment will
follow.


SUMMARY

o A measure of the degree of linear relationship is called the
correlation coefficient, r.

   - The value of r is a measure of the extent to which x and y are
     linearly related.
   - The value of r does not depend on the unit of
     measurement for either variable.
   - The value of r does not depend on which of the two
     variables is considered x.
   - The value of r is between -1 and +1.
   - A correlation coefficient of r = 1 occurs only when all
     the points in a scatterplot of the data lie exactly on a
     straight line that slopes upward. Similarly, r = -1 only
     when all the points lie exactly on a downward-sloping line.

o Regression analysis is a statistical technique used for
determining the functional form of the relationship between
two or more variables, where one variable is called the
dependent or response variable and the rest are called the
independent or explanatory variables.

o The coefficient of determination, denoted by $r^2$, gives the
proportion of variation in y that can be attributed to an
approximate linear relationship between x and y.


MODULE ASSESSMENT

Solve the following problems.

1. Regression methods were used to analyze the data from a study investigating
the relationship between roadway surface temperature (x) and pavement
deflection (y). The data follow:

Given the data above:
(a) Estimate the intercept and slope regression coefficients. Write
the estimated regression line.
(b) Find the standard errors of the slope and intercept coefficients.
(c) Compute the coefficient of determination, $r^2$. Comment on the value.
(d) Use a t-test to test for significance of the intercept and slope coefficients
at $\alpha$ = 0.05.
(e) Draw the regression line.

2. The article "How to Optimize and Control the Wire Bonding Process"
described an experiment carried out to assess the impact of the variables
force, $x_1$, and temperature (degrees Celsius), $x_2$, on ball bond shear strength (gm), y.
The following data were generated:


a. Find the regression equation representing the ball bond shear strength in terms
of force and temperature.
b. Determine whether the data provide sufficient evidence to conclude that
force and temperature are useful for predicting the ball bond shear strength at the 5%
level of significance.




Appendix Table 1. F Distribution
