
Module 3 - Regression and Correlation Analysis
Regression Analysis
• A regression analysis generates an equation to describe the statistical
relationship between one or more predictors and the response
variable and to predict new observations. Linear regression usually
uses the ordinary least squares estimation method which derives the
equation by minimizing the sum of the squared residuals.
• Example: you work for a potato chip company that is analyzing factors that affect the percentage of crumbled potato chips per container before shipping (response variable, Y). You conduct the regression analysis and include the percentage of potato relative to other ingredients and the cooking temperature (in Celsius) as your two predictors (X).
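A minimal Python sketch (not from the slides) of this kind of two-predictor OLS fit, using made-up numbers in place of the potato chip data:

import numpy as np
import statsmodels.api as sm

# Hypothetical measurements: % potato, cooking temperature (°C), % crumbled chips
potato_pct = np.array([45, 50, 55, 60, 65, 70, 75, 80], dtype=float)
temp_c = np.array([160, 150, 170, 155, 165, 175, 150, 180], dtype=float)
crumbled = np.array([8.2, 7.9, 7.1, 6.8, 6.0, 5.7, 5.1, 4.6])

X = sm.add_constant(np.column_stack([potato_pct, temp_c]))  # intercept + two predictors
model = sm.OLS(crumbled, X).fit()                           # ordinary least squares fit
print(model.params)                                         # intercept and the two slopes
print(model.predict([[1.0, 62.0, 168.0]]))                  # predict a new observation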
What is simple linear regression?
• Simple linear regression examines the linear relationship between
two continuous variables: one response (y) and one predictor (x).
When the two variables are related, it is possible to predict a
response value from a predictor value with better than chance
accuracy.
• Regression provides the line that "best" fits the data. This line can
then be used to:
Examine how the response variable changes as the predictor variable
changes.
Predict the value of a response variable (y) for any predictor variable (x).
What is multiple linear regression?
• Multiple linear regression examines the linear relationships
between one continuous response and two or more
predictors.
• If the number of predictors is large, then before fitting a
regression model with all the predictors, you should use
stepwise or best subsets model-selection techniques to
screen out predictors not associated with the responses.
What is ordinary least squares regression?
• In ordinary least squares (OLS) regression, the estimated equation is
calculated by determining the equation that minimizes the sum of the
squared distances between the sample's data points and the values
predicted by the equation.

• Response vs. predictor plot: with one predictor (simple linear regression), the line is chosen so that the sum of the squared distances from each point to the line is as small as possible.
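A minimal sketch (assumed example data) of what minimizing the sum of squared distances computes for one predictor:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Closed-form least-squares estimates for y = b0 + b1*x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
b0 = y.mean() - b1 * x.mean()                                               # intercept
residuals = y - (b0 + b1 * x)
print(b0, b1, np.sum(residuals ** 2))  # coefficients and the minimized sum of squared residuals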
Assumptions that should be met for OLS
regression
• OLS regression provides the most precise, unbiased estimates only
when the following assumptions are met:
• The regression model is linear in the coefficients. Least squares can
model curvature by transforming the variables (instead of the
coefficients). You must specify the correct functional form in order to
model any curvature.
• Quadratic model: here, the predictor variable X is squared in order to model the curvature (a sketch of fitting this form appears after this list).
Y = b₀ + b₁X + b₂X²
• Residuals have a mean of zero. Inclusion of a constant in the model
will force the mean to equal zero.
• All predictors are uncorrelated with the residuals.
• Residuals are not correlated with each other (serial correlation).
• Residuals have a constant variance.
• No predictor variable is perfectly correlated (r = 1) with another predictor variable. It is also best to avoid high, even if imperfect, correlations among predictors (multicollinearity).
• Residuals are normally distributed.
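A minimal sketch (hypothetical data) of the quadratic model mentioned above: squaring the predictor captures curvature while the model stays linear in the coefficients.

import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.0, 4.8, 9.3, 15.9, 24.5, 35.2, 47.8, 62.6])  # roughly quadratic response

X = sm.add_constant(np.column_stack([x, x ** 2]))  # columns: 1, X, X squared
fit = sm.OLS(y, X).fit()
print(fit.params)  # estimates of b0, b1, b2 in Y = b0 + b1*X + b2*X^2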
Slope and intercept of the regression line
• The slope indicates the steepness of a line and the intercept indicates
the location where it intersects an axis.
• The slope and the intercept define the linear relationship between
two variables, and can be used to estimate an average rate of change.
The greater the magnitude of the slope, the steeper the line and the
greater the rate of change.
• By examining the equation of a line, you quickly can discern its slope
and y-intercept (where the line crosses the y-axis).
• The slope is positive 5: when x increases by 1, y increases by 5. The y-intercept is 2. (Equation: y = 5x + 2.)
• The slope is negative 0.4: when x increases by 1, y decreases by 0.4. The y-intercept is 7.2. (Equation: y = −0.4x + 7.2.)
• The slope is 0: when x increases by 1, y neither increases nor decreases. The y-intercept is −4. (Equation: y = −4.)
What are categorical, discrete, and continuous
variables?
Quantitative variables can be classified as discrete or continuous.
•Categorical variable
Categorical variables contain a finite number of categories or distinct groups.
Categorical data might not have a logical order. For example, categorical predictors
include gender, material type, and payment method.
•Discrete variable
Discrete variables are numeric variables that have a countable number of values
between any two values. A discrete variable is always numeric. For example, the
number of customer complaints or the number of flaws or defects.
•Continuous variable
Continuous variables are numeric variables that have an infinite number of values
between any two values. A continuous variable can be numeric or date/time. For
example, the length of a part or the date and time a payment is received.
What are response and predictor
variables?
• Variables of interest in an experiment (those that are measured or observed) are
called response or dependent variables. Other variables in the experiment that
affect the response and can be set or measured by the experimenter are called
predictor, explanatory, or independent variables.
• For example, you might want to determine the recommended baking time for a
cake recipe or provide care instructions for a new hybrid plant.
Regression analyses for continuous
response variables
• Regression - Model the relationship between categorical or
continuous predictors and one response, and use the model to
predict response values for new observations. Easily include
interaction and polynomial terms, transform the response, or use
stepwise regression if needed.
In Minitab, choose Stat > Regression > Regression > Fit Regression Model.
Basic measures of association
• Correlation - Use to calculate Pearson's correlation or Spearman
rank-order correlation (also called Spearman's rho).
In Minitab, choose Stat > Basic Statistics > Correlation.

• Covariance - Use to calculate the covariance, a measure of the relationship between two variables. The covariance is not standardized, unlike the correlation coefficient.
In Minitab, choose Stat > Basic Statistics > Covariance.
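Outside Minitab, a minimal sketch of both measures on hypothetical data:

import numpy as np

x = np.array([1.2, 2.4, 3.1, 4.8, 5.0, 6.3])
y = np.array([2.0, 4.1, 6.2, 7.9, 10.3, 12.1])

print(np.corrcoef(x, y)[0, 1])  # Pearson correlation, standardized to the range -1 to +1
print(np.cov(x, y)[0, 1])       # covariance, expressed in the units of x times the units of y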
Overview for Fitted Line Plot
• Use Fitted Line Plot to display the relationship between one
continuous predictor and a response. You can fit a linear, quadratic,
or cubic model to the data.
• A fitted line plot shows a scatterplot of the data with a regression line
representing the regression equation.
• For example, an engineer at a manufacturing site wants to examine
the relationship between energy consumption and the setting of a
machine used in the manufacturing process. The engineer thinks the
relationship between these variables is curvilinear. Therefore, the
engineer created a fitted line plot and fits a quadratic model to the
data.
• Where to find this analysis
• To create a fitted line plot, choose Stat > Regression >
Fitted Line Plot.
• When to use an alternate analysis
• If you have one categorical predictor and no continuous
predictors, use One-Way ANOVA.
• If you have more than one predictor, use Fit Regression
Model.
Assumption of Fitted Line Plots
• The data should include only one continuous predictor
• The response variable should be continuous
• Collect data using best practices
• To ensure that your results are valid, consider the following guidelines:
• Make certain that the data represent the population of interest.
• Collect enough data to provide the necessary precision.
• Measure variables as accurately and precisely as possible.
• Record the data in the order it is collected.
• The model should provide a good fit to the data
Determine how well the model fits your data
• R-sq
R² is the percentage of variation in the response that is explained by the
model. The higher the R² value, the better the model fits your data. R² is
always between 0% and 100%. R² always increases when you add additional
predictors to a model.
• R-sq (adj)
• Use adjusted R² when you want to compare models that have different
numbers of predictors. R² always increases when you add a predictor to the
model, even when there is no real improvement to the model. The adjusted
R² value incorporates the number of predictors in the model to help you
choose the correct model.
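A minimal sketch of the adjusted R² idea, using a hypothetical helper adjusted_r2 and assumed values for R², the sample size, and the number of predictors:

def adjusted_r2(r2, n, p):
    # n = number of observations, p = number of predictors in the model
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.90, n=30, p=2))  # about 0.893
print(adjusted_r2(0.90, n=30, p=8))  # about 0.862: same R², more predictors, lower adjusted R²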
Consider the following when you compare
the R2 values:
• Small samples do not provide a precise estimate of the strength of
the relationship between the response and predictors. If you need
R2 to be more precise, you should use a larger sample (typically, 40
or more).
• R2 is just one measure of how well the model fits the data. Even when
a model has a high R2, you should check the residual plots to verify
that the model meets the model assumptions.
Determine whether your model meets the
assumptions of the analysis
• Use the residual plots to help you determine whether the model is
adequate and meets the assumptions of the analysis.
• If the assumptions are not met, the model may not fit the data well
and you should use caution when you interpret the results.
Residuals versus fits plot
• Use the residuals versus fits plot to verify the
assumption that the residuals are randomly
distributed and have constant variance. Ideally, the
points should fall randomly on both sides of 0, with
no recognizable patterns in the points.
Patterns in the points may indicate that the model does not meet the model assumptions.
Residuals versus order plot
• Use the residuals versus order plot to verify the assumption that the
residuals are independent from one another. Independent residuals
show no trends or patterns when displayed in time order. Patterns in
the points may indicate that residuals near each other may be
correlated, and thus, not independent. Ideally, the residuals on the
plot should fall randomly around the center line:
The following types of patterns may indicate that the residuals are dependent: a trend, a cycle, or a shift.
Normal probability plot
• Use the normal probability plot of the residuals
to verify the assumption that the residuals are
normally distributed. The normal probability
plot of the residuals should approximately
follow a straight line.
Departures from a straight line may indicate that the model does not meet the model assumptions.
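A minimal sketch of recreating Minitab's four-in-one residual plots with matplotlib and scipy, assuming an existing statsmodels result named fit (for example, from the quadratic sketch earlier):

import matplotlib.pyplot as plt
import scipy.stats as stats

resid, fitted = fit.resid, fit.fittedvalues
fig, ax = plt.subplots(2, 2, figsize=(8, 6))
stats.probplot(resid, dist="norm", plot=ax[0, 0])            # normal probability plot
ax[0, 1].scatter(fitted, resid); ax[0, 1].axhline(0)         # residuals versus fits
ax[1, 0].hist(resid)                                         # histogram of residuals
ax[1, 1].plot(range(1, len(resid) + 1), resid, marker="o")   # residuals versus order
ax[1, 1].axhline(0)
plt.tight_layout()
plt.show()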
Example
• A materials engineer at a furniture manufacturing site wants to assess
the stiffness of the particle board that the manufacturer uses. The
engineer measures the stiffness and the density of a sample of
particle board pieces.
• The engineer uses simple regression to determine whether the
density of the particles is associated with the stiffness of the board
• Choose Stat > Regression >
Fitted Line Plot.
• In Response, enter Stiffness.
• In Predictor, enter Density.
• Click Options. Under Display
Options, select Display
confidence interval and Display
prediction interval. Click OK.
• Click Graphs. Under Residual
Plots, select Four in one.
• Click OK in each dialog.
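A minimal Python sketch of the same kind of simple regression, with illustrative density and stiffness values standing in for the Minitab sample data:

import numpy as np
import statsmodels.api as sm

density = np.array([9.5, 8.4, 22.7, 17.4, 26.9, 5.8, 21.3, 14.0])       # illustrative only
stiffness = np.array([14.8, 17.4, 74.8, 43.0, 88.0, 11.8, 64.9, 45.8])  # illustrative only

fit = sm.OLS(stiffness, sm.add_constant(density)).fit()
print(fit.params, fit.rsquared, fit.f_pvalue)  # intercept/slope, R², overall regression p-value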
Results

Interpretation of the p-value
The p-value for the regression model is 0.000, which means that the actual p-value is less than 0.0005. Because the p-value is less than the significance level of 0.05, the engineer can conclude that the association between stiffness and density is statistically significant.

Interpretation of the regression equation
In these results, the coefficient for the predictor, Density, is 3.541. The average stiffness of the particle board increases by approximately 3.5 for every 1-unit increase in density. The sign of the coefficient is positive, which indicates that as density increases, stiffness also increases.

Interpretation of R-sq
In these results, the density of the particle board explains 84.5% of the variation in the stiffness of the boards. The R² value indicates that the model fits the data well. (Note: the higher the R-sq or R-sq(adj), the better the model fit.)
Residual Plots for Stiffness (four-in-one: normal probability plot, residuals versus fits, histogram, residuals versus order)

Versus fits interpretation: In this residuals versus fits plot, the points appear randomly scattered on the plot. However, the point in the upper right corner appears to be an outlier.

Versus order interpretation: In this residuals versus order plot, the outlier that is also visible on the other residual plots appears to correspond to the observation in row 21 of the worksheet.

Normal probability plot interpretation: In this normal probability plot, the residuals generally appear to follow a straight line. However, the point in the upper right corner of the plot is far away from the line and appears to be an outlier, which was also visible on the other residual plots.
Multiple Regression Model
Overview for Fit Regression Model
• Use Fit Regression Model to describe the relationship between a set of predictors and a continuous response using the ordinary least squares method. After you perform the analysis, you can:
• Predict the response for new observations.
• Plot the relationships among the variables.
• Find values that optimize one or more responses.
• To fit a regression model, choose Stat > Regression > Regression > Fit
Regression Model.
Assumptions
• The predictors can be continuous or categorical. (If you only want to plot the relationship between one continuous predictor and a continuous response, use Fitted Line Plot instead.)
• The response variable should be continuous
• Collect data using best practices
• The correlation among the predictors, also known as multicollinearity,
should not be severe
• The model should provide a good fit to the data
Example
• A research chemist wants to understand how several predictors are
associated with the wrinkle resistance of cotton cloth. The chemist
examines 32 pieces of cotton cellulose produced at different settings
of curing time, curing temperature, formaldehyde concentration, and
catalyst ratio. The durable press rating, a measure of wrinkle
resistance, is recorded for each piece of cotton.
• The chemist performs a multiple regression analysis to fit a model
with the predictors and eliminate the predictors that do not have a
statistically significant relationship with the response
• Choose Stat > Regression > Regression >
Fit Regression Model.
• In Responses, enter Rating.
• In Continuous predictors, enter Conc
Ratio Temp Time.
• Click Graphs.
• Under Residuals plots, choose Four in
one.
• In Residuals versus the variables, enter
Conc Ratio Temp Time.
• Click OK in each dialog box.
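A minimal sketch of the same model with statsmodels, assuming the 32 observations sit in a pandas DataFrame named df with columns Rating, Conc, Ratio, Temp, and Time:

import statsmodels.formula.api as smf

fit = smf.ols("Rating ~ Conc + Ratio + Temp + Time", data=df).fit()
print(fit.summary())  # coefficients, p-values, R-sq, R-sq(adj)
print(fit.pvalues)    # predictors with p-values above 0.05 (e.g. Time) are candidates to drop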
Results Interpretation
The predictors temperature, catalyst ratio, and
formaldehyde concentration have p-values that are less
than the significance level of 0.05. These results indicate
that these predictors have a statistically significant effect
on wrinkle resistance. The p-value for time is greater than
0.05, which indicates that there is not enough evidence to
conclude that time is related to the response. The chemist
may want to refit the model without this predictor.

Interpretation
In these results, the model explains approximately 73% of
the variation in the response.

Interpretation for VIF (Variance Inflation Factor)
In these results, the variance inflation factors are less than 10; therefore, there is no severe multicollinearity in the model.

There are some guidelines we can use to determine whether our VIFs (variance inflation factors) are in an acceptable range. A rule of thumb commonly used in practice is that a VIF of less than 10 is acceptable.

Note: multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response.
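A minimal sketch of the VIF check outside Minitab, assuming the same hypothetical DataFrame df as in the sketch above:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["Conc", "Ratio", "Temp", "Time"]])
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))  # rule of thumb: VIF < 10 is acceptable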
Residual Plots for Rating (four-in-one: normal probability plot, residuals versus fits, histogram, residuals versus order)

Versus fits interpretation: In this residuals versus fits plot, the points do not appear to be randomly distributed about zero. There appear to be clusters of points that could represent different groups in the data. You should investigate the groups to determine their cause.

Versus order interpretation: In this residuals versus order plot, the residuals do not appear to be randomly distributed about zero. The residuals appear to systematically decrease as the observation order increases. You should investigate the trend to determine the cause.

Normal probability plot interpretation: In this normal probability plot, the points generally follow a straight line. There is no evidence of nonnormality, outliers, or unidentified variables.
Correlation
• Use Correlation to measure the strength and direction of the association
between two variables. Minitab offers two methods of correlation: the
Pearson product moment correlation and the Spearman rank order
correlation. 
• The Pearson correlation (also known as r), which is the most common
method, measures the linear relationship between two continuous
variables.
• If you are not certain whether your variables are linearly related, you
should create a scatter plot. If the relationship between the variables is
not linear, you may be able to use the Spearman rank order correlation
(also known as Spearman's rho). The Spearman correlation measures the
monotonic relationship between two continuous or ordinal variables
Assumptions
• The data should be continuous or ordinal
• If you have categorical data, you should perform Cross Tabulation and Chi-
Square to examine the association between variables.
• The relationship between variables should be linear or monotonic
• If your variables do not have a linear or monotonic relationship, the results
from the correlation analysis will not accurately reflect the strength of the
relationship.
• Unusual values can have a strong effect on the results
• Because unusual values can have a strong effect on the results, use
Scatterplot or Fitted Line Plot to identify these values.
When to use Pearson's r and Spearman's rho?
• The Pearson correlation coefficient is used when the data are normally distributed.
• The Spearman correlation coefficient is used when the data are not normally distributed.
• Check the normality of the data with a test of normality, such as the Anderson-Darling test, to verify whether the data are normally distributed.
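A minimal sketch of that workflow in Python (hypothetical data): check normality with the Anderson-Darling test, then compute both coefficients and report the appropriate one.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(10, 2, 30)
y = 0.8 * x + rng.normal(0, 1, 30)

ad = stats.anderson(x, dist="norm")      # Anderson-Darling normality test
print(ad.statistic, ad.critical_values)  # statistic below the 5% critical value suggests normality

print(stats.pearsonr(x, y))   # use when both variables look normally distributed and linear
print(stats.spearmanr(x, y))  # rank-based alternative when they do not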
Examine the linear relationship between variables
(Pearson)
• Use the Pearson correlation coefficient to examine the strength and direction of the linear
relationship between two continuous variables.
• Strength
• The correlation coefficient can range in value from −1 to +1. The larger the absolute
value of the coefficient, the stronger the relationship between the variables.
• For the Pearson correlation, an absolute value of 1 indicates a perfect linear relationship.
A correlation close to 0 indicates no linear relationship between the variables.
• Direction
• The sign of the coefficient indicates the direction of the relationship. If both variables
tend to increase or decrease together, the coefficient is positive, and the line that
represents the correlation slopes upward. If one variable tends to increase as the other
decreases, the coefficient is negative, and the line that represents the correlation slopes
downward.
The following plots show data with specific correlation values to illustrate different
patterns in the strength and direction of the relationships between variables

(Example scatterplots: no relationship, moderate positive relationship, large positive relationship, large negative relationship.)

Large negative relationship: the points fall close to the line, which indicates that there is a strong negative relationship between the variables. The relationship is negative because, as one variable increases, the other variable decreases.
Examine the monotonic relationship between
variables (Spearman)
• Strength
• The correlation coefficient can range in value from −1 to +1. The larger the
absolute value of the coefficient, the stronger the relationship between the
variables.
• Direction
• The sign of the coefficient indicates the direction of the relationship. If both
variables tend to increase or decrease together, the coefficient is positive,
and the line that represents the correlation slopes upward. If one variable
tends to increase as the other decreases, the coefficient is negative, and the
line that represents the correlation slopes downward.
The following plots show data with specific Spearman correlation coefficient values to illustrate
different patterns in the strength and direction of the relationships between variables

(Example plots: no relationship, strong positive relationship, strong negative relationship.)
Example
• An engineer at an aluminum castings plant assesses the relationship
between the hydrogen content and the porosity of aluminum alloy
castings. The engineer collects a random sample of 14 castings and
measures the following properties of each casting: hydrogen content,
porosity, and strength.
• The engineer uses the Pearson correlation to examine the strength
and direction of the linear relationship between each pair of
variables.
• Choose Stat > Basic Statistics >
Correlation.
• In Variables, enter Hydrogen Porosity
Strength.
• Click OK.
Results and interpretation
The Pearson correlation coefficient between
hydrogen content and porosity is 0.625 and
represents a positive relationship between the
variables. As hydrogen increases, porosity also
increases. The p-value is 0.017, which is less than
the significance level of 0.05. The p-value indicates
that the correlation is significant.

The Pearson correlation coefficient between hydrogen content and strength is −0.790 and the p-
value is 0.001. The p-value is less than the
significance level of 0.05, which indicates that the
correlation is significant. As hydrogen content
increases, strength tends to decrease. The Pearson
correlation coefficient between porosity and
strength is −0.527 and the p-value is 0.053. The p-
value is close to the significance level of 0.05, which
provides inconclusive evidence for the association
between porosity and strength.
Results and interpretation
In these results, the Spearman
correlation between porosity and
hydrogen is 0.590, which indicates that
there is a positive relationship between
the variables. The Spearman correlation
between strength and hydrogen is -0.859
and between strength and porosity is -
0.675. The relationship between these
variables is negative, which indicates
that as hydrogen and porosity increase,
strength decreases.
Test of Normality Results
Normal probability plots with Anderson-Darling statistics for the three variables:

Variable   Mean     StDev     N    AD      P-Value
Hydrogen   0.2321   0.02694   14   0.218   0.801
Porosity   0.5471   0.2124    14   0.347   0.427
Strength   0.5998   0.3311    14   0.544   0.132

Since each p-value is greater than 0.05, the results show that the data are normally distributed.
Activity 4
Problem 1
• The rotations per minute (RPM) is critical to the quality of a
wind generator. Several components affect the RPM of a
particular generator. Among them, the weight of the fans, the
speed of the wind, and the pressure. After having designed the
Conakry model of a wind generator, the reliability engineer
wants to build a model that will show how the “Rotation”
variable relates to the “Wind,” “Pressure,” and “Weight”
variables.
a. Show that “Wind” and “Pressure” are highly correlated.
b. Show that “Rotation” is highly dependent on the input
factors.
c. Show that only “Weight” is significant in the equation.
d. Show that the VIF is too high for “Wind” and “Pressure.”
e. Interpret the probability plot for the residuals.
Problem 2
• Organophosphate (OP) compounds are used as
pesticides. However, it is important to study their
effect on species that are exposed to them. In the
laboratory study Some Effects of Organophosphate
Pesticides on Wildlife Species, by the Department of
Fisheries and Wildlife at Virginia Tech, an experiment
was conducted in which different dosages of a
particular OP pesticide were administered to 5
groups of 5 mice (Peromyscus leucopus). The 25 mice
were females of similar age and condition. One
group received no chemical. The basic response y
was a measure of activity in the brain. It was
postulated that brain activity would decrease with an
increase in OP dosage. The data are as follows:
• Determine the regression model and
interpret
• Construct an analysis-of-variance table and
interpret.
• Interpret the residual plots, R-sq.
• Test the correlation of the two variables
Problem 3
• The Statistics Consulting Center at Virginia
Tech analyzed data on normal
woodchucks for the Department of
Veterinary Medicine. The variables of
interest were body weight in grams and
heart weight in grams. It was desired to
• develop a linear regression equation in
order to determine if there is a significant
linear relationship between heart weight
and total body weight.
• Test the correlation of two variables
• Interpret the results
Problem 4
• An experiment was conducted to study the size of squid
eaten by sharks and tuna. The regressor variables are
characteristics of the beaks of the squid. The data are given
as follows:
• In the study, the regressor variables and response
considered are
x1 = rostral length, in inches,
x2 = wing length, in inches,
x3 = rostral to notch length, in inches,
x4 = notch to wing length, in inches,
x5 = width, in inches,
y = weight, in pounds.

Determine and interpret the following:
• Regression equation, SE coefficient, R-sq, VIF, residual plot values
