Module 3 - Regression and Correlation Analysis
Correlation Analysis
Regression Analysis
• A regression analysis generates an equation to describe the statistical
relationship between one or more predictors and the response
variable and to predict new observations. Linear regression usually
uses the ordinary least squares estimation method which derives the
equation by minimizing the sum of the squared residuals.
• Example: You work for a potato chip company that is analyzing
factors that affect the percentage of crumbled potato chips per
container before shipping (the response variable, y). You conduct
the regression analysis and include the percentage of potato relative
to other ingredients and the cooking temperature (Celsius) as your
two predictors (x).
What is simple linear regression?
• Simple linear regression examines the linear relationship between
two continuous variables: one response (y) and one predictor (x).
When the two variables are related, it is possible to predict a
response value from a predictor value with better than chance
accuracy.
• Regression provides the line that "best" fits the data. This line can
then be used to:
Examine how the response variable changes as the predictor variable
changes.
Predict the value of the response variable (y) for any value of the predictor variable (x).
What is multiple linear regression?
• Multiple linear regression examines the linear relationships
between one continuous response and two or more
predictors.
• If the number of predictors is large, then before fitting a
regression model with all the predictors, you should use
stepwise or best subsets model-selection techniques to
screen out predictors that are not associated with the response.
What is ordinary least squares regression?
• In ordinary least squares (OLS) regression, the estimated equation is
calculated by determining the equation that minimizes the sum of the
squared distances between the sample's data points and the values
predicted by the equation.
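The least squares idea can be sketched in a few lines of Python. This is a minimal illustration for one predictor using the standard closed-form formulas. The function names and the sample numbers are hypothetical and are not taken from the module's data.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between the data and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_ols(x, y)
sse = sum_squared_residuals(x, y, b0, b1)
print(f"y = {b0:.2f} + {b1:.2f} x, SSE = {sse:.3f}")
```

Any other line through the same data gives a larger sum of squared residuals, which is what "best fit" means in OLS.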
[Figure: example residual patterns labeled Trend, Cycle, and Shift]
Normal probability plot
• Use the normal probability plot of the residuals
to verify the assumption that the residuals are
normally distributed. The normal probability
plot of the residuals should approximately
follow a straight line.
The patterns in the following table may indicate
that the model does not meet the model
assumptions.
Example
• A materials engineer at a furniture manufacturing site wants to assess
the stiffness of the particle board that the manufacturer uses. The
engineer measures the stiffness and the density of a sample of
particle board pieces.
• The engineer uses simple regression to determine whether the
density of the particles is associated with the stiffness of the board.
• Choose Stat > Regression > Fitted Line Plot.
• In Response, enter Stiffness.
• In Predictor, enter Density.
• Click Options. Under Display Options, select Display confidence interval and Display prediction interval. Click OK.
• Click Graphs. Under Residual Plots, select Four in one.
• Click OK in each dialog.
Results: interpret the p-value
[Minitab output: Regression equation]
The p-value for the regression model is 0.000,
which means that the actual p-value is less than
0.0005. Because the p-value is less than the
significance level of 0.05, the engineer can
conclude that the association between stiffness
and density is statistically significant.
Interpretation of R-sq
In these results, the density of the particle board
explains 84.5% of the variation in the stiffness of
the boards. The R-sq value indicates that the model
fits the data well.
Note: the higher the R-sq or R-sq(adj), the better the model fit.
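As a rough sketch of how R-sq and R-sq(adj) are computed, the Python below uses the usual definitions: R-sq is one minus the ratio of the error sum of squares to the total sum of squares, and R-sq(adj) penalizes the number of predictors. The function names and the numbers are hypothetical, and this is not the stiffness data set.

```python
def r_squared(y, fitted):
    """Share of the variation in y explained by the fitted values."""
    y_bar = sum(y) / len(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1 - ss_error / ss_total


def adjusted_r_squared(r_sq, n, p):
    """Adjust R-sq for n observations and p predictors."""
    return 1 - (1 - r_sq) * (n - 1) / (n - p - 1)


# Hypothetical observed and fitted values, for illustration only
y = [2.0, 4.0, 6.0, 8.0]
fitted = [2.5, 3.5, 6.5, 7.5]

r_sq = r_squared(y, fitted)
r_sq_adj = adjusted_r_squared(r_sq, n=len(y), p=1)
print(f"R-sq = {r_sq:.1%}, R-sq(adj) = {r_sq_adj:.1%}")
```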
[Figure: Residual Plots for Stiffness (four in one): Normal Probability Plot, Versus Fits, Histogram, Versus Order]
Versus Fits Interpretation
In this residuals versus fits plot, the points appear randomly scattered on the plot. However, the point in the upper right corner appears to be an outlier.
Versus Order Interpretation
In this residuals versus order plot, the outlier that is also visible on the other residual plots appears to correspond to the observation in row 21 of the worksheet.
Normal Probability Plot Interpretation
In this normal probability plot, the residuals generally appear to follow a straight line. However, the point in the upper right corner of the plot is far away from the line and appears to be an outlier, which was also visible on the other residual plots.
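To see what a normal probability plot of residuals actually plots, here is a small Python sketch that pairs each sorted residual with a theoretical standard normal quantile. The plotting position (i - 0.5)/n is an assumption chosen for simplicity, so Minitab may use a different one. The residual values are hypothetical, with one large value added to mimic an outlier.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Pair each sorted residual with its theoretical normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()
    # Assumed plotting position: (i - 0.5) / n for the i-th smallest value
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


# Hypothetical residuals; the last value plays the role of an outlier
residuals = [-4.2, -2.9, -1.5, -0.3, 0.4, 1.1, 2.0, 3.3, 21.0]

points = normal_plot_points(residuals)
for quantile, residual in points:
    print(f"{quantile:6.2f}  {residual:6.1f}")
```

If the residuals are roughly normal, the printed pairs fall close to a straight line when plotted. A point such as 21.0 sits far above that line, in the upper right corner.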
Multiple Regression Model
Overview for Fit Regression Model
• Use Fit Regression Model to describe the relationship
between a set of predictors and a continuous response
using the ordinary least squares method. After you perform
the analysis, you can:
• Predict the response for new observations.
• Plot the relationships among the variables.
• Find values that optimize one or more responses.
• To fit a regression model, choose Stat > Regression > Regression > Fit
Regression Model.
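Outside Minitab, the same kind of fit can be sketched with NumPy. This is a minimal example that assumes NumPy is installed. The helper names `fit_ols` and `predict` and the data are hypothetical. It solves the least squares problem for several predictors and then predicts the response for a new observation.

```python
import numpy as np


def fit_ols(X, y):
    """Fit y = b0 + b1*x1 + ... + bp*xp by least squares.

    X holds one row per observation and one column per predictor.
    Returns the coefficients as [b0, b1, ..., bp].
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # add intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X_new):
    """Predict the response for new rows of predictor values."""
    X_new = np.asarray(X_new, dtype=float)
    return coef[0] + X_new @ coef[1:]


# Hypothetical data generated from y = 1 + 2*x1 - 3*x2 (no noise)
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]
y = [-3, 2, -5, 0, -7]

coef = fit_ols(X, y)
print("coefficients:", np.round(coef, 3))
print("prediction for x1=6, x2=5:", predict(coef, [[6, 5]]))
```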
Assumptions
• The predictors can be continuous or categorical. If you want to plot
the relationship between one continuous (numeric) predictor and a
continuous response, use Fitted Line Plot instead.
• The response variable should be continuous
• Collect data using best practices
• The correlation among the predictors, also known as multicollinearity,
should not be severe
• The model should provide a good fit to the data
Example
• A research chemist wants to understand how several predictors are
associated with the wrinkle resistance of cotton cloth. The chemist
examines 32 pieces of cotton cellulose produced at different settings
of curing time, curing temperature, formaldehyde concentration, and
catalyst ratio. The durable press rating, a measure of wrinkle
resistance, is recorded for each piece of cotton.
• The chemist performs a multiple regression analysis to fit a model
with the predictors and eliminate the predictors that do not have a
statistically significant relationship with the response.
• Choose Stat > Regression > Regression > Fit Regression Model.
• In Responses, enter Rating.
• In Continuous predictors, enter Conc Ratio Temp Time.
• Click Graphs.
• Under Residual plots, choose Four in one.
• In Residuals versus the variables, enter Conc Ratio Temp Time.
• Click OK in each dialog box.
Results: Interpretation
The predictors temperature, catalyst ratio, and
formaldehyde concentration have p-values that are less
than the significance level of 0.05. These results indicate
that these predictors have a statistically significant effect
on wrinkle resistance. The p-value for time is greater than
0.05, which indicates that there is not enough evidence to
conclude that time is related to the response. The chemist
may want to refit the model without this predictor.
Interpretation
In these results, the model explains approximately 73% of
the variation in the response.
Some guidelines help determine whether the VIFs (variance inflation
factors) are in an acceptable range. A rule of thumb commonly used in
practice is that a VIF below 10 is acceptable.
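A VIF can be computed by regressing each predictor on the other predictors and taking 1 / (1 - R-sq) of that auxiliary regression. The Python below is a minimal sketch of that definition and assumes NumPy is installed. The function name and the data are hypothetical, and the cutoff of 10 is the rule of thumb stated above.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        # Auxiliary regression of predictor j on the remaining predictors
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r_sq = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1 / (1 - r_sq))
    return np.array(vifs)


# Hypothetical predictor values, for illustration only
X = [[1, 2, 5], [2, 1, 3], [3, 4, 4], [4, 3, 1], [5, 6, 2]]

for name, vif in zip(["x1", "x2", "x3"], variance_inflation_factors(X)):
    status = "acceptable" if vif < 10 else "too high"
    print(f"{name}: VIF = {vif:.2f} ({status})")
```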
[Figure: residual plots (four in one): Normal Probability Plot, Versus Fits, Histogram, Versus Order]
... groups in the data. You should investigate the groups to determine their cause.
Versus Order
In this residuals versus order plot, the residuals do not appear to be randomly distributed about zero. The residuals appear to systematically decrease as the observation order increases. You should investigate the trend to determine the cause.
Normal Probability Plot
In this normal probability plot, the points generally follow a straight line. There is no evidence of nonnormality, outliers, or unidentified variables.
Correlation
• Use Correlation to measure the strength and direction of the association
between two variables. Minitab offers two methods of correlation: the
Pearson product moment correlation and the Spearman rank order
correlation.
• The Pearson correlation (also known as r), which is the most common
method, measures the linear relationship between two continuous
variables.
• If you are not certain whether your variables are linearly related, you
should create a scatter plot. If the relationship between the variables is
not linear, you may be able to use the Spearman rank order correlation
(also known as Spearman's rho). The Spearman correlation measures the
monotonic relationship between two continuous or ordinal variables.
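The difference between the two methods can be illustrated with a short Python sketch. Here the Spearman coefficient is computed as the Pearson coefficient of the ranks, with tied values given the average of their ranks. The function names and the data are hypothetical. The data follow a monotonic but nonlinear pattern (y = x cubed).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product moment correlation between x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank order correlation: Pearson r of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical monotonic, nonlinear data
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because y always increases with x, Spearman's rho is 1, while Pearson's r is below 1 because the relationship is not a straight line.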
Assumptions
• The data should be continuous or ordinal
• If you have categorical data, you should perform Cross Tabulation and Chi-
Square to examine the association between variables.
• The relationship between variables should be linear or monotonic
• If your variables do not have a linear or monotonic relationship, the results
from the correlation analysis will not accurately reflect the strength of the
relationship.
• Unusual values can have a strong effect on the results
• Because unusual values can have a strong effect on the results, use
Scatterplot or Fitted Line Plot to identify these values.
When to use Pearson's r and Spearman's rho?
• Use the Pearson correlation coefficient when the data are normally
distributed.
• Use the Spearman correlation coefficient when the data are not
normally distributed.
• To check whether the data are normally distributed, run a normality
test (Test of Normality), such as the Anderson-Darling test.
Examine the linear relationship between variables
(Pearson)
• Use the Pearson correlation coefficient to examine the strength and direction of the linear
relationship between two continuous variables.
• Strength
• The correlation coefficient can range in value from −1 to +1. The larger the absolute
value of the coefficient, the stronger the relationship between the variables.
• For the Pearson correlation, an absolute value of 1 indicates a perfect linear relationship.
A correlation close to 0 indicates no linear relationship between the variables.
• Direction
• The sign of the coefficient indicates the direction of the relationship. If both variables
tend to increase or decrease together, the coefficient is positive, and the line that
represents the correlation slopes upward. If one variable tends to increase as the other
decreases, the coefficient is negative, and the line that represents the correlation slopes
downward.
The following plots show data with specific correlation values to illustrate different
patterns in the strength and direction of the relationships between variables.
[Figure: scatterplots with their Pearson r values, labeled No relationship, Moderate positive relationship, Large (strong) positive relationship, and Strong negative relationship]
Example
• An engineer at an aluminum castings plant assesses the relationship
between the hydrogen content and the porosity of aluminum alloy
castings. The engineer collects a random sample of 14 castings and
measures the following properties of each casting: hydrogen content,
porosity, and strength.
• The engineer uses the Pearson correlation to examine the strength
and direction of the linear relationship between each pair of
variables.
• Choose Stat > Basic Statistics > Correlation.
• In Variables, enter Hydrogen Porosity Strength.
• Click OK.
Results: Interpretation
The Pearson correlation coefficient between
hydrogen content and porosity is 0.625 and
represents a positive relationship between the
variables. As hydrogen increases, porosity also
increases. The p-value is 0.017, which is less than
the significance level of 0.05. The p-value indicates
that the correlation is statistically significant.
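The significance of a Pearson correlation is commonly checked with a t statistic on n - 2 degrees of freedom. The Python sketch below applies that standard formula to the reported values (r = 0.625, n = 14). Whether Minitab computes its p-value in exactly this way is an assumption, and the critical value 2.179 is the two-sided 5% value for 12 degrees of freedom from a standard t table.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation is zero."""
    return r * sqrt((n - 2) / (1 - r ** 2))


r = 0.625  # Pearson correlation between hydrogen content and porosity
n = 14     # number of castings in the sample

t_stat = correlation_t_statistic(r, n)

# Two-sided 5% critical value of t with n - 2 = 12 degrees of freedom,
# taken from a standard t table
T_CRITICAL_DF12 = 2.179
significant = abs(t_stat) > T_CRITICAL_DF12

print(f"t = {t_stat:.3f}, significant at 0.05: {significant}")
```

The t statistic exceeds the critical value, which is consistent with the reported p-value of 0.017 being below 0.05.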
[Figure: normal probability plots of Hydrogen and Porosity]
[Figure: Probability Plot of Strength, Normal]
Mean 0.5998, StDev 0.3311, N 14, AD 0.544, P-Value 0.132
Since the p-value is greater than 0.05, there is not enough evidence to conclude that the data depart from a normal distribution, so the data are treated as normally distributed.
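The link between the AD statistic and its p-value can be sketched in Python. The statistic follows the usual Anderson-Darling formula with the mean and standard deviation estimated from the sample. The p-value uses the piecewise approximation commonly attributed to D'Agostino and Stephens. Whether Minitab uses exactly this approximation is an assumption, but it reproduces the reported p-value for AD = 0.544 and N = 14. The sample values are hypothetical.

```python
from math import exp, log
from statistics import NormalDist, mean, stdev


def anderson_darling_statistic(sample):
    """Anderson-Darling statistic for normality (parameters estimated)."""
    n = len(sample)
    mu, sigma = mean(sample), stdev(sample)
    z = sorted((value - mu) / sigma for value in sample)
    cdf = [NormalDist().cdf(value) for value in z]
    total = sum(
        (2 * i - 1) * (log(cdf[i - 1]) + log(1 - cdf[n - i]))
        for i in range(1, n + 1)
    )
    return -n - total / n


def anderson_darling_p_value(a_sq, n):
    """Approximate p-value from the AD statistic and the sample size."""
    a = a_sq * (1 + 0.75 / n + 2.25 / n ** 2)  # small-sample adjustment
    if a >= 0.6:
        return exp(1.2937 - 5.709 * a + 0.0186 * a * a)
    if a > 0.34:
        return exp(0.9177 - 4.279 * a - 1.38 * a * a)
    if a > 0.2:
        return 1 - exp(-8.318 + 42.796 * a - 59.938 * a * a)
    return 1 - exp(-13.436 + 101.14 * a - 223.73 * a * a)


# Values reported on the Strength probability plot
p_strength = anderson_darling_p_value(0.544, 14)
print(f"p-value for AD = 0.544, N = 14: {p_strength:.3f}")

# Hypothetical sample, for illustration only
sample = [0.21, 0.35, 0.42, 0.50, 0.58, 0.61, 0.70, 0.84, 0.97, 1.10]
a_sq = anderson_darling_statistic(sample)
print(f"AD = {a_sq:.3f}, p = {anderson_darling_p_value(a_sq, len(sample)):.3f}")
```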
Activity 4
Problem 1
• The number of rotations per minute (RPM) is critical to the quality
of a wind generator. Several components affect the RPM of a
particular generator, among them the weight of the fans, the
speed of the wind, and the pressure. After designing the
Conakry model of a wind generator, the reliability engineer
wants to build a model that shows how the “Rotation”
variable relates to the “Wind,” “Pressure,” and “Weight”
variables.
a. Show that “Wind” and “Pressure” are highly correlated.
b. Show that “Rotation” is highly dependent on the input
factors.
c. Show that only “Weight” is significant in the equation.
d. Show that the VIF is too high for “Wind” and “Pressure.”
e. Interpret the probability plot for the residuals.
Problem 2
• Organophosphate (OP) compounds are used as
pesticides. However, it is important to study their
effect on species that are exposed to them. In the
laboratory study Some Effects of Organophosphate
Pesticides on Wildlife Species, by the Department of
Fisheries and Wildlife at Virginia Tech, an experiment
was conducted in which different dosages of a
particular OP pesticide were administered to 5
groups of 5 mice (Peromyscus leucopus). The 25 mice
were females of similar age and condition. One
group received no chemical. The basic response y
was a measure of activity in the brain. It was
postulated that brain activity would decrease with an
increase in OP dosage. The data are as follows:
• Determine the regression model and
interpret
• Construct an analysis-of-variance table and
interpret.
• Interpret the residual plots, R-sq.
• Test the correlation of the two variables
Problem 3
• The Statistics Consulting Center at Virginia
Tech analyzed data on normal
woodchucks for the Department of
Veterinary Medicine. The variables of
interest were body weight in grams and
heart weight in grams. It was desired to
develop a linear regression equation in
order to determine if there is a significant
linear relationship between heart weight
and total body weight.
• Test the correlation of the two variables.
• Interpret the results.
Problem 4
• An experiment was conducted to study the size of squid
eaten by sharks and tuna. The regressor variables are
characteristics of the beaks of the squid. The data are given
as follows:
• In the study, the regressor variables and response
considered are
x1 = rostral length, in inches,
x2 = wing length, in inches,
x3 = rostral to notch length, in inches,
x4 = notch to wing length, in inches,
x5 = width, in inches,
y = weight, in pounds.