Correlation and Regression
Chapter 12
What are Correlation and Regression?
Correlation and regression are statistical tools used to describe the relationship
between two variables. For example, if a person drives an expensive car, we tend to
assume that she is financially well off. Correlation and regression are used to
quantify such a relationship numerically.
Correlation
What is a correlation?
A correlation reflects the strength and/or direction of the association between two or
more variables. Correlation can be defined as a measurement that is used to quantify the
relationship between variables.
A regression model is a mathematical equation that describes the relationship between two or
more variables. A simple regression model includes only two variables: one independent and one
dependent. The dependent variable is the one being explained, and the independent variable is the
one used to explain the variation in the dependent variable.
A (simple) regression model that gives a straight-line relationship between two variables is called
a linear regression model.
The two diagrams in Figure 13.1 show a linear and a nonlinear relationship between the
dependent variable food expenditure and the independent variable income. A linear relationship
between income and food expenditure, shown in Figure 13.1a, indicates that as income increases,
the food expenditure always increases at a constant rate. A nonlinear relationship between income
and food expenditure, as depicted in Figure 13.1b, shows that as income increases, the food
expenditure increases, although, after a point, the rate of increase in food expenditure is lower for
every subsequent increase in income.
Figure 13.1 Relationship between food expenditure and income. (a) Linear
relationship. (b) Nonlinear relationship.
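The difference between the two panels can be illustrated with a small Python sketch. Both models and their coefficients are hypothetical and chosen only for illustration; they are not the curves plotted in Figure 13.1. The sketch compares how much food expenditure rises for each equal step in income.

```python
import math

def linear_food_expenditure(income):
    # Hypothetical straight-line model: a fixed increase per unit of income.
    return 10 + 0.25 * income

def nonlinear_food_expenditure(income):
    # Hypothetical concave model: expenditure rises, but ever more slowly.
    return 4 * math.sqrt(income)

def increments(model, incomes):
    """Change in the model's output between consecutive income levels."""
    values = [model(x) for x in incomes]
    return [later - earlier for earlier, later in zip(values, values[1:])]

incomes = [20, 40, 60, 80, 100]  # equal steps of 20 (arbitrary units)
linear_steps = increments(linear_food_expenditure, incomes)
nonlinear_steps = increments(nonlinear_food_expenditure, incomes)

print("linear steps:   ", linear_steps)
print("nonlinear steps:", [round(step, 2) for step in nonlinear_steps])
```

Under these assumptions every linear step is the same size, while each nonlinear step is positive but smaller than the one before it.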
Correlation and Regression Analysis
Both correlation and regression analysis are done to quantify the strength of the
relationship between two variables by using numbers. Graphically, correlation and
regression analysis can be visualized using scatter plots.
Regression analysis is used to determine the relationship between two variables so that
the value of the unknown variable can be estimated from the known variable. The goal of
linear regression is to find the best-fitting line through the data points. For two
variables, x and y, the regression analysis can be visualized as follows:
Correlation analysis is most commonly carried out using Pearson's correlation
coefficient, and regression analysis using the method of least squares. The
correlation and regression formulas are given below:
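In their standard textbook form, Pearson's correlation coefficient r and the least squares regression line can be written as follows. The notation is an assumption here: SS denotes a sum of squares, n is the number of observations, and a and b are the sample intercept and slope.

```latex
\[
r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}, \qquad
\hat{y} = a + bx, \quad b = \frac{SS_{xy}}{SS_{xx}}, \quad a = \bar{y} - b\,\bar{x}
\]
\[
SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}, \quad
SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}, \quad
SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}
\]
```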
Difference between Correlation and Regression
Correlation and regression are both used as statistical measurements to get a good understanding
of the relationship between variables. If the correlation coefficient is negative (or positive) then
the slope of the regression line will also be negative (or positive). The table given below
highlights the key difference between correlation and regression.
Correlation                                 Regression
------------------------------------------  ------------------------------------------------
Correlation is used to determine whether    Regression is used to numerically describe how a
variables are related or not.               dependent variable changes with a change in an
                                            independent variable.

Correlation tries to establish a linear     It finds the best-fitted regression line to
relationship between variables.             estimate an unknown variable on the basis of the
                                            known variable.

The variables can be used                   The variables cannot be interchanged.
interchangeably.

Correlation uses a signed numerical value   Regression is used to show the impact of a unit
to estimate the strength of the             change in the independent variable on the
relationship between the variables.         dependent variable.

Pearson's coefficient is the best measure   The least-squares method is the best technique
of correlation.                             to determine the regression line.
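The contrasts in the table can be checked numerically. The following is a minimal Python sketch, not taken from this chapter: the function names and both data sets are hypothetical. It computes Pearson's r and the least squares slope from the same sums of squares, so the two share a sign, and it shows that swapping the variables leaves r unchanged but changes the slope.

```python
import math

def sums_of_squares(x, y):
    """Return (SSxy, SSxx, SSyy) for paired samples x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_yy = sum((yi - mean_y) ** 2 for yi in y)
    return ss_xy, ss_xx, ss_yy

def pearson_r(x, y):
    """Pearson's correlation coefficient: SSxy / sqrt(SSxx * SSyy)."""
    ss_xy, ss_xx, ss_yy = sums_of_squares(x, y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

def slope(x, y):
    """Least squares slope of y regressed on x: SSxy / SSxx."""
    ss_xy, ss_xx, _ = sums_of_squares(x, y)
    return ss_xy / ss_xx

# Hypothetical data: premium falls as driving experience rises.
experience = [1, 3, 5, 8, 12]
premium = [90, 75, 70, 55, 40]

# Hypothetical data: test score rises with hours studied.
hours = [1, 2, 3, 4, 5]
score = [52, 58, 61, 70, 74]

for x, y in [(experience, premium), (hours, score)]:
    print(f"r = {pearson_r(x, y):+.3f}, slope = {slope(x, y):+.3f}")

# Correlation is symmetric in x and y; the regression slope is not.
print(pearson_r(hours, score), pearson_r(score, hours))
print(slope(hours, score), slope(score, hours))
```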
Example: The relationship between age and height can be approximated using a linear
regression model. During childhood, a person's height tends to increase as age
increases, so over that range the two variables have a roughly linear relationship.
Regression models are commonly used as statistical evidence for claims regarding
everyday facts.
What are the different types of regression models?
There are three different types of regression models:
1. Linear
2. Non-linear
3. Multiple
Let’s look at them in detail:
Linear regression model
A linear regression model is used to depict a relationship between variables that are
proportional to each other, meaning that the dependent variable increases or decreases
at a steady rate with the independent variable.
In the graphical representation, a straight line is plotted through the data points.
Even if the points do not lie exactly on a straight line (which is almost always the
case), we can still see a pattern and make sense of it.
For example, as the age of a person increases, the level of glucose in their body
increases as well.
Non-linear regression model
To decide whether a non-linear regression model is the best fit for your scenario, look
closely at your variables and their patterns. If the response variable does not change
at a constant rate as the input variable changes, a non-linear model may suit your
problem.
For example, a patient’s response to treatment can be good or bad depending on their
body’s tendency and willpower.
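Multiple regression model
A multiple regression model uses more than one independent variable to explain the
dependent variable.
For example, the chances of a student failing a test can depend on several input
variables, such as hard work, family issues, and health issues.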
What is stepwise regression modeling?
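In stepwise regression modeling, the analyst first builds the model with the most
obvious input variable. The analyst may then add the remaining inputs one after the
other, based on their significance and the extent to which they affect the target
variable.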
For example, suppose vegetable prices have increased in a certain area. The cause could
be anything from natural calamities to problems in transport and supply chain
management. When an analyst decides to plot the data on a graph, he will start with the
most obvious cause, such as heavy rainfall in the agricultural regions.
Once the model is built, he can then add the remaining input variables to the model
based on their occurrence and significance.
EXAMPLE 13–1
Find the least squares regression line for the data on incomes and food
expenditures of the seven households given in Table 13.1. Use income as the
independent variable and food expenditure as the dependent variable.
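A minimal Python sketch of the kind of computation this example asks for is given here. The income and food expenditure values are hypothetical placeholders, not the data of Table 13.1, and the function name is an assumption. The line is computed with the sum-of-squares shortcut formulas, where b = SSxy / SSxx and a is the mean of y minus b times the mean of x.

```python
def least_squares_line(x, y):
    """Return (a, b) for the fitted line y-hat = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)

    # Shortcut formulas for the sums of squares.
    ss_xy = sum_xy - sum_x * sum_y / n
    ss_xx = sum_x2 - sum_x ** 2 / n

    b = ss_xy / ss_xx                  # slope
    a = sum_y / n - b * (sum_x / n)    # intercept: mean(y) - b * mean(x)
    return a, b

# Hypothetical data for seven households (NOT the values in Table 13.1).
income = [30, 40, 50, 60, 70, 80, 90]
food_expenditure = [9, 12, 13, 16, 17, 21, 24]

a, b = least_squares_line(income, food_expenditure)
print(f"y-hat = {a:.4f} + {b:.4f} x")  # prints: y-hat = 1.6429 + 0.2393 x
```

With these placeholder values the fitted line passes through the point of means (60, 16), as any least squares line does.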
N.B. SS is the sum of squares.
Solution
(a) Based on theory and intuition, we expect the insurance premium to
depend on driving experience. Consequently, the insurance premium is a
dependent variable and driving experience is an independent variable in the
regression model. A new driver is considered a high risk by the insurance
companies, and he or she has to pay a higher premium for auto insurance. On
average, the insurance premium is expected to decrease with an increase in
the years of driving experience. Therefore, we expect a negative relationship
between these two variables. In other words, both the population correlation
coefficient and the population regression slope B are expected to be
negative.
(b) Table 13.5 shows the calculation of Σx, Σy, Σxy, Σx², and Σy².