Simple and Multiple Linear Regression
Linear Regression
Regression analysis is used to create a model that describes the relationship between a dependent variable and one or more
independent variables. Depending on whether there are one or more independent variables, a distinction is made between simple
and multiple linear regression analysis.
In the case of a simple linear regression, the aim is to examine the influence of an independent variable on one dependent variable.
In the second case, a multiple linear regression, the influence of several independent variables on one dependent variable is
analyzed.
In linear regression, an important prerequisite is that the dependent variable has a metric scale of measurement (interval or ratio scale) and is normally distributed. If the dependent variable is categorical, a logistic regression is used instead. You can easily perform a regression analysis in the linear regression calculator here on DATAtab.
Figure: a dependent variable is explained by one or more independent variables.
In linear regression analysis, a straight line is drawn through the points in the scatter plot. The task of simple linear regression is to determine exactly the straight line that best describes the linear relationship between the dependent and the independent variable. To determine this straight line, linear regression uses the method of least squares.
The regression line can be written as ŷ = a + b·x, where a is the intercept, b is the slope (the regression coefficient) and ŷ is the respective estimate of the y-value. This means that for each x-value the corresponding y-value is estimated. In our example, this means that the height of a person is used to estimate his or her weight.
If all points (measured values) lay exactly on one straight line, the estimate would be perfect. However, this is almost never the case, so a straight line must be found that is as close as possible to the individual data points. The aim is thus to keep the error of the estimation as small as possible, so that the distance between the estimated value and the true value is minimal. This distance or error is called the "residual" and is abbreviated as "e" (for error).
When calculating the regression line, an attempt is made to determine the regression coefficients (a and b) so that the sum of the squared residuals is minimal (OLS, "ordinary least squares").
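As a small illustration of the least-squares idea, the following Python sketch fits a regression line with NumPy; the height and weight values are made-up example numbers, not data from DATAtab.

```python
import numpy as np

# Made-up example data: height in cm (independent), weight in kg (dependent)
height = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
weight = np.array([55.0, 61.0, 66.0, 72.0, 76.0, 84.0])

# Ordinary least squares fit of y = a + b*x (np.polyfit returns [slope, intercept])
b, a = np.polyfit(height, weight, deg=1)

y_hat = a + b * height       # estimated y-values
residuals = weight - y_hat   # residuals e = y - y_hat

print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print(f"sum of squared residuals = {np.sum(residuals ** 2):.3f}")
```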
The regression coefficient b can have different signs, which are interpreted as follows:
b > 0: there is a positive correlation between x and y (the greater x, the greater y)
b < 0: there is a negative correlation between x and y (the greater x, the smaller y)
b = 0: there is no correlation between x and y
Standardized regression coefficients are usually designated by the letter "beta". These values are comparable with each other because the unit of measurement of the variables no longer plays a role. The standardized regression coefficient (beta) is automatically output by DATAtab.
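For a simple regression, the standardized coefficient can also be obtained by hand as beta = b · sd(x) / sd(y), i.e. the slope after both variables have been z-standardized. A short sketch with the same made-up height/weight values as above:

```python
import numpy as np

# Made-up example data (height in cm, weight in kg)
x = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
y = np.array([55.0, 61.0, 66.0, 72.0, 76.0, 84.0])

b, a = np.polyfit(x, y, deg=1)

# Standardized coefficient: slope expressed in standard deviations instead of raw units
beta = b * np.std(x, ddof=1) / np.std(y, ddof=1)
print(f"b = {b:.3f} kg per cm, beta = {beta:.3f}")
```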
The standard estimation error is the standard deviation of the estimation error. This gives an impression of how much the prediction
differs from the correct value. Graphically interpreted, the standard estimation error is the dispersion of the observed values around
the regression line.
The coefficient of determination (R²) and the standard estimation error are used as quality measures for both simple and multiple linear regression.
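Both measures can be computed directly from the residuals. The sketch below uses the made-up data from above and takes the standard estimation error as sqrt(SSE / (n - k - 1)) with k = 1 predictor; conventions differ slightly between programs.

```python
import numpy as np

x = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
y = np.array([55.0, 61.0, 66.0, 72.0, 76.0, 84.0])

b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)

n, k = len(y), 1                      # sample size, number of predictors
sse = np.sum(residuals ** 2)          # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)     # total sum of squares

r_squared = 1 - sse / sst             # coefficient of determination R^2
see = np.sqrt(sse / (n - k - 1))      # standard estimation error

print(f"R^2 = {r_squared:.3f}, standard estimation error = {see:.3f}")
```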
Multiple linear regression is frequently used in empirical social research as well as in market research. In both areas it is of interest to
find out what influence different factors have on a variable. For example, what determinants influence a person's health or purchasing
behavior?
Marketing example:
For a video streaming service, you are asked to predict how many times a month a person streams videos. For this you receive a data set of past visitor data (age, income, gender, ...).
Medical example:
You want to find out which factors have an influence on the cholesterol level of patients. For this purpose, you analyze a patient
data set with cholesterol level, age, hours of sport per week and so on.
With k independent variables, the equation needed to calculate a multiple regression is ŷ = a + b1·x1 + b2·x2 + ... + bk·xk.
The coefficients can now be interpreted similarly to those of the simple linear regression equation. If all independent variables are 0, the resulting value is a. If an independent variable changes by one unit while all other variables are held constant, the associated coefficient indicates by how much the dependent variable changes. So if the independent variable xi increases by one unit, the dependent variable y increases by bi.
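A minimal sketch of such a multiple regression fit, here with NumPy's least-squares solver; the height, age, sex and weight values are invented purely for illustration (sex coded 0 = female, 1 = male).

```python
import numpy as np

# Made-up data: weight explained by height, age and sex
height = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0, 172.0, 168.0])
age    = np.array([ 25.0,  34.0,  41.0,  29.0,  52.0,  38.0,  45.0,  31.0])
sex    = np.array([  0.0,   0.0,   1.0,   0.0,   1.0,   1.0,   1.0,   0.0])
weight = np.array([ 54.0,  60.0,  72.0,  63.0,  85.0,  80.0,  77.0,  59.0])

# Design matrix with a leading column of ones for the intercept a
X = np.column_stack([np.ones_like(weight), height, age, sex])

# Least-squares solution: coefficients = [a, b1 (height), b2 (age), b3 (sex)]
coeffs, *_ = np.linalg.lstsq(X, weight, rcond=None)
a, b1, b2, b3 = coeffs
print(f"a = {a:.3f}, b_height = {b1:.3f}, b_age = {b2:.3f}, b_sex = {b3:.3f}")
```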
Linear regression is based on the following assumptions:
Linearity: There must be a linear relationship between the dependent and the independent variables.
Homoscedasticity: The residuals must have a constant variance.
Normality: The error (the residuals) must be normally distributed.
No multicollinearity: There must be no high correlation between the independent variables.
Linearity
In linear regression, a straight line is drawn through the data. This straight line should represent all points as well as possible. If the points are distributed in a non-linear way, the straight line cannot fulfill this task.
In the left graph, there is a linear relationship between the dependent and the independent variable, so the regression line can be drawn in a meaningful way. In the right graph, there is a clearly non-linear relationship between the dependent and the independent variable. Here it is not possible to put the regression line through the points in a meaningful way: the coefficients of the regression model cannot be interpreted meaningfully, and the prediction errors may be larger than expected.
Therefore it is important to check beforehand whether a linear relationship exists between the dependent variable and each of the independent variables. This is usually checked graphically.
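One simple way to do this graphical check is a scatter plot per independent variable. The sketch below simulates one roughly linear and one clearly curved relationship so that the difference described above becomes visible.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y_linear = 2.0 * x + rng.normal(0, 1.5, size=x.size)       # roughly linear relationship
y_curved = 0.5 * x ** 2 + rng.normal(0, 1.5, size=x.size)  # clearly non-linear relationship

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(x, y_linear)
ax1.set_title("linear: regression line fits")
ax2.scatter(x, y_curved)
ax2.set_title("non-linear: straight line misleading")
plt.tight_layout()
plt.show()
```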
Homoscedasticity
Since in practice the regression model never exactly predicts the dependent variable, there is always an error. This very error must
have a constant variance over the predicted range.
To test homoscedasticity, i.e. the constant variance of the residuals, the dependent variable is plotted on the x-axis and the error on the y-axis. The error should then scatter evenly over the entire range. If this is the case, homoscedasticity is present; if not, heteroscedasticity is present. In the case of heteroscedasticity, the error has different variances depending on the value range of the dependent variable.
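A common variant of this check plots the residuals against the predicted values. The sketch below simulates data whose error spread grows with x, so the residuals fan out and heteroscedasticity becomes visible.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 100)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5 * x)   # noise spread increases with x

b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x
residuals = y - y_hat

# Under homoscedasticity the residuals scatter evenly around zero over the whole range
plt.scatter(y_hat, residuals)
plt.axhline(0, color="black", linewidth=1)
plt.xlabel("predicted value")
plt.ylabel("residual")
plt.show()
```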
Normal distribution
Another assumption is that the error is normally distributed. This can be checked either with analytical tests (such as the Kolmogorov-Smirnov or Shapiro-Wilk test) or graphically. However, the analytical tests are used less and less, because they tend to attest a normal distribution for small samples and become significant very quickly for large samples, thus rejecting the null hypothesis that the data are normally distributed. Therefore, the graphical variant is increasingly used.
In the graphical variant, either the histogram of the residuals is examined or, even better, the so-called QQ-plot (quantile-quantile plot). The more closely the data lie on the line, the better the normal distribution.
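A QQ-plot of the residuals can be produced, for example, with scipy; the residuals below are simulated stand-ins for the residuals of a fitted regression model.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
residuals = rng.normal(0, 2, size=100)   # simulated residuals for illustration

# QQ-plot: sample quantiles of the residuals against theoretical normal quantiles;
# the closer the points lie to the line, the better the normality assumption holds
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```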
Multicollinearity
Multicollinearity means that two or more independent variables are strongly correlated with one another. The problem with
multicollinearity is that the effects of each independent variable cannot be clearly separated from one another.
If, for example, there is a high correlation between x1 and x2, then it is difficult to determine b1 and b2. If, say, both variables are completely identical, the regression model cannot tell how large b1 and how large b2 should be, and the model therefore becomes unstable.
This is of course not tragic if the regression model is only used for prediction; in that case one is only interested in the prediction itself, not in how great the influence of the respective variables is. However, if the regression model is to be used to measure the influence of the independent variables on the dependent variable, there must be no multicollinearity. If multicollinearity exists, the coefficients cannot be interpreted meaningfully.
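A common numerical check for multicollinearity (not described above, but widely used alongside the correlation matrix) is the variance inflation factor. The sketch below computes VIFs with statsmodels for simulated predictors, two of which are almost identical; a frequent rule of thumb treats VIF values above 5 or 10 as problematic.

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # almost a copy of x1 -> strong multicollinearity
x3 = rng.normal(size=200)                   # independent predictor

# Design matrix with an intercept column; variance_inflation_factor expects the full matrix
X = np.column_stack([np.ones(200), x1, x2, x3])

for idx, name in enumerate(["x1", "x2", "x3"], start=1):
    print(f"VIF({name}) = {variance_inflation_factor(X, idx):.2f}")
```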
To check whether the coefficient of determination R² differs from zero in the population, an F-test is used. It should be noted, however, that the requirements discussed in the previous section must be met. The F-value is calculated as F = (R² / k) / ((1 - R²) / (n - k - 1)).
The calculated F-value must now be compared with the critical F-value. If the calculated F-value is greater than the critical F-value, the null hypothesis is rejected and R² deviates from zero in the population. The critical F-value can be read from the F-distribution table; the numerator degrees of freedom are k and the denominator degrees of freedom are n - k - 1.
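As a worked illustration, the sketch below computes the F-value from R², n and k and looks up the critical value with scipy instead of a table. R² = 0.754 and k = 3 are taken from the example further down; n = 30 is a made-up sample size.

```python
from scipy import stats

# Hypothetical values: R^2 of the model, n observations, k independent variables
r_squared, n, k = 0.754, 30, 3

# F statistic for H0: R^2 = 0 in the population
f_value = (r_squared / k) / ((1 - r_squared) / (n - k - 1))

# Critical F-value at the 5% level with k and n-k-1 degrees of freedom
f_crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)

print(f"F = {f_value:.2f}, critical F = {f_crit:.2f}")
print("reject H0" if f_value > f_crit else "do not reject H0")
```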
For each individual regression coefficient, the null hypothesis that it equals zero in the population is tested with the statistic t = bj / sb_j, where bj is the j-th regression coefficient and sb_j is the standard error of bj. This test statistic is t-distributed with n - k - 1 degrees of freedom. The critical t-value can be read from the t-distribution table.
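The same comparison for a single coefficient, as a sketch: b_j = 0.297 corresponds to the age coefficient in the example further down, while its standard error, n and k are made-up values.

```python
from scipy import stats

# Hypothetical values: coefficient, its standard error, n observations, k predictors
b_j, se_bj, n, k = 0.297, 0.11, 30, 3

t_value = b_j / se_bj                     # test statistic for H0: b_j = 0

# Two-sided critical t-value at the 5% level with n-k-1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - k - 1)

print(f"t = {t_value:.2f}, critical t = {t_crit:.2f}")
print("reject H0" if abs(t_value) > t_crit else "do not reject H0")
```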
After you have copied your data into the statistics calculator, you must select the variables that are relevant for you. Then you receive
the results in table form.
Interpretation of the results
This table shows that 75.4% of the variation in weight can be explained by height, age and sex. When predicting a person's weight, the model is off by 6.587 on average (the standard estimation error). The regression equation results in:
The equation shows, for example, that if age increases by one year, the weight increases by 0.297 kg according to the model. In the case of the dichotomous variable sex, the coefficient is interpreted as a difference: according to the model, a man weighs 8.922 kg more than a woman. If all independent variables are zero, the result is a weight of -24.41.
The standardized coefficients beta are unit-independent and always range between -1 and +1. The greater the absolute value of beta, the greater the contribution of the respective independent variable to explaining the dependent variable. In this regression analysis, the variable age has the greatest influence on the variable weight.
The calculated coefficients refer to the sample used for the regression analysis, so it is of interest whether the B-values deviate from zero only by chance or whether they also differ from zero in the population. For this purpose, the null hypothesis is formulated that the respective calculated B-value is equal to zero in the population. If this is the case, it means that the respective independent variable has no significant influence on the dependent variable.
The significance value (p-value) indicates whether a variable has a significant influence. Significance values smaller than 0.05 are considered significant. In this example, only age can be considered a significant predictor of a person's weight.