
Subject: Probability and Statistics

BSCS — 3-A
Department of Computer Science
Bahria University, Lahore Campus

Group 4
Linear Regression
Formulas for regression line
Examples

Linear Regression
Introduction
The term “regression” and the methods for investigating the relationship between two variables date back more than a century. The term was introduced by Francis Galton, the renowned British scientist, in the 1880s during his studies of heredity.

Linear regression is a basic and commonly used type of predictive analysis. The overall idea of regression
is to examine two things:

(I) Does a set of predictor variables do a good job of predicting an outcome (dependent) variable?
(II) Which variables are significant predictors of the outcome variable, and how do they (as indicated by the magnitude and sign of the beta estimates) impact the outcome variable?

It is a statistical method used to model the relationship between a dependent variable (often denoted as
"y") and one or more independent variables (often denoted as "x"). It assumes that there is a linear
relationship between the independent variable(s) and the dependent variable.

The goal of linear regression is to find the equation of a straight line that best fits the data points. This
equation is typically represented as:

y = β0 + β1x + ε

Where:

 y is the dependent variable.
 x is the independent variable.
 β1 is the gradient or slope of the line, indicating the relationship between x and y.
 β0 is the y-intercept, the value of y when x is zero.
 ε is a random error term. In simple linear regression it is usually assumed that ε is normally distributed with E(ε) = 0 and constant variance Var(ε) = σ².

Multiple Linear Regression


In the case of multiple independent variables, the equation becomes a linear combination:

y = β0 + β1x1 + … + βpxp + ε

Where:

 y is the dependent variable.
 x1, x2, …, xp are the independent variables.
 β0, β1, β2, …, βp are the regression coefficients.
 The error ε follows a normal distribution with E(ε) = 0 and constant variance Var(ε) = σ².
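In Python, the coefficients of this linear combination can be estimated in one call with NumPy's least-squares routine. This is a minimal sketch with hypothetical data chosen purely for illustration:

import numpy as np

# Hypothetical data: each row is [1, x1, x2]; the leading 1 carries the intercept b0.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 1.0],
              [1.0, 6.0, 2.0],
              [1.0, 8.0, 5.0]])
y = np.array([10.0, 14.0, 19.0, 29.0])
# Least-squares estimates of (b0, b1, b2) in one call.
coeffs, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coeffs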

Simple linear regression investigates the linear relationship between one dependent variable and one independent variable, while multiple linear regression focuses on the linear relationship between one dependent variable and more than one independent variable. Multiple linear regression also involves issues that do not arise in the simple case, such as collinearity, variance inflation, and the graphical display of regression outliers and influential observations.

Non-Linear Regression
The non-linear regression model (growth model) may be written as

y = α / (1 + e^(βt)) + ε
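As a quick illustration of this curve (the values of α and β below are arbitrary choices, not fitted estimates), the model can be evaluated in Python as follows:

import numpy as np

# Evaluate the growth curve y = alpha / (1 + e^(beta*t)) on a grid of t values.
alpha, beta = 10.0, -0.5
t = np.linspace(0.0, 10.0, 6)
y = alpha / (1.0 + np.exp(beta * t))   # rises from alpha/2 toward alpha as t grows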

Linear regression aims to find the best-fitting line by minimizing the difference between the actual
values and the predicted values (often done through a method called ordinary least squares). This
method calculates the coefficients (slope and intercept) that minimize the sum of the squared
differences between the actual and predicted values.
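In symbols, ordinary least squares chooses β0 and β1 to minimize the sum of squared errors SSE(β0, β1) = Σ (yi − β0 − β1xi)²; the slope and intercept formulas given in the following sections are precisely the values that make this sum smallest.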

Properties of Linear Regression


For the regression line where the regression parameters β0 and β1 are defined, the following properties
are applicable:
 The regression line minimizes the sum of squared differences between the observed and predicted values.
 The regression line passes through the point of means (x̄, ȳ) of the X and Y values.
 The regression constant β0 is the y-intercept of the regression line.
 The regression coefficient β1 is the slope of the regression line. Its value equals the average change in the dependent variable (Y) for a unit change in the independent variable (X).
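A useful consequence of the second property is that the intercept can always be recovered from the slope and the two means as β0 = ȳ − β1x̄; this identity is used in the worked examples at the end of this document.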

Regression Coefficient
The regression coefficient appears in the equation of the regression line:

y = β0 + β1x

Where:

 β0 is a constant.
 β1 is the regression coefficient.

Formula

Given below is the formula to find the value of the regression coefficient.

β1 = b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

Where:

 xi and yi are the observed data points.
 x̄ and ȳ are the means of the x and y values.
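These formulas translate directly into a few lines of Python. The following is a minimal sketch assuming NumPy is available; the function name fit_simple_ols is our own, not a library routine:

import numpy as np

def fit_simple_ols(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Slope: sum of cross-deviations over sum of squared x-deviations.
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # Intercept: the line passes through the point of means.
    b0 = y.mean() - b1 * x.mean()
    return b0, b1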

Types of Linear Regression


Linear regression can be categorized into several types based on the number of independent
variables and the nature of the relationship:

 Simple Linear Regression:

This involves a single independent variable (interval, ratio, or dichotomous) used to predict the dependent variable (interval or ratio). The equation takes the form y = mx + b, where there is one predictor variable x.

Example: Consider a scenario where you want to predict the price of a house based on
its area (in square feet). You collect data on house prices and their corresponding areas. The
simple linear regression model would look like this:

House Price = β0 + β1 × Area

Here, the dependent variable is the house price, and the independent variable is the area.
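Using the fit_simple_ols sketch from the Formula section above, the house-price model could be fitted as follows (the areas and prices here are hypothetical numbers, purely for demonstration):

# Hypothetical areas (square feet) and prices.
areas  = [1000.0, 1500.0, 2000.0, 2500.0]
prices = [150000.0, 210000.0, 270000.0, 330000.0]
b0, b1 = fit_simple_ols(areas, prices)
predicted_price = b0 + b1 * 1800.0   # estimated price of an 1800 sq ft house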

 Multiple Linear Regression:

In this type, there are multiple independent variables used to predict the dependent variable. The equation becomes y = b + m1x1 + m2x2 + … + mnxn, where there are n predictor variables x1, x2, …, xn.

Example: Imagine you are predicting a student's final exam score based on several
factors such as hours studied, previous test scores, and attendance. The multiple linear
regression equation would be:

Final Exam Score = β0 + β1 × Hours Studied + β2 × Previous Test Score+ β3 × Attendance.

In this case, there are three independent variables: hours studied, previous test score, and
attendance.
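A minimal Python sketch of this model, with hypothetical student records chosen purely for illustration (assuming NumPy), looks like this:

import numpy as np

# Hypothetical records: [1, hours studied, previous test score, attendance %].
X = np.array([[1.0, 2.0, 60.0, 70.0],
              [1.0, 5.0, 75.0, 90.0],
              [1.0, 8.0, 80.0, 95.0],
              [1.0, 3.0, 65.0, 80.0],
              [1.0, 6.0, 70.0, 85.0]])
final_scores = np.array([55.0, 78.0, 90.0, 62.0, 75.0])
beta, *_ = np.linalg.lstsq(X, final_scores, rcond=None)   # beta = [b0, b1, b2, b3]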

 Polynomial Regression:

It is a form of linear regression in which the relationship between the independent variable and the dependent variable is modeled as an nth-degree polynomial. For instance, y = b + m1x + m2x² + m3x³ + ….

Example: Let's say you are studying the relationship between temperature and air
conditioning electricity consumption. You suspect the relationship might not be linear but could
be better represented by a quadratic equation. The polynomial regression equation might look
like:

Electricity Consumption = β0 + β1 × Temperature + β2 × Temperature²

Here, the quadratic term Temperature² allows for a curved relationship between temperature
and electricity consumption.
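A quadratic fit of this kind is one line in NumPy. The temperatures and consumption figures below are hypothetical, purely for illustration:

import numpy as np

# Hypothetical temperatures (°C) and electricity consumption (kWh).
temp = np.array([15.0, 20.0, 25.0, 30.0, 35.0])
kwh  = np.array([30.0, 28.0, 35.0, 50.0, 72.0])
b2, b1, b0 = np.polyfit(temp, kwh, 2)         # np.polyfit returns highest degree first
predicted = b0 + b1 * 28.0 + b2 * 28.0 ** 2   # consumption estimate at 28 °C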

 Discriminant Regression:
Discriminant regression, sometimes referred to as discriminant least squares regression
or linear discriminant analysis with regression, represents an approach that combines elements
of both discriminant analysis and regression techniques.

Example: Let's say you want to predict whether a patient has a specific disease based on
their age, blood pressure, and cholesterol levels. Instead of using traditional logistic regression
(which is specifically designed for classification tasks), you fit a linear regression model that
estimates the probability of having the disease from the predictor variables.

 Logistic Regression:

Despite the name, logistic regression is a type of regression used for classification
problems rather than regression problems. It models the probability of a binary outcome by
using a logistic function. It is called "regression" because it estimates the relationship between
one dependent binary variable and one or more independent variables.

Example: Suppose you are predicting whether a customer will buy a product based on
their age and income. The logistic regression equation would estimate the probability of buying
the product given their age and income:

Probability of Buying = 1 / (1 + e^−(β0 + β1 × Age + β2 × Income))
Here, the dependent variable is binary (bought or did not buy), and the independent variables
are age and income.
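Evaluating this formula is straightforward in Python. The coefficients below are assumed values for illustration, not fitted estimates:

import math

# Assumed (not fitted) coefficients, purely to illustrate the formula.
b0, b1, b2 = -8.0, 0.05, 0.0001
age, income = 40.0, 50000.0
linear_part = b0 + b1 * age + b2 * income
p_buy = 1.0 / (1.0 + math.exp(-linear_part))   # about 0.27 for these inputs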

These variations and types accommodate different scenarios and complexities within datasets, allowing
for more flexible modeling and analysis of relationships between variables.

Uses of Linear Regression


Three major uses for regression analysis are:

1. Determining the Strength of Predictors:

Regression can be used to identify the strength of the effect that the independent
variable(s) have on a dependent variable. Typical questions concern the strength of the
relationship between dose and effect, between sales and marketing spending, or
between age and income.

2. Forecasting an Effect:

Regression can be used to forecast effects or the impact of changes; that is, regression
analysis helps us understand how much the dependent variable changes when one or
more independent variables change. A typical question is, “How much additional sales
income do I get for each additional $1,000 spent on marketing?”
3. Trend Forecasting:

Regression analysis can also predict trends and future values, providing point estimates.
A typical question is, “What will the price of gold be in six months?”

Examples
1. Parent’s Height and Children’s Height
The following classical data set contains information on parents' heights and children's heights.

Parent (x) 64.5 65.5 66.5 67.5 68.5 69.5 70.5 71.5 72.5
Children (y) 65.8 66.7 67.2 67.6 68.2 68.9 69.5 69.9 72.2

First, calculate the means of x and y:

x̄ = (64.5 + 65.5 + 66.5 + 67.5 + 68.5 + 69.5 + 70.5 + 71.5 + 72.5) / 9 = 68.5

ȳ = (65.8 + 66.7 + 67.2 + 67.6 + 68.2 + 68.9 + 69.5 + 69.9 + 72.2) / 9 ≈ 68.44

Now, calculate the slope using the formula β1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)², with the sums running over the nine data points:

β1 = [(64.5 − 68.5)(65.8 − 68.44) + … + (72.5 − 68.5)(72.2 − 68.44)] / [(64.5 − 68.5)² + … + (72.5 − 68.5)²]

β1 = (10.56 + … + 15.04) / (16 + 9 + 4 + 1 + 0 + 1 + 4 + 9 + 16)

β1 = 41.1 / 60 ≈ 0.685

Next, find the y-intercept β0 using the formula

β0 = ȳ − β1 × x̄

β0 = 68.44 − 0.685 × 68.5

β0 ≈ 21.52

Therefore, the linear regression equation for the relationship between parents' and children's heights is approximately:

Children's Height ≈ 21.52 + 0.685 × Parent's Height


This equation represents the best-fit line that models the relationship between parent heights and
children's heights based on the provided data.
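As a quick check of the hand calculation, the same estimates can be computed in Python (assuming NumPy) directly from the data in the table:

import numpy as np

parent   = np.array([64.5, 65.5, 66.5, 67.5, 68.5, 69.5, 70.5, 71.5, 72.5])
children = np.array([65.8, 66.7, 67.2, 67.6, 68.2, 68.9, 69.5, 69.9, 72.2])
b1 = np.sum((parent - parent.mean()) * (children - children.mean())) \
     / np.sum((parent - parent.mean()) ** 2)
b0 = children.mean() - b1 * parent.mean()
print(b0, b1)   # approximately 21.52 and 0.685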

2. Hours Studied and Test Score


Suppose we have a dataset that contains the number of hours students study and their corresponding scores on a test.

Here is a small set of data:

Hours Studied (x) 2 3 4 5 6
Test Scores (y) 60 70 75 80 85

First, calculate the means of x and y:

x̄ = (2 + 3 + 4 + 5 + 6) / 5 = 4

ȳ = (60 + 70 + 75 + 80 + 85) / 5 = 74
Now, calculate the slope β1:

β1 = [(2 − 4)(60 − 74) + … + (6 − 4)(85 − 74)] / [(2 − 4)² + … + (6 − 4)²]

β1 = (28 + 4 + 0 + 6 + 22) / (4 + 1 + 0 + 1 + 4)

β1 = 60 / 10 = 6
Next, find the y-intercept β0:

β0 = 74 − 6 × 4

β0 = 50

Therefore, the linear regression equation is

Test Score = 50 + 6 × Hours Studied


This equation allows us to predict test scores based on the number of hours studied.
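The same fit can be verified in one line with NumPy's polynomial routine (a degree-1 fit returns the slope first, then the intercept):

import numpy as np

hours  = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
scores = np.array([60.0, 70.0, 75.0, 80.0, 85.0])
b1, b0 = np.polyfit(hours, scores, 1)   # degree-1 least-squares fit
print(b0, b1)   # 50.0 and 6.0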
